Before We Start
After completing the previous guides, you should have:
- Traces flowing into Phoenix
- At least one evaluation defined and logged
Step 1: Create a Dataset from Failed Traces
We’ll start by grouping together traces that didn’t perform well. Datasets let us collect a specific set of traces so we can analyze them together and reuse them later for testing. In this guide, we’ll create a dataset from traces whose completeness evaluation was labeled incomplete. This gives us a concrete set of failures to focus on and makes it easier to test whether future changes actually fix them. You can create datasets in code, but for this walkthrough we’ll use the Phoenix UI. If you’d like to create datasets programmatically, you can follow the Create Datasets guide; a minimal code sketch also follows the steps below.
Create a Dataset in the UI
- Navigate to your project in Phoenix.
- Filter your traces by the incomplete evaluation label: `evals['completeness'].label == 'incomplete'`
- Select the traces you want to include.
- Click Create Dataset and give it a name.
- Add the selected traces to the dataset you just created.
Create a New Dataset & add Examples
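If you prefer the programmatic route mentioned above, here is a minimal sketch using the Phoenix Python client. It assumes a running Phoenix instance and an eval named completeness; the project name, dataset name, and span column names are assumptions you should adjust to match your setup.

```python
import phoenix as px

client = px.Client()

# Pull spans whose completeness eval was labeled "incomplete".
# The filter string matches the one used in the UI above.
failed_spans_df = client.get_spans_dataframe(
    "evals['completeness'].label == 'incomplete'",
    project_name="financial-research-agent",  # hypothetical project name
)

# Upload the failing spans as a dataset of examples.
# Column names are assumptions; inspect failed_spans_df.columns to confirm.
dataset = client.upload_dataset(
    dataset_name="incomplete-financial-reports",  # hypothetical dataset name
    dataframe=failed_spans_df,
    input_keys=["attributes.input.value"],
    output_keys=["attributes.output.value"],
)
```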
Step 2: Save a Prompt from a Trace
We’ll start from a real prompt that was actually used by the application. Traces capture the exact prompts sent to the model, along with their context and outputs. Saving a prompt from a trace lets us iterate on something real, rather than starting from a blank page.
Save a Prompt from the Trace View
- Navigate to your project in Phoenix.
- Open the Traces view and click into a trace.
- Find a span that contains a prompt.
- Save the prompt to the Prompt Hub.
Save the Financial Research Agent Prompt
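Prompts can also be saved to the Prompt Hub from code. The sketch below assumes the arize-phoenix-client package and a prompt copied from your trace; the prompt name, content, and model name are placeholders.

```python
from phoenix.client import Client
from phoenix.client.types import PromptVersion

# Point this at your Phoenix server.
client = Client(base_url="http://localhost:6006")

# Paste the prompt text copied from the trace's LLM span here.
prompt_content = "You are a financial research agent. ..."  # placeholder

prompt = client.prompts.create(
    name="financial-research-agent",  # hypothetical prompt name
    prompt_description="Prompt saved from a traced agent run",
    version=PromptVersion(
        [{"role": "system", "content": prompt_content}],
        model_name="gpt-4o-mini",  # placeholder model
    ),
)
```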
Step 3: Run the Prompt in Prompt Playground
Next, we’ll bring that saved prompt into the Prompt Playground. The playground lets us run prompts against a dataset of inputs so we can see how a prompt behaves across many examples, not just one.
Run the Prompt Against a Dataset
- Navigate to the Prompt Playground.
- Select the prompt you just saved from the Prompt Hub.
- Choose the dataset you just created.
- Modify the User Prompt to accept the inputs of the dataset. The start of the User Prompt should reference the dataset input, as in the sketch shown below.
- Run the prompt across the dataset.
Run the Saved Prompt on your Dataset of Failed Traces
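The Prompt Playground references dataset inputs with mustache-style {{variable}} templates, where the variable name must match one of your dataset’s input keys. The snippet below is only a hypothetical illustration of how the start of the User Prompt might look; swap {{input}} for whatever key your dataset actually uses.

```
{{input}}

...rest of the original user prompt...
```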
Step 4: Create and Save a New Prompt Variant
Now that we have a baseline, we can make a change. In this step, we’ll modify the prompt in the playground to address issues we saw in previous runs, such as unclear instructions or missing constraints. To understand why our evaluations arrived at a particular score, click into a trace; under the annotations column you can see the explanations for each evaluation. These explanations show that many runs were labeled incomplete because the report lacked financial ratios, so we’ll add that requirement to our prompt.
Add a New Prompt Variant
- Update the prompt directly in the playground. Add this line in: “Make sure to include more than 1 financial ratio (such as P/E or P/B).”
- Run it to preview how outputs change.
- Save the new version as a separate prompt in the Prompt Hub.
Update Prompt & Run a Comparison in Prompt Playground
Step 5: Compare Prompts Using Experiments
Once we have multiple prompt versions, we want to compare them in a structured way to see how each one performs. Since you just ran the Prompt Playground with both of your prompts, you can see them side by side in the experiment view. In this step, we’ll navigate to the experiments page and look at the runs we just made with the Prompt Playground in this guide.
- Navigate to the Datasets page and click on the dataset we made earlier in this guide.
- You should see three experiment runs: the first is the result of Step 3, and the two most recent are from our new prompt comparison run.
- Click on the second run, and at the top of the page, under Comparison, select the #3 experiment.
View Prompt Playground Runs through Experiment View
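If you later want to run these comparisons from code instead of the playground, Phoenix’s experiments API supports the same workflow. Below is a minimal sketch, assuming the phoenix.experiments API and an OpenAI-backed task; the dataset name, prompt text, and model are placeholders you would replace with your saved prompt versions.

```python
import phoenix as px
from phoenix.experiments import run_experiment
from openai import OpenAI

openai_client = OpenAI()

# Fetch the dataset of failed traces created earlier (hypothetical name).
dataset = px.Client().get_dataset(name="incomplete-financial-reports")

def task(example):
    # Swap in your saved prompt text (baseline or the new variant) here.
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "You are a financial research agent."},
            {"role": "user", "content": str(example.input)},
        ],
    )
    return response.choices[0].message.content

experiment = run_experiment(
    dataset,
    task,
    experiment_name="prompt-variant-with-ratios",  # hypothetical name
)
```

Each call to run_experiment logs a new experiment against the dataset, so running it once per prompt version gives you the same side-by-side comparison in the experiment view.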

