Before We Start
To follow along, you should already have:
- Traces & Evals attached to a project in Phoenix
- A dataset created from previous runs, such as failed traces
Step 1: Use Explanations to Identify Improvements
We’ll be using our dataset to group our application failures together – the next step is deciding which issues to fix. Using the explanations that are part of the evals we ran previously, along with the trace context, we can understand why these runs failed. Looking at the traces in this dataset, you might notice patterns such as unclear instructions, missing constraints, or outputs that don’t follow the expected structure. The easiest way to see these is to go back into the trace view for these failed runs and read the explanations for why each was labeled an “incomplete” answer. In this example, we’ll improve the agent by strengthening the agent’s backstory instructions so the model has clearer guidance on what a good response looks like.
Update the Agent Instructions
Below is an example of tightening the agent backstory to be more explicit about the expected output.
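As a minimal sketch (assuming a CrewAI agent defined in code; the role, goal, and backstory text below are illustrative placeholders, not the exact agent from your project), the tightened backstory might look like:

```python
from crewai import Agent

# Hypothetical research agent; the original backstory left the expected
# output format implicit, which led to "incomplete" answers.
researcher = Agent(
    role="Research Analyst",
    goal="Answer user questions with well-sourced, complete responses",
    backstory=(
        "You are a meticulous research analyst. Every answer you produce must: "
        "(1) directly address each part of the question, "
        "(2) cite the specific sources or tool outputs you relied on, and "
        "(3) end with a short 2-3 sentence summary. "
        "If information is missing, say so explicitly instead of guessing."
    ),
    verbose=True,
)
```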
Step 2: Define an Experiment
Now that we’ve updated the agent, we will run this new Crew to test whether the changes actually improve quality. Experiments in Phoenix let you rerun the same inputs through different versions of your application and compare the results side by side. This helps ensure that improvements are measured, not assumed. To define an experiment, we need to specify the following (a sketch of both appears after this list):
- The experiment task: a function or process that takes each example from a dataset and produces an output, typically by running your application logic or model on the input.
- The experiment evaluation: essentially the same as a regular evaluation, but specifically assessing the quality of a task’s output, often by comparing it to an expected result or applying a scoring metric.
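Here is a sketch of both pieces, assuming the updated Crew is available as a variable named `crew` and that each dataset example stores the original user question under an input key called "question" (both names are assumptions about your setup):

```python
# Hypothetical experiment task: re-run each dataset example through the updated Crew.
# Phoenix passes the example's input dict to a parameter named `input`.
def task(input):
    result = crew.kickoff(inputs={"question": input["question"]})
    return str(result)

# Hypothetical evaluator: a simple completeness check on the task output.
# In practice this could be the same LLM-as-judge eval used to label the failed runs.
def contains_summary(output: str) -> bool:
    return "summary" in output.lower()
```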
Step 3: Run the Experiment on the Dataset
Next, we’ll pull down the dataset we created earlier and run the experiment on it. This ensures we’re testing the new version of the agent on the exact same inputs that previously failed.
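A minimal sketch of this step, assuming the dataset was saved in Phoenix under the name "failed-runs" (the name is an assumption) and reusing the `task` and `contains_summary` functions sketched above:

```python
import phoenix as px
from phoenix.experiments import run_experiment

# Pull down the dataset of previously failed runs by name.
dataset = px.Client().get_dataset(name="failed-runs")

# Re-run every example through the updated Crew and score the outputs.
experiment = run_experiment(
    dataset,
    task,
    evaluators=[contains_summary],
    experiment_name="improved-agent-backstory",
)
```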
Step 4: View Experiment Results in Phoenix
Head back to Phoenix and open the Experiments view. Here, you can see:
- The original runs compared against the new ones under ‘reference output’
- New application runs produced as the result of our task
- Evaluation results for each version

