In this guide, we’ll run experiments in Phoenix to systematically improve an application. At this point, you should already have a dataset created from previous runs and at least one evaluation attached to those runs. Experiments let you rerun the same dataset through an updated version of your application and compare results side by side.

Before We Start

To follow along, you should already have:
  • Traces & Evals attached to a project in Phoenix
  • A dataset created from previous runs, such as failed traces
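If you are working in a fresh environment, also make sure this notebook points at the Phoenix instance that holds those traces and the dataset. Below is a minimal setup sketch, assuming the Phoenix clients pick up the standard environment variables; the endpoint and API key values are placeholders.
import os

# Placeholders: use the endpoint (and API key, if auth is enabled) of your Phoenix instance.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"
# os.environ["PHOENIX_API_KEY"] = "your-api-key"
# The Phoenix clients used later in this guide should read these variables automatically.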

Step 1: Use Explanations to Identify Improvements

We’ll be using our dataset to group our application failures together; the next step is deciding which issues to fix. Using the explanations that are part of the evals we ran previously, along with the trace context, we can understand why these runs failed. Looking at the traces in this dataset, you might notice patterns such as unclear instructions, missing constraints, or outputs that don’t follow the expected structure. The easiest way to see these is to go back into the trace view for the failed runs and read the explanations for why each was labeled as an “incomplete” answer. In this example, we’ll improve the agent by strengthening the agent instructions so the model has clearer guidance on what a good response looks like.
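If you prefer to pull these explanations programmatically instead of reading them in the trace view, a sketch like the one below can work. It assumes the legacy px.Client() export API (get_evaluations) is available, that the eval was logged under the name “completeness”, and that your project is named “crewai-financial-agent” (a placeholder); adjust the names to match your setup.
import phoenix as px

client = px.Client()

# Pull logged eval results (label, score, explanation) for the project and
# print the explanations for runs that were labeled "incomplete".
for evals in client.get_evaluations(project_name="crewai-financial-agent"):
    if evals.eval_name == "completeness":
        df = evals.dataframe
        incomplete = df[df["label"] == "incomplete"]
        print(incomplete["explanation"].head())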

Update the Agent Instructions

Below is an example of tightening the agents’ goal and backstory instructions to be more explicit about the expected output.
researcher = Agent(
    role="Financial Research Analyst",
    goal="""Gather up-to-date financial data, trends, and news for the target companies/markets.
    Make sure to include more than one financial ratio (such as P/E or P/B), news from the last 6 months, and current stock price or performance data.""",
    backstory="""
        You are a Senior Financial Research Analyst.
    """,
    verbose=True,
    allow_delegation=False,
    max_iter=2,
    tools=[search_tool],
)

writer = Agent(
    role="Financial Report Writer",
    goal="Compile and summarize financial research into clear, actionable insights. If there are multiple tickers, make sure to include a dedicated comparison section.",
    backstory="""
        You are an experienced financial content writer.
    """,
    verbose=True,
    allow_delegation=True,
    max_iter=1
)
Create a new crew with the updated agents.
updated_crew = Crew(
    agents=[researcher, writer],
    tasks=[task1, task2],
    verbose=True,
    process=Process.sequential,
)
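Before wiring the new crew into an experiment, you can optionally smoke-test it on a single hand-written input. The “query” key below is a placeholder; use whatever input variables your tasks actually expect.
# Optional smoke test of the updated crew (placeholder input keys).
sample_result = updated_crew.kickoff(inputs={"query": "Compare AAPL and MSFT performance"})
print(sample_result)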
At this point, we’ve made a targeted change based on the explanations for why traces were classified as real failures.

Step 2: Define an Experiment

Now that we’ve updated the agent, we will run this new Crew to test whether the changes actually improve quality. Experiments in Phoenix let you rerun the same inputs through different versions of your application and compare the results side by side. This helps ensure that improvements are measured, not assumed. To define an experiment, we need to specify:
  • The experiment task: a function or process that takes each example from a dataset and produces an output, typically by running your application logic or model on the input.
  • The experiment evaluation: essentially the same as a regular evaluation, but it specifically assesses the quality of a task’s output, often by comparing it to an expected result or applying a scoring metric.
In this guide, the task for the experiment is simply to rerun the agent with the updated instructions. Since we are re-running our agent system on the same inputs and getting new outputs, we will rerun the same evaluation so we can compare results directly.
def my_task(example):
    # Run the updated crew on this dataset example's input.
    result = updated_crew.kickoff(inputs=example.input)
    # Return the output as a plain string so it can be stored and evaluated.
    return str(result)
from phoenix.evals import ClassificationEvaluator

completeness_evaluator = ClassificationEvaluator(
    name="completeness",
    prompt_template=financial_completeness_template,
    llm=llm,
    choices={"complete": 1.0, "incomplete": 0.0},
)

def completeness(input, output):
    # Map the experiment's input and output onto the template variables the evaluator expects.
    results = completeness_evaluator.evaluate(
        {"attributes.input.value": input, "attributes.output.value": output}
    )
    # Return the label ("complete" or "incomplete") for this run.
    return results[0].label

evaluators = [completeness]
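To sanity-check the evaluator before attaching it to the experiment, you can call it directly on a toy input/output pair; the strings below are purely illustrative.
# Quick sanity check of the evaluator on an illustrative pair; a thin answer
# like this one should come back labeled "incomplete".
print(completeness(
    input="Summarize AAPL's recent performance with key ratios.",
    output="AAPL is a technology company.",
))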

Step 3: Run the Experiment on the Dataset

Next, we’ll pull down the dataset we created earlier and run the experiment on it. This ensures we’re testing the new version of the agent on the exact same inputs that previously failed.
from phoenix.client import Client

dataset = Client().datasets.get_dataset(dataset="python quickstart fails")
Now we can run our experiment!
from phoenix.experiments import run_experiment

experiment = run_experiment(dataset, my_task, evaluators)
Once this completes, Phoenix logs the experiment results automatically.
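If you later want to score this same experiment run with additional evaluators, without rerunning the task, recent arize-phoenix releases expose evaluate_experiment for that. A quick sketch, assuming that API is available in your version:
from phoenix.experiments import evaluate_experiment

# Attach evaluators to the already-completed experiment run
# (swap in any additional evaluators you define).
evaluate_experiment(experiment, evaluators)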

Step 4: View Experiment Results in Phoenix

Head back to Phoenix and open the Experiments view. Here, you can see:
  • The original runs compared against the new ones under ‘reference output’
  • The new application runs produced by our task
  • Evaluation results for each version
In this example, we should see more runs receiving a “complete” label, indicating that the changes improved performance.

View Experiment Results

Congratulations! You’ve created your first dataset and run your first experiment in Phoenix.

Learn More About Datasets and Experiments

This was a simple example, but datasets and experiments can support much more advanced workflows. If you want to test prompt changes to a specific part of your application and keep track of different prompt versions, the Prompt Playground guide walks through how to do that. To go deeper with datasets and experiments, you can build datasets for specific user segments or edge cases, compare multiple prompt or model variants, and track quality improvements over time as your application evolves. The Datasets and Experiments section covers these patterns in more detail.