In this guide, we’ll run experiments in Phoenix to systematically improve an application. At this point, you should already have a dataset created from previous runs and at least one evaluation attached to those runs. Experiments let you rerun the same dataset through an updated version of your application and compare results side by side.

Before We Start

To follow along, you should already have:
  • Traces & Evals attached to a project in Phoenix
  • A dataset created from previous runs, such as failed traces
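If you are working in a fresh environment, also make sure this notebook points at the Phoenix instance that holds those traces and the dataset. Below is a minimal setup sketch, assuming the Phoenix clients pick up the standard environment variables; the endpoint and API key values are placeholders.
import os

# Placeholders: use the endpoint (and API key, if auth is enabled) of your Phoenix instance.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"
# os.environ["PHOENIX_API_KEY"] = "your-api-key"
# The Phoenix clients used later in this guide should read these variables automatically.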

Step 1: Use Explanations to Identify Improvements

We’ll be using our dataset to group our application failures together; the next step is deciding which issues to fix. Using the explanations that are part of the evals we ran previously, along with the trace context, we can understand why these runs failed. Looking at the traces in this dataset, you might notice patterns such as unclear instructions, missing constraints, or outputs that don’t follow the expected structure. The easiest way to see these is to go back into the trace view for the failed runs and read the explanations for why each was labeled as an “incomplete” answer. In this example, we’ll improve the agent by strengthening the agent instructions so the model has clearer guidance on what a good response looks like.
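If you prefer to pull these explanations programmatically instead of reading them in the trace view, a sketch like the one below can work. It assumes the legacy px.Client() export API (get_evaluations) is available, that the eval was logged under the name “completeness”, and that your project is named “crewai-financial-agent” (a placeholder); adjust the names to match your setup.
import phoenix as px

client = px.Client()

# Pull logged eval results (label, score, explanation) for the project and
# print the explanations for runs that were labeled "incomplete".
for evals in client.get_evaluations(project_name="crewai-financial-agent"):
    if evals.eval_name == "completeness":
        df = evals.dataframe
        incomplete = df[df["label"] == "incomplete"]
        print(incomplete["explanation"].head())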

Update the Agent Instructions

Below is an example of tightening the agents’ goal and backstory instructions to be more explicit about the expected output.
researcher = Agent(
    role="Financial Research Analyst",
    goal="""Gather up-to-date financial data, trends, and news for the target companies/markets.
    Make sure to include more than one financial ratio (such as P/E or P/B), news from the last 6 months, and current stock price or performance data.""",
    backstory="""
        You are a Senior Financial Research Analyst.
    """,
    verbose=True,
    allow_delegation=False,
    max_iter=2,
    tools=[search_tool],
)

writer = Agent(
    role="Financial Report Writer",
    goal="Compile and summarize financial research into clear, actionable insights. If there are multiple tickers, make sure to include a dedicated comparison section.",
    backstory="""
        You are an experienced financial content writer.
    """,
    verbose=True,
    allow_delegation=True,
    max_iter=1
)
Create a new crew with the updated agents.
updated_crew = Crew(
    agents=[researcher, writer],
    tasks=[task1, task2],
    verbose=True,
    process=Process.sequential,
)
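Before wiring the new crew into an experiment, you can optionally smoke-test it on a single hand-written input. The “query” key below is a placeholder; use whatever input variables your tasks actually expect.
# Optional smoke test of the updated crew (placeholder input keys).
sample_result = updated_crew.kickoff(inputs={"query": "Compare AAPL and MSFT performance"})
print(sample_result)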
At this point, we’ve made a targeted change based on the explanations for why traces were classified as real failures.

Step 2: Define an Experiment

Now that we’ve updated the agent, we will run this new Crew to test whether the changes actually improve quality. Experiments in Phoenix let you rerun the same inputs through different versions of your application and compare the results side by side. This helps ensure that improvements are measured, not assumed. To define an experiment, we need to specify:
  • The experiment task: a function or process that takes each example from a dataset and produces an output, typically by running your application logic or model on the input.
  • The experiment evaluation: essentially the same as a regular evaluation, but it specifically assesses the quality of a task’s output, often by comparing it to an expected result or applying a scoring metric.
In this guide, the task for the experiment is simply to rerun the agent with the updated instructions. Since we are re-running our agent system on the same inputs and getting new outputs, we will rerun the same evaluation so we can compare results directly.
def my_task(example):
    # Run the updated crew on this dataset example's input.
    result = updated_crew.kickoff(inputs=example.input)
    # Return the output as a plain string so it can be stored and evaluated.
    return str(result)
from phoenix.evals import ClassificationEvaluator

completeness_evaluator = ClassificationEvaluator(
    name="completeness",
    prompt_template=financial_completeness_template,
    llm=llm,
    choices={"complete": 1.0, "incomplete": 0.0},
)

def completeness(input, output):
    # Map the experiment's input and output onto the template variables the evaluator expects.
    results = completeness_evaluator.evaluate(
        {"attributes.input.value": input, "attributes.output.value": output}
    )
    # Return the label ("complete" or "incomplete") for this run.
    return results[0].label

evaluators = [completeness]
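To sanity-check the evaluator before attaching it to the experiment, you can call it directly on a toy input/output pair; the strings below are purely illustrative.
# Quick sanity check of the evaluator on an illustrative pair; a thin answer
# like this one should come back labeled "incomplete".
print(completeness(
    input="Summarize AAPL's recent performance with key ratios.",
    output="AAPL is a technology company.",
))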

Step 3: Run the Experiment on the Dataset

Next, we’ll pull down the dataset we created earlier and run the experiment on it. This ensures we’re testing the new version of the agent on the exact same inputs that previously failed.
from phoenix.client import Client

dataset = Client().datasets.get_dataset(dataset="python quickstart fails")
Now we can run our experiment!
from phoenix.experiments import run_experiment

experiment = run_experiment(dataset, my_task, evaluators)
Once this completes, Phoenix logs the experiment results automatically.
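If you later want to score this same experiment run with additional evaluators, without rerunning the task, recent arize-phoenix releases expose evaluate_experiment for that. A quick sketch, assuming that API is available in your version:
from phoenix.experiments import evaluate_experiment

# Attach evaluators to the already-completed experiment run
# (swap in any additional evaluators you define).
evaluate_experiment(experiment, evaluators)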

Step 4: View Experiment Results in Phoenix

Head back to Phoenix and open the Experiments view. Here, you can see:
  • The original runs compared against the new ones under ‘reference output’
  • The new application runs produced by our task
  • Evaluation results for each version
In this example, we should see more runs receiving a “complete” label, indicating that the changes improved performance.

View Experiment Results

Congratulations! You’ve created your first dataset and run your first experiment in Phoenix.

Learn More About Datasets and Experiments

This was a simple example, but datasets and experiments can support much more advanced workflows. If you want to test prompt changes to a specific part of your application and keep track of different prompt versions, the Prompt Playground guide walks through how to do that. To go deeper with datasets and experiments, you can build datasets for specific user segments or edge cases, compare multiple prompt or model variants, and track quality improvements over time as your application evolves. The Datasets and Experiments section covers these patterns in more detail.