In this guide, we’ll set up evaluations in Phoenix so we can measure the quality of model outputs from a real application. Traces tell us what happened during a run, but they don’t tell us whether the output was good. Evaluations fill that gap by letting us score outputs in a consistent, repeatable way. We’ll start with data that already exists in Phoenix, define a simple evaluation, and run it so we can see results directly in the UI. The goal is to move from “I have model outputs” to “I can measure quality in a repeatable way.” Since we already have traces, we can score them against metrics like correctness, relevance, or custom checks that matter to our use case.

Before We Start

To follow along, you’ll need to have completed Get Started with Tracing, which means you already have:
  • Financial Analysis and Research Chatbot
  • Trace Data in Phoenix

Step 1: Make Sure You Have Data in Phoenix

Before we can run evaluations, we need something to evaluate. Evaluations in Phoenix run over existing trace data. If you followed the tracing guide, you should already have:
  • A project in Phoenix
  • Traces containing LLM inputs and outputs
It’s best to have multiple traces so we can see how evaluation results vary from run to run. If needed, run your agent a few times with different inputs to generate more data.
test_queries = [
    {"tickers": "AAPL", "focus": "financial analysis and market outlook"},
    {"tickers": "NVDA", "focus": "valuation metrics and growth prospects"},
    {"tickers": "AMZN", "focus": "profitability and market share"},
    {"tickers": "AAPL, MSFT", "focus": "comparative financial analysis"},
    {"tickers": "META, SNAP, PINS", "focus": "social media sector trends"},
    {"tickers": "RIVN", "focus": "financial health and viability"},
    {"tickers": "SNOW", "focus": "revenue growth trajectory"},
    {"tickers": "KO", "focus": "dividend yield and stability"},
    {"tickers": "META", "focus": "latest developments and stock performance"},
    {"tickers": "AAPL, MSFT, GOOGL, AMZN, META", "focus": "big tech comparison and market outlook"},
    {"tickers": "AMC", "focus": "financial analysis and market sentiment"},
]

# Run the agent over each query; `crew` is the CrewAI crew defined in the
# tracing guide, and each kickoff produces a new trace in Phoenix.
for query in test_queries:
    crew.kickoff(inputs=query)
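If you want to confirm the new runs actually landed in Phoenix before moving on, one quick option is to pull the project’s spans with the Phoenix client (the same client and project name we use again in Step 3). The project name below assumes you followed the tracing guide; adjust it if yours differs.
from phoenix.client import Client

# Optional sanity check: confirm the runs above produced spans in the project.
spans_df = Client().spans.get_spans_dataframe(project_name="crewai-tracing-quickstart")
print(f"{len(spans_df)} spans in crewai-tracing-quickstart")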

Step 2: Define an Evaluation

Now that we have trace data, the next question is how we decide whether an output is actually good. An evaluation makes that decision explicit: instead of manually inspecting outputs or relying on intuition, we define a rule that Phoenix can apply consistently across many runs.

Evaluations in Phoenix can be written in different ways. In this guide, we’ll use an LLM-as-a-judge evaluation as a simple starting point. It works well for questions like correctness or relevance and lets us get metrics quickly. (If you’d rather use code-based evaluations, you can follow the guide on setting those up.) An LLM-as-a-judge evaluation requires defining three things:
  • A prompt that describes the judgment criteria
  • An LLM that performs the evaluation
  • The data we want to score
In this step, we’ll define a basic completeness evaluation that checks whether the agent’s output fully answers the input. Phoenix also provides pre-built evaluation templates that you can use or adapt for other metrics such as relevance or hallucination.

Define the Evaluation Prompt

We’ll start by defining the prompt that tells the evaluator how to judge an answer. The template references attributes.input.value and attributes.output.value because those are the columns where our span data stores each span’s input and output.
financial_completeness_template = """
You are evaluating whether a financial research report correctly completes ALL parts of the user's task with COMPREHENSIVE coverage.

User input: {attributes.input.value}

Generated report: {attributes.output.value}

To be marked as "complete", the report MUST meet ALL of these strict requirements:

1. TICKER COVERAGE (MANDATORY):
   - Cover ALL companies/tickers mentioned in the input
   - If multiple tickers are listed, EACH must have dedicated analysis (not just mentioned in passing)
   - For multiple tickers, the report must provide COMPARATIVE analysis when relevant

2. FOCUS AREA COVERAGE (MANDATORY):
   - Address ALL focus areas mentioned in the input
   - If the focus mentions multiple topics (e.g., "earnings and outlook"), BOTH must be thoroughly addressed
   - Each focus area must have substantial content, not just a brief mention

3. FINANCIAL DATA REQUIREMENTS (MANDATORY):
   - For EACH ticker, the report must include:
     * Current/recent stock price or performance data
     * At least 2 key financial ratios (P/E, P/B, debt-to-equity, ROE, etc.)
     * Revenue or earnings information
     * Recent news or developments (within last 6 months)
   - If focus mentions specific metrics (e.g., "P/E ratio"), those MUST be explicitly provided

4. DEPTH REQUIREMENT (MANDATORY):
   - Each ticker must have at least 3-4 sentences of dedicated analysis
   - Generic statements without specific data do NOT count
   - The report must demonstrate thorough research, not superficial coverage

5. COMPARISON REQUIREMENT (if multiple tickers):
   - If 2+ tickers are requested, the report MUST include direct comparisons
   - Comparisons should cover multiple key metrics side-by-side
   - Generic statements like "both companies are good" do NOT satisfy this requirement
   - Must explicitly state which company performs better/worse on specific metrics

The report is "incomplete" if it fails ANY of the above requirements, including:
- Missing any ticker or only mentioning it briefly
- Failing to address any focus area or only addressing it superficially
- Missing required financial data for any ticker
- Providing generic analysis without specific metrics or data
- Failing to provide comparisons when multiple tickers are requested
- Not meeting the depth requirement for any ticker

First, respond with ONLY one word: "complete" or "incomplete".
Then provide a detailed explanation of which specific requirements were met or failed.
"""
This prompt defines what completeness means for our application.

Define the LLM Judge

Next, we’ll define the model that will act as the judge.
from phoenix.evals import LLM

llm = LLM(model="gpt-4o", provider="openai")
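The judge calls OpenAI, so it needs credentials. If OPENAI_API_KEY isn’t already set in your environment, one way to provide it before running the evaluation is:
import os
from getpass import getpass

# Make sure the judge model can authenticate with OpenAI.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")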

Create the Evaluator

Now we can combine the prompt and model into an evaluator.
from phoenix.evals import create_classifier

completeness_evaluator = create_classifier(
    name="completeness",
    prompt_template=financial_completeness_template,
    llm=llm,
    choices={"complete": 1.0, "incomplete": 0.0},
)
At this point, we’ve defined how Phoenix should evaluate completeness, but we haven’t run it yet.

Step 3: Run the Evaluation

Next, we’ll pull our trace data from Phoenix and run the evaluator on it. First, fetch the spans we want to evaluate:
from phoenix.client import Client

px_client = Client()
df = px_client.spans.get_spans_dataframe(project_name="crewai-tracing-quickstart")
parent_spans = df[df["span_kind"] == "CHAIN"]
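A quick sanity check at this point is to confirm the filter actually returned spans and that the columns referenced in the prompt template are present:
# Confirm there is something to evaluate and the template's columns exist.
print(f"{len(parent_spans)} CHAIN spans to evaluate")
print("attributes.input.value" in parent_spans.columns,
      "attributes.output.value" in parent_spans.columns)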
Then run the evaluator over that data. We wrap the call in suppress_tracing() because auto-instrumentation is enabled and we don’t want to trace each of these OpenAI evaluation calls in our project.
from phoenix.evals import evaluate_dataframe
from phoenix.trace import suppress_tracing

with suppress_tracing():
    results_df = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[completeness_evaluator]
    )
This produces evaluation results for each span in the dataset.
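Before logging anything, it’s worth a quick look at what came back. The score and label columns are derived from the evaluator’s name, so inspecting the frame is the easiest way to see exactly what was produced:
# Inspect the evaluation output before sending it to Phoenix.
print(results_df.columns.tolist())
print(results_df.head())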

Step 4: Log Evaluation Results to Phoenix

Finally, we’ll log the evaluation results back to Phoenix so they show up alongside our traces in the UI. This is what makes evaluations useful beyond a single run. Instead of living only in code, results become part of the same view you already use to understand behavior.
from phoenix.evals.utils import to_annotation_dataframe

evaluations = to_annotation_dataframe(
    dataframe=results_df
)

px_client.spans.log_span_annotations_dataframe(
    dataframe=evaluations
)
Once this completes, head back to Phoenix. You’ll now see evaluation results attached to your trace data in the annotations column, making it easy to understand which runs passed, which failed, and how quality varies across executions.

Tracing Project with Evaluation Annotations

Congratulations! You’ve run your first evaluation in Phoenix.

Learn More About Evals

Now that you have evaluation results in Phoenix, you can start using them to guide iteration. You can group traces labeled incomplete into a dataset, make changes to prompts or logic, and then run experiments on the same inputs to compare how the outputs differ. The easiest and fastest way to iterate on your application with no code is through the Prompt Playground. The Iterate on Your Prompts guide walks through this workflow in more detail. To go deeper on evaluations, the Evaluations Tutorial covers writing more nuanced evaluators, using different scoring strategies, and comparing quality across runs as your application evolves. This was a simple example, but evaluations in Phoenix can support much more advanced workflows over time.
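As a minimal, local starting point for that workflow, you can filter your evaluation results for the spans the judge marked incomplete and use their inputs as candidates for a dataset. The label column name below is an assumption based on the evaluator’s name; check results_df.columns for the exact name your version produces.
# Sketch: gather spans judged "incomplete" to seed a dataset for iteration.
# "completeness_label" is an assumed column name; confirm it against
# results_df.columns before relying on it.
label_col = "completeness_label"
if label_col in results_df.columns:
    incomplete_spans = results_df[results_df[label_col] == "incomplete"]
    print(f"{len(incomplete_spans)} spans judged incomplete")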