Before We Start
To follow along, you’ll need to have completed Get Started with Tracing, which means we have:
- A Financial Analysis and Research Chatbot
- Trace Data in Phoenix
Step 1: Make Sure You Have Data in Phoenix
Before we can run evaluations, we need something to evaluate. Evaluations in Phoenix run over existing trace data. If you followed the tracing guide, you should already have:
- A project in Phoenix
- Traces containing LLM inputs and outputs
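If you want to confirm that before continuing, a quick check like the following works. This is a sketch that assumes a local Phoenix instance at the default address; replace the project name with the one you traced into.

```python
import phoenix as px

# Connect to the running Phoenix instance (defaults to http://localhost:6006).
client = px.Client()

# Pull the spans for the project created in the tracing guide.
# "default" is a placeholder -- use the project name you traced into.
spans_df = client.get_spans_dataframe(project_name="default")

print(f"Found {len(spans_df)} spans")
print(spans_df[["name", "span_kind"]].head())
```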
Step 2: Define an Evaluation
Now that we have trace data, the next question is how we decide whether an output is actually good. An evaluation makes that decision explicit. Instead of manually inspecting outputs or relying on intuition, we define a rule that Phoenix can apply consistently across many runs.
In Phoenix, evaluations can be written in different ways. In this guide, we’ll use an LLM-as-a-judge evaluation as a simple starting point. This works well for questions like correctness or relevance, and lets us get metrics quickly. (If you’d rather use code-based evaluations, you can follow the guide on setting those up.)
For LLM-as-a-judge evaluations, that means defining three things:
- A prompt that describes the judgment criteria
- An LLM that performs the evaluation
- The data we want to score
Define the Evaluation Prompt
We’ll start by defining the prompt that tells the evaluator how to judge an answer. We’re using attributes.input.value and attributes.output.value, since that is how our span data stores inputs and outputs.
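The exact wording is up to you; a minimal correctness-style template might look like the sketch below. The {attributes.input.value} and {attributes.output.value} placeholders must match the column names of the span data we fetch in Step 3.

```python
# Template for the LLM judge. The placeholders are filled from the span
# dataframe columns, which Phoenix names attributes.input.value and
# attributes.output.value for the LLM input and output.
CORRECTNESS_TEMPLATE = """
You are evaluating the response of a financial analysis and research chatbot.

[Question]
{attributes.input.value}

[Response]
{attributes.output.value}

Decide whether the response correctly and helpfully answers the question.
Respond with a single word: "correct" or "incorrect".
"""
```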
Define the LLM Judge
Next, we’ll define the model that will act as the judge.
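For example, with Phoenix’s OpenAI wrapper. The model name here is just an example, and OPENAI_API_KEY must be set in your environment.

```python
from phoenix.evals import OpenAIModel

# The judge model. Requires OPENAI_API_KEY in the environment.
# "gpt-4o-mini" is an example; swap in whichever model you prefer.
eval_model = OpenAIModel(model="gpt-4o-mini", temperature=0.0)
```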
Create the Evaluator
Now we can combine the prompt and model into an evaluator.
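There is more than one way to do this; as a sketch, we can bind the template and model together with phoenix.evals.llm_classify in a small helper. The rails list and the added score column are choices made here for illustration, not requirements.

```python
from phoenix.evals import llm_classify

# The allowed labels ("rails") the judge's output is snapped to.
RAILS = ["correct", "incorrect"]

def correctness_evaluator(spans_df):
    """Run the LLM judge over a dataframe of spans and return labels/scores."""
    results = llm_classify(
        spans_df,                          # one row per span to evaluate
        template=CORRECTNESS_TEMPLATE,     # prompt defined above
        model=eval_model,                  # judge model defined above
        rails=RAILS,
        provide_explanation=True,          # keep the judge's reasoning for debugging
    )
    # Add a numeric score so Phoenix can aggregate the results.
    results["score"] = (results["label"] == "correct").astype(int)
    return results
```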
Step 3: Run the Evaluation
Next, we’ll pull our trace data from Phoenix and run the evaluator on it. First, fetch the spans we want to evaluate. We wrap the evaluation in suppress_tracing(), since auto-instrumentation is enabled and we do not want to trace each of these OpenAI evaluation calls in our project.
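Putting that together might look roughly like this. The project name and the LLM-span filter are assumptions; the import path for suppress_tracing can also differ by version (it is exposed by openinference.instrumentation as well).

```python
import phoenix as px
from phoenix.trace import suppress_tracing  # or: from openinference.instrumentation import suppress_tracing

client = px.Client()  # reuse the client from Step 1 if you already have one

# Fetch only the LLM spans, since those carry the inputs and outputs we score.
spans_df = client.get_spans_dataframe(
    "span_kind == 'LLM'",
    project_name="default",  # placeholder -- use your project name
)

# Run the judge without tracing the evaluation calls themselves.
with suppress_tracing():
    eval_results = correctness_evaluator(spans_df)

eval_results.head()
```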
Step 4: Log Evaluation Results to Phoenix
Finally, we’ll log the evaluation results back to Phoenix so they show up alongside our traces in the UI. This is what makes evaluations useful beyond a single run. Instead of living only in code, results become part of the same view you already use to understand behavior.
(Screenshot: Tracing Project with Evaluation Annotations)
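A sketch of that last step, assuming the results are still indexed by span ID (get_spans_dataframe indexes spans by context.span_id, and llm_classify preserves that index) and using the SpanEvaluations / log_evaluations client API. The evaluation name "Correctness" is just a label chosen here and appears as the annotation name in the UI.

```python
from phoenix.trace import SpanEvaluations

# eval_results comes from the previous step and is indexed by context.span_id,
# which is how Phoenix matches each row back to its span.
client.log_evaluations(
    SpanEvaluations(
        eval_name="Correctness",  # shown as the annotation name in the UI
        dataframe=eval_results,
    )
)
```

Once logged, each evaluated span in the tracing project shows its Correctness label (and the judge’s explanation, if you kept it) alongside the trace.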

