This series of guides walks through a complete workflow for understanding and improving an agent application with Phoenix. The goal is not just to run an application, but to understand how it behaves, determine whether its outputs are correct, and make changes that can be tested and verified. Each guide in this series introduces one piece of that workflow and builds on the previous one.
Tracing answers a basic question: what is happening under the hood of my application when it runs? A trace is a record of a single run of your application, broken down into spans that show what happened at each step. In this guide, you instrument an application and send trace data to Phoenix. Traces show how agents, tasks, and tools executed during a run, and provide the raw data needed for everything that follows. This gives you an end-to-end view of execution that is difficult to reconstruct from logs alone.
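For orientation, instrumenting a Python application and sending its traces to a local Phoenix instance typically looks something like the sketch below. It assumes the arize-phoenix and arize-phoenix-otel packages; the project name and collector endpoint are placeholders, and exact argument names can vary by version.

```python
# A minimal tracing sketch using the arize-phoenix-otel package.
# "agent-demo" and the collector endpoint are placeholder values.
from phoenix.otel import register

# Register an OpenTelemetry tracer provider that exports spans to Phoenix.
tracer_provider = register(
    project_name="agent-demo",
    endpoint="http://localhost:6006/v1/traces",
    auto_instrument=True,  # instrument supported libraries found in the environment
)

# From here, normal application calls (LLM requests, tool invocations, etc.)
# are captured as spans and grouped into traces in the Phoenix UI.
```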
Once you can see what happened, the next question is whether the output was correct. An evaluation produces a score or label for an output, so you can track quality across runs. In this guide, you define evaluations and run them on existing trace data. Evaluations attach quality signals to runs so that correctness or relevance can be reasoned about consistently instead of being judged case by case. This turns traces from observations into something you can measure.
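As a sketch of what this can look like in code, the snippet below pulls spans from Phoenix, asks an LLM judge to label them with one of the built-in evaluation templates, and logs the labels back to the traces. It assumes the arize-phoenix package with its evals module; the project name and judge model are placeholders, the hallucination template is just one example, and argument names may differ by version.

```python
# A minimal evaluation sketch, assuming arize-phoenix and phoenix.evals.
import phoenix as px
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)
from phoenix.trace import SpanEvaluations

client = px.Client()

# Pull existing trace data to evaluate. In practice, you would first map span
# attribute columns onto the columns the template expects (input, reference, output).
spans_df = client.get_spans_dataframe(project_name="agent-demo")

# Ask an LLM judge to label each row; rails constrain it to a fixed label set.
evals_df = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

# Attach the labels back to the spans so they appear alongside the traces.
client.log_evaluations(SpanEvaluations(eval_name="Hallucination", dataframe=evals_df))
```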
Prompt changes often have a direct impact on application behavior. A prompt is the set of instructions and context sent to the model to produce an output. In this guide, you start from prompts captured during real executions, group failing runs into a dataset, and use the Prompt Playground to iterate on prompt variants. You use the Prompt Hub to save and reuse prompts across runs. This lets you evaluate prompt changes against the same data and criteria instead of relying on spot checks.
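The dataset step can also be done in code. The sketch below filters spans from failing runs and saves them as a named dataset that the Prompt Playground (and later experiments) can reference. The error filter and the attribute column names are assumptions that depend on how your application is instrumented.

```python
# A minimal sketch of grouping failing runs into a reusable dataset.
import phoenix as px

client = px.Client()

# Pull spans and keep only the runs that ended in an error.
spans_df = client.get_spans_dataframe(project_name="agent-demo")
failing_df = spans_df[spans_df["status_code"] == "ERROR"]

# Save them as a named dataset so prompt iteration (and later experiments)
# can run against the same examples.
dataset = client.upload_dataset(
    dataset_name="failing-runs",
    dataframe=failing_df,
    input_keys=["attributes.input.value"],
    output_keys=["attributes.output.value"],
)
```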
Once you have changes you want to test, experiments let you compare versions in a controlled way. An experiment is a structured comparison between versions of your application using the same inputs and evaluation criteria. In this guide, you pull down an existing dataset and run experiments in code to compare different versions of your application. Because every version runs against the exact same inputs and is scored with the same criteria, differences in results come from your changes, so you can verify whether they actually improve quality.
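A minimal sketch of that loop, assuming the phoenix.experiments module, might look like the following. The dataset name, the run_agent entry point, and the exact-match evaluator are all placeholders for your own application and criteria.

```python
# A minimal experiment sketch using phoenix.experiments.
import phoenix as px
from phoenix.experiments import run_experiment


def run_agent(inputs: dict) -> str:
    # Stand-in for your application's entry point (the version under test).
    raise NotImplementedError


def task(example):
    # Run the candidate version of the application on one dataset example.
    return run_agent(example.input)


def matches_expected(output, expected) -> bool:
    # A simple code evaluator; how you compare depends on your output shape,
    # and LLM-judge evaluators can be plugged in instead.
    return output == expected


# Pull the dataset created earlier and run the comparison against it.
dataset = px.Client().get_dataset(name="failing-runs")
experiment = run_experiment(
    dataset,
    task,
    evaluators=[matches_expected],
    experiment_name="prompt-v2",
)
```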
If you are new to Phoenix, start with Get Started with Tracing and follow the guides in order. Each step assumes the previous one is in place. Taken together, these guides describe a single workflow for understanding behavior, measuring quality, and improving an application in a controlled way.