How It Works
- Attach evaluators to a dataset — Open a dataset, navigate to the Evaluators tab, and add LLM-based or built-in code evaluators. Configure input mappings once to tell each evaluator where to find its inputs.
- Run an experiment — Execute an experiment against that dataset from the Playground. Attached evaluators run server-side automatically.
- Review scores and traces — Results appear as annotations on the experiment run. Every evaluator execution is traced in its own project so you can navigate from a score to the exact LLM call that produced it.
Evaluator Types
Built-in Code Evaluators
Deterministic evaluators that run without an LLM — Contains, Exact Match, Regex, Levenshtein Distance, and JSON Distance.
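To make the behavior of these checks concrete, here is a minimal Python sketch of the logic each deterministic evaluator applies. These are illustrative reimplementations, not Phoenix's actual code; in particular, the normalization used for JSON Distance is an assumption.

```python
import json
import re

def contains(output: str, reference: str) -> bool:
    # Passes if the reference substring appears anywhere in the output.
    return reference in output

def exact_match(output: str, reference: str) -> bool:
    # Passes only on a character-for-character match.
    return output == reference

def regex_match(output: str, pattern: str) -> bool:
    # Passes if the pattern matches anywhere in the output.
    return re.search(pattern, output) is not None

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def json_distance(output: str, reference: str) -> float:
    # One plausible definition (an assumption here): edit distance between
    # canonicalized JSON serializations, normalized to [0, 1], so that
    # key order does not matter and 0.0 means structurally identical.
    a = json.dumps(json.loads(output), sort_keys=True)
    b = json.dumps(json.loads(reference), sort_keys=True)
    return levenshtein(a, b) / max(len(a), len(b), 1)
```

Because these run without an LLM, they are cheap, fast, and fully reproducible across experiment runs.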
LLM Evaluators
LLM-as-a-judge evaluators backed by Phoenix-managed prompts. Use pre-built templates for common tasks like correctness and tool response handling, or write your own.
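As a rough illustration of how an LLM-as-a-judge evaluator is structured, the sketch below builds a correctness prompt from a dataset example and parses the judge's reply into a score. The template wording, variable names, and label set are hypothetical, not Phoenix's managed prompts.

```python
# Hypothetical judge template; {input}, {output}, and {reference} stand in
# for whatever variables the evaluator's input mapping supplies.
CORRECTNESS_TEMPLATE = """\
You are grading an answer for correctness.

Question: {input}
Reference answer: {reference}
Submitted answer: {output}

Reply with exactly one word: "correct" or "incorrect"."""

def build_judge_prompt(example: dict) -> str:
    # Fill the template from a mapped dataset example.
    return CORRECTNESS_TEMPLATE.format(**example)

def parse_label(completion: str) -> int:
    # Map the judge's one-word verdict to a numeric annotation score.
    return 1 if completion.strip().lower() == "correct" else 0
```

A custom evaluator follows the same shape: a prompt template with mapped variables, plus a rule for turning the model's completion into a score.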
Why Use Server Evals
- Attach once, evaluate everywhere — Evaluators are defined on the dataset, not the experiment. Every Playground run against that dataset automatically records scores.
- No local setup required — Built-in evaluators run entirely server-side. LLM evaluators use the model configuration already set up on your Phoenix instance — no SDK, API keys, or local dependencies needed.
- Flexible input mapping — Map evaluator variables to any dataset field — input, output, reference, or metadata — using JSON paths for nested values.
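The mapping idea above can be sketched in a few lines: each evaluator variable points at a path into the dataset row, and nested values are reached with dotted keys and list indexes. The path syntax shown here (`output.choices[0].text`) is illustrative; Phoenix's actual JSON-path dialect may differ.

```python
import re

def resolve_path(row: dict, path: str):
    # Walk dotted keys and [n] list indexes through a dataset row.
    value = row
    for part in re.findall(r"[^.\[\]]+|\[\d+\]", path):
        if part.startswith("["):
            value = value[int(part[1:-1])]   # list index, e.g. [0]
        else:
            value = value[part]              # dict key, e.g. choices
    return value

def map_inputs(row: dict, mapping: dict) -> dict:
    # mapping: evaluator variable name -> path into the dataset row.
    return {var: resolve_path(row, path) for var, path in mapping.items()}
```

Because the mapping is configured once on the dataset, every subsequent experiment run resolves evaluator inputs the same way without per-run wiring.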
- Full traceability — Each evaluator run is traced in a dedicated project, so you can step from an annotation score into the underlying LLM call, making it easy to debug and refine evaluation criteria.

