
Features
Phoenix Evals provides lightweight, composable building blocks for writing and running evaluations on LLM applications. It can be installed independently of the arize-phoenix package and is available in both Python and TypeScript.
- Works with your preferred model SDKs via adapters (OpenAI, LiteLLM, LangChain, AI SDK). Phoenix lets you configure which foundation model to use as the judge, including models from OpenAI, Anthropic, Google Gemini, and many more; see Configuring the LLM and the first sketch after this list.
- Powerful input mapping and binding - easily map nested or otherwise complex data structures to the inputs an evaluator expects.
- Pre-built metrics for common evaluation tasks like hallucination detection - Phoenix provides pre-tested eval templates for common tasks such as RAG and function calling; learn more about pre-tested templates here. Each eval is benchmarked against a variety of judge models, and the most up-to-date benchmarks are on GitHub.
- Evaluators are natively instrumented via OpenTelemetry tracing for observability and dataset curation. See Evaluator Traces for an overview.
- Blazing fast performance - built-in concurrency and batching deliver up to a 20x speedup over issuing API calls one at a time. See Executors for details on how this works.
- Tons of convenience features to improve the developer experience!
- Run evals on your own data - native dataframe and data transformation utilities make it easy to run evaluations on your own data, whether that's logs, traces, or datasets downloaded for benchmarking (see the second sketch after this list).
- Built-in Explanations - every Phoenix evaluation can require the judge model to explain the rationale behind its label. This boosts performance and helps you understand and improve your evals.
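
As a rough illustration of configuring a judge model and using a pre-built template, here is a minimal sketch. It assumes the Python phoenix.evals interface (OpenAIModel, llm_classify, and the hallucination template constants); exact module paths and names may differ across package versions, and the example data is made up.

```python
# Minimal sketch: classify one example with the pre-built hallucination
# template, using OpenAI as the judge. Assumes the Python phoenix.evals API;
# names may differ in other versions of the package.
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# The judge is configurable: swap OpenAIModel for another provider's wrapper.
eval_model = OpenAIModel(model="gpt-4o")

# Example data (made up): the template expects input, reference, and output columns.
df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?"],
        "reference": ["Hamlet is a tragedy written by William Shakespeare."],
        "output": ["Hamlet was written by Christopher Marlowe."],
    }
)

# Rails constrain the judge's answer to the template's allowed labels.
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

result_df = llm_classify(
    df,
    model=eval_model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,  # ask the judge to justify each label
)
print(result_df[["label", "explanation"]])
```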
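
The second sketch shows the dataframe workflow described in the last two bullets: running several pre-built evaluators over your own data in one concurrent pass, with an explanation attached to every row. The same caveats apply; the evaluator and parameter names are assumed from one version of the Python package.

```python
# Minimal sketch: run pre-built evaluators over a dataframe of your own data
# (logs, exported traces, or a benchmark set). Assumes the Python phoenix.evals
# API; names may differ in other versions of the package.
import pandas as pd

from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

eval_model = OpenAIModel(model="gpt-4o")

# Example data (made up); real usage would load logs, traces, or a dataset.
df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?"],
        "reference": ["Hamlet is a tragedy written by William Shakespeare."],
        "output": ["William Shakespeare wrote Hamlet."],
    }
)

# run_evals fans the judge calls out concurrently (see Executors), which is
# where the large speedup over one-at-a-time API calls comes from. It returns
# one result dataframe per evaluator, in order.
hallucination_df, qa_df = run_evals(
    dataframe=df,
    evaluators=[HallucinationEvaluator(eval_model), QAEvaluator(eval_model)],
    provide_explanation=True,  # attach the judge's rationale to each row
)
print(hallucination_df[["label", "explanation"]])
```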
LLM as a Judge
Learn how LLM-based evaluation works and best practices
Pre-built Evaluators
Use pre-tested evaluators for hallucinations, relevance, toxicity, and more
Custom Evaluators
Build custom evaluators tailored to your use case
Datasets & Experiments
Run evaluations systematically with datasets and experiments

