Creating a Custom LLM Evaluator with a Benchmark Dataset
Learn how to build a custom LLM-as-a-Judge evaluator by creating a benchmark dataset tailored to your use case, enabling rigorous evaluation beyond standard templates.
In this tutorial, you’ll learn how to build a custom LLM-as-a-Judge evaluator tailored to your specific use case. While Phoenix provides several pre-built evaluators that have been tested against benchmark datasets, these may not always cover the nuances of your application. So how can you achieve the same level of rigor when your use case falls outside the scope of standard evaluators?

We’ll walk through how to create your own benchmark dataset using a small set of human-annotated examples. This dataset will allow you to build and refine a custom evaluator by revealing failure cases and guiding iteration.

The diagram below provides an overview of the process we will follow in this walkthrough.
In this tutorial, we’ll ask an LLM to generate expense reports from receipt images provided as public URLs. Running the cells below will generate traces, which you can explore directly in Phoenix for annotation. We’ll use GPT-4.1, which supports image inputs.
```python
from openai import OpenAI

client = OpenAI()


def extract_receipt_data(input):
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Analyze this receipt and return a brief summary for an expense report. Only include category of expense, total cost, and summary of items",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": input,
                        },
                    },
                ],
            }
        ],
        max_tokens=500,
    )
    return response
```
By following the auto-instrumentation setup, running the code below will automatically send traces to Phoenix.
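For reference, a minimal version of that setup might look like the sketch below: it registers a tracer for the `receipt-classification` project, instruments the OpenAI client with OpenInference, and runs the extraction over a couple of placeholder receipt URLs. The URLs are illustrative only; substitute your own publicly accessible receipt images.

```python
# Sketch: auto-instrument OpenAI calls so each extraction is traced in Phoenix.
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

tracer_provider = register(project_name="receipt-classification")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Placeholder URLs for illustration; replace with your own receipt images.
receipt_urls = [
    "https://example.com/receipts/coffee-shop.png",
    "https://example.com/receipts/office-supplies.png",
]

for url in receipt_urls:
    response = extract_receipt_data(url)
    print(response.choices[0].message.content)
```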
After generating traces, open Phoenix to begin annotating your dataset. In this example, we’ll annotate based on “accuracy”, but you can choose any evaluation criterion that fits your use case. Just be sure to update the query below to match the annotation key you’re using—this ensures the annotated examples are included in your benchmark dataset.
```python
import os

import pandas as pd
import phoenix as px
from phoenix.client import Client
from phoenix.client.types import spans

client = Client(api_key=os.getenv("PHOENIX_API_KEY"))

# replace "accuracy" if you chose to annotate on different criteria
query = spans.SpanQuery().where("annotations['accuracy']")

spans_df = client.spans.get_spans_dataframe(
    query=query, project_identifier="receipt-classification"
)
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier="receipt-classification"
)
full_df = annotations_df.join(spans_df, how="inner")

dataset = client.datasets.create_dataset(
    name="annotated-receipts",
    dataframe=full_df,
    input_keys=["attributes.input.value"],
    output_keys=["attributes.llm.output_messages"],
    metadata_keys=["result.label", "result.score", "result.explanation"],
)
```
Next, we’ll create a baseline evaluation template and define both the task and the evaluation function. Once these are set up, we’ll run an experiment to compare the evaluator’s performance against our ground-truth annotations. In this case, the task function calls evaluator.evaluate() directly with a ClassificationEvaluator, and the experiment’s evaluation function simply checks whether the judge’s label matches the annotated label; a sketch of this setup follows the prompt template below.
Phoenix evals does not yet support multimodal inputs (e.g. images). The evaluator below assesses the expense report text output for completeness and structure rather than comparing against the original receipt image.
```python
choices = ["accurate", "almost accurate", "inaccurate"]

prompt_template = """You are an evaluator tasked with assessing the quality of a model-generated expense report.

The model was instructed to analyze a receipt image and return a brief summary including: category of expense, total cost, and summary of items.

---
MODEL OUTPUT (Expense Report): {output}
---

Evaluate whether the expense report is complete and well-structured. Assign one of the following labels. Only include the label:
- **"accurate"** – Includes expense category, total cost, and item summary; all information looks reasonable
- **"almost accurate"** – Mostly correct but with small issues (e.g., missing one element or vague category)
- **"inaccurate"** – Substantially wrong or missing critical information
"""
```
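The template above only defines the judge. The sketch below is one way to wire up the rest of the baseline experiment: an LLM wrapper, a ClassificationEvaluator, a task that runs the judge on each dataset example’s stored output, and an experiment evaluator that compares the judge’s label to the human annotation kept in metadata. Treat it as illustrative rather than definitive; the `LLM` and `ClassificationEvaluator` imports, the parameter binding (`expected` for the stored output, `metadata` for annotation fields), and the experiment name should be checked against your Phoenix and phoenix-evals versions.

```python
# Minimal baseline-experiment sketch; verify imports against your Phoenix version.
from phoenix.evals import ClassificationEvaluator
from phoenix.evals.llm import LLM
from phoenix.experiments import run_experiment

llm = LLM(provider="openai", model="gpt-4.1")

receipt_evaluator = ClassificationEvaluator(
    name="receipt_accuracy",
    prompt_template=prompt_template,
    llm=llm,
    choices=choices,
)


def task(expected):
    # `expected` holds the dataset example's output (the generated expense report);
    # stringifying keeps the sketch simple if the stored output is a dict.
    scores = receipt_evaluator.evaluate({"output": str(expected)})
    return scores[0].label


def matches_annotation(output, metadata):
    # Compare the judge's label against the human-annotated label stored in metadata.
    return output == metadata["result.label"]


experiment = run_experiment(
    dataset,
    task=task,
    evaluators=[matches_annotation],
    experiment_name="baseline-receipt-evaluator",
)
```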
Next, we’ll refine our evaluation prompt template by adding more specific classification rules based on the gaps we saw in the previous iteration. This additional guidance helps improve accuracy and ensures the evaluator’s judgments better align with human expectations.
```python
prompt_template = """You are an evaluator tasked with assessing the quality of a model-generated expense report.

The model was instructed to analyze a receipt image and return a brief summary including: category of expense, total cost, and summary of items.

---
MODEL OUTPUT (Expense Report): {output}
---

Evaluate the following and assign one of the following labels. Only include the label:
- **"accurate"** – Total price, itemized list, and expense category are all present and look reasonable. All three must be present to get this label.
- **"almost accurate"** – Mostly correct but with small issues. For example, expense category is too vague or one element is missing.
- **"inaccurate"** – Substantially wrong or missing critical information. For example, missing total price entirely.
"""

receipt_evaluator = ClassificationEvaluator(
    name="receipt_accuracy",
    prompt_template=prompt_template,
    llm=llm,
    choices=choices,
)
```
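Because the refined `receipt_evaluator` replaces the baseline one, the same task and evaluator functions from the earlier sketch can be reused to re-run the experiment and check whether agreement with the human annotations improved. A minimal re-run might look like this (the experiment name is arbitrary):

```python
# Re-run the experiment with the refined evaluator and compare against the baseline run.
refined_experiment = run_experiment(
    dataset,
    task=task,
    evaluators=[matches_annotation],
    experiment_name="refined-receipt-evaluator",
)
```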
Once your evaluator reaches a performance level you’re satisfied with, it’s ready for use. The target score will depend on your benchmark dataset and specific use case, and you can define the thresholds and metrics you want the evaluator to achieve. That said, you can continue applying the techniques from this tutorial to refine and iterate until the evaluator meets your desired level of quality.
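As an illustration of one such metric, a simple agreement rate between the judge’s labels and your annotations could be computed like this (the labels below are made up for the example):

```python
# Illustrative agreement check between judge labels and human annotations.
judge_labels = ["accurate", "inaccurate", "accurate", "almost accurate"]
human_labels = ["accurate", "almost accurate", "accurate", "almost accurate"]

agreement = sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)
print(f"Agreement with human annotations: {agreement:.0%}")  # 75% in this toy example
```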