Follow along with code: This guide has a companion notebook with runnable code examples. Find it here, and go to Part 3: Compare Prompt Versions.
Build Two New Prompt Versions
In Test Prompts at Scale, our experiment gave us some insights into why our prompt was underperforming at only 53% accuracy. In this section, we'll build two new versions of the prompt based on that analysis.
Edit Prompt Template (Version 3)
The prompt template is the specific text passed to your LLM. In Test Prompts at Scale, we saw that 30/71 errors came from the broad_vs_specific error type, so Version 3 adds a custom instruction targeting that error type (see the sketch after the tabs below).
- UI
- Python
- TypeScript
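If you prefer to create the new version from code rather than the UI, the sketch below shows one way to do it, assuming the Phoenix Python client (`arize-phoenix-client`) and its `client.prompts.create` / `PromptVersion` API. The prompt name `support-classifier` comes from this guide, but the instruction wording, the category framing, and the `{{ message }}` template variable are illustrative placeholders rather than the guide's exact template.

```python
# Minimal sketch: upload Version 3 of the prompt with a custom instruction.
# Assumes the Phoenix endpoint and API key are configured via environment variables.
from phoenix.client import Client
from phoenix.client.types import PromptVersion

client = Client()

# Placeholder template text: the added sentence targets the broad_vs_specific
# error type; the wording and the {{ message }} variable are assumptions.
template = (
    "You are a support ticket classifier. Classify the user's message into "
    "exactly one intent category.\n\n"
    "When a message could match both a broad and a more specific category, "
    "choose the most specific category that fully covers the request.\n\n"
    "Message: {{ message }}"
)

prompt_v3 = client.prompts.create(
    name="support-classifier",
    prompt_description="v3: adds a broad-vs-specific tie-breaking instruction",
    version=PromptVersion(
        [{"role": "system", "content": template}],
        model_name="gpt-4o-mini",  # model and parameters unchanged in this version
    ),
)
```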
Edit Prompt Parameters (Version 4)
In Phoenix, Prompt Objects are more than just the Prompt Template - they include other parameters that can have a huge impact on the success of your prompt. In this section, we'll upload another Prompt Version, this one with adjusted model parameters, so we can test it out later. Here are common prompt parameters:
- Model Choice (GPT-4.1, Claude Sonnet 4.5, Gemini 3, etc.) – Different models vary in reasoning depth, instruction-following ability, speed, and cost; selecting the right one can dramatically affect accuracy, latency, and overall cost.
- Temperature – Lower values make responses more consistent and deterministic; higher values increase variety and creativity.
- Top-p / Top-k – Control how many token options the model considers when generating text; useful for balancing precision and diversity.
- Frequency / Presence Penalties – Help reduce repetition or encourage mentioning new concepts.
- Tool Descriptions – Clearly defined tools (like web search or dataset retrieval) help the model ground its outputs and choose the right action during generation.
| Parameter | Current | New | Description |
|---|---|---|---|
| Model | gpt-4o-mini | gpt-4.1-mini | Slightly higher cost but improved reasoning and classification accuracy; better suited for nuanced intent detection. |
| Temperature | 1.0 | 0.3 | Lowering temperature makes outputs more consistent and less random—ideal for deterministic tasks like classification. |
| Top-p | 1.0 | 0.8 | Reduces the sampling range, encouraging the model to choose higher-probability tokens for more stable predictions. |
- UI
- Python
- TypeScript
Edit prompt parameters
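The parameter changes can also be uploaded from code. This is a sketch that assumes `PromptVersion.from_openai` accepts an OpenAI-style chat completion request built from `CompletionCreateParamsBase`; the template string below is a shortened stand-in for the Version 3 template, and the description text is illustrative.

```python
# Minimal sketch: upload Version 4 with the tuned parameters from the table above.
from openai.types.chat.completion_create_params import CompletionCreateParamsBase
from phoenix.client import Client
from phoenix.client.types import PromptVersion

client = Client()

# Reuse the Version 3 template text (shortened placeholder here).
template = (
    "You are a support ticket classifier. Classify the user's message into "
    "exactly one intent category. When a message could match both a broad and "
    "a more specific category, choose the most specific one.\n\nMessage: {{ message }}"
)

params = CompletionCreateParamsBase(
    model="gpt-4.1-mini",  # upgraded model for better classification accuracy
    temperature=0.3,       # lower temperature for more consistent, deterministic outputs
    top_p=0.8,             # narrower sampling range for more stable predictions
    messages=[{"role": "system", "content": template}],
)

prompt_v4 = client.prompts.create(
    name="support-classifier",
    prompt_description="v4: same instruction, tuned parameters and upgraded model",
    version=PromptVersion.from_openai(params),
)
```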
Compare Prompt Versions
Now that we've created two new versions of our prompt, we need to test them on our dataset to see whether accuracy improved and which changes led to the biggest gains.
- Python
- TypeScript
First, head to your support-classifier prompt in the Phoenix UI and copy the corresponding version IDs for Version 3 and Version 4.
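With both version IDs in hand, you can run one experiment per version and compare accuracy in the Phoenix UI. The sketch below assumes the dataset and exact-match evaluator roughly match the ones built earlier in this guide; the dataset name `support-tickets`, the `message` template variable, the `label` key, and the placeholder version IDs are all assumptions to replace with your own values.

```python
# Minimal sketch: run an experiment for Version 3 and Version 4 of the prompt.
# Assumes PHOENIX and OPENAI credentials are set via environment variables.
import phoenix as px
from openai import OpenAI
from phoenix.client import Client
from phoenix.experiments import run_experiment

phoenix_client = Client()
oai_client = OpenAI()

# Placeholder dataset name -- use the dataset you uploaded earlier in this guide.
dataset = px.Client().get_dataset(name="support-tickets")

def accuracy(output, expected) -> bool:
    # Simple exact-match evaluator; "label" is an assumed key in the expected output.
    return output.strip().lower() == expected["label"].strip().lower()

def make_task(prompt_version_id: str):
    # Pull the specific prompt version by the ID copied from the Phoenix UI.
    prompt = phoenix_client.prompts.get(prompt_version_id=prompt_version_id)

    def task(input) -> str:
        # Format the stored prompt with the example's message and call the model.
        formatted = prompt.format(variables={"message": input["message"]})
        resp = oai_client.chat.completions.create(**formatted)
        return resp.choices[0].message.content

    return task

for label, version_id in [("v3", "VERSION_3_ID"), ("v4", "VERSION_4_ID")]:
    run_experiment(
        dataset,
        make_task(version_id),
        evaluators=[accuracy],
        experiment_name=f"support-classifier-{label}",
    )
```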

Summary
In this section, we translated our analysis into measurable improvement. We built two new prompt versions, ran them through experiments, and quantified the gains:
- Custom instruction only: Accuracy improved from 53% → 61%
- Instruction + tuned parameters + upgraded model: Accuracy climbed further to 74%

