LLM Quality vs Cost Evaluation Study

When building with large language models, achieving top-tier quality is often the primary goal. For any serious application, sacrificing quality for cost isn't an option. But is it always necessary to pay a premium for the 'best and brightest' model without a clear understanding of what that premium costs?

To find a truly optimal solution, it's essential to evaluate different models and see whether comparable quality can be achieved more cost-effectively. That's exactly what I did: I ran a systematic experiment to measure the relationship between quality, reliability, and cost for a complex, large-context task.

Large language models are expensive to run at scale. For my speaker attribution task on 45-minute podcast episodes, I needed to know: can I get good quality for less money? I tested 7 LLMs across 4 diverse episodes to find out. What I discovered wasn't just about cost optimization: multi-episode validation turned out to be essential for avoiding production disasters.
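
To make the setup concrete, here is a minimal sketch of the evaluation loop. The helper names (evaluate_models, run_attribution, score_accuracy, episode_cost) are hypothetical placeholders, not the study's actual code; the point is simply that every model is scored on every episode, so per-model mean accuracy, per-episode spread, and cost can be compared side by side.

```python
import statistics
from typing import Callable, Dict, List

def evaluate_models(
    models: List[str],
    episodes: Dict[str, str],                      # episode_id -> transcript text
    run_attribution: Callable[[str, str], str],    # (model, transcript) -> attributed transcript
    score_accuracy: Callable[[str, str], float],   # (episode_id, output) -> accuracy in %
    episode_cost: Callable[[str, str], float],     # (model, output) -> USD per episode
) -> Dict[str, dict]:
    """Run every model on every episode, then aggregate accuracy and cost."""
    results = {}
    for model in models:
        accuracies, costs = [], []
        for episode_id, transcript in episodes.items():
            output = run_attribution(model, transcript)
            accuracies.append(score_accuracy(episode_id, output))
            costs.append(episode_cost(model, output))
        results[model] = {
            "mean_accuracy": statistics.mean(accuracies),
            "accuracy_range": (min(accuracies), max(accuracies)),  # spread across episodes
            "avg_cost_per_episode": statistics.mean(costs),
        }
    return results
```

The accuracy range across episodes is the field that matters most here: a strong score on one episode can hide a collapse on another, which a single-episode test would never reveal.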

Key Findings

  • Multi-episode testing is critical: Models that performed perfectly on one episode failed completely on others (same settings, different content)
  • Best for reliability: Claude Sonnet 4.5 (92.2% average accuracy, zero variance across episodes, $0.23/episode)
  • Best for peak quality: GPT-5 (91.0% average accuracy, reaches 93-97% at high token limits, $0.27/episode)
  • Best cost-performance: Grok 4 Fast Reasoning (89.0% average accuracy, 97% cheaper than premium models, $0.007/episode)
  • Avoid for production: Gemini 2.5 Flash (35.0% average accuracy with wildly unpredictable 0-93% range despite low cost)
  • The max_output_tokens parameter matters: Each model has a distinct behavioral pattern. Some plateau early, others need room to "breathe" (see the sketch after this list)
  • Cost optimization is possible: Up to 97% cost reduction while maintaining acceptable quality, but only with careful multi-episode validation
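
To illustrate the max_output_tokens point above, here is a hedged sketch of the kind of sweep involved. call_model stands in for whichever provider SDK is in use, and the token limits shown are illustrative values, not the study's exact grid.

```python
from typing import Callable, Dict

# Illustrative output-token caps, not the study's exact grid.
TOKEN_LIMITS = [4_000, 8_000, 16_000, 32_000]

def sweep_token_limits(
    model: str,
    transcript: str,
    call_model: Callable[..., str],         # hypothetical SDK wrapper accepting max_output_tokens
    score_accuracy: Callable[[str], float], # (output) -> accuracy in %
) -> Dict[int, float]:
    """Re-run the same episode at increasing output-token caps.

    Plotting the returned curve shows whether a model plateaus early
    or keeps improving when given more room to "breathe".
    """
    curve = {}
    for limit in TOKEN_LIMITS:
        output = call_model(model, transcript, max_output_tokens=limit)
        curve[limit] = score_accuracy(output)
    return curve
```

Running this per model is what exposes the distinct behavioral patterns: some curves flatten at the lowest caps, while others keep climbing until the limit is raised.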