
Building in Public

LLM Quality vs Cost Evaluation Study

When building with large language models, top-tier quality is often the primary goal; for any serious application, sacrificing quality for cost isn't an option. But is it always necessary to pay a premium for the 'best and brightest' model without a clear understanding of the costs?

To find a truly optimal solution, it's essential to evaluate different models to see if comparable quality can be achieved more cost-effectively. That's exactly what I did in my study. I ran a systematic experiment to find the true relationship between quality, reliability, and cost for a complex, large-context task.

Large language models are expensive to run at scale. For my speaker attribution task on 45-minute podcast episodes, I needed to know: can I get good quality for less money? I tested 7 LLMs across 4 diverse episodes to find out. What I discovered wasn't just about cost optimization. It revealed that multi-episode validation is essential for avoiding production disasters.

Key Findings

  • Multi-episode testing is critical: Models that performed perfectly on one episode failed completely on others (same settings, different content)
  • Best for reliability: Claude Sonnet 4.5 (92.2% average accuracy, zero variance across episodes, $0.23/episode)
  • Best for peak quality: GPT-5 (91.0% average accuracy, reaches 93-97% at high token limits, $0.27/episode)
  • Best cost-performance: Grok 4 Fast Reasoning (89.0% average accuracy, 97% cheaper than premium models, $0.007/episode)
  • Avoid for production: Gemini 2.5 Flash (35.0% average accuracy with wildly unpredictable 0-93% range despite low cost)
  • The max_output_tokens parameter matters: Each model has a distinct behavioral pattern. Some plateau early, others need room to "breathe"
  • Cost optimization is possible: Up to 97% cost reduction while maintaining acceptable quality, but only with careful multi-episode validation (see the evaluation sketch below this list)
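To make the multi-episode point concrete, here is a minimal sketch of the kind of evaluation harness the study implies: run every model against every episode with identical settings, then look at the spread as well as the mean. The model list, episode IDs, and the run_attribution stub are placeholders, not the actual code from the study.

```python
from statistics import mean

# Placeholder names; the study covered 7 models and 4 episodes.
MODELS = ["claude-sonnet-4.5", "gpt-5", "grok-4-fast-reasoning", "gemini-2.5-flash"]
EPISODES = ["ep01", "ep02", "ep03", "ep04"]

def run_attribution(model: str, episode: str, max_output_tokens: int) -> tuple[float, float]:
    """Run the speaker-attribution prompt on one episode and return
    (accuracy, cost_usd). Stubbed here; in practice this wraps the provider's
    API call plus a comparison against a hand-labelled answer key."""
    raise NotImplementedError

def evaluate(max_output_tokens: int = 8192) -> None:
    for model in MODELS:
        runs = [run_attribution(model, ep, max_output_tokens) for ep in EPISODES]
        accuracies = [acc for acc, _ in runs]
        avg_cost = mean(cost for _, cost in runs)
        # The spread (max - min) is what a single-episode test hides: a model can
        # score 93% on one episode and near 0% on another with identical settings.
        print(f"{model:24s} mean={mean(accuracies):.1%} "
              f"spread={max(accuracies) - min(accuracies):.1%} "
              f"cost=${avg_cost:.3f}/episode")
```

A model only clears the bar when both the mean accuracy and the spread across episodes are acceptable; trusting a single episode's score is exactly how a 0-93% surprise slips into production.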

How I Apply Spec-Driven AI Coding

LLMs write better code when they focus on a single task instead of trying to solve multiple issues in a codebase at once. Carl Rannaberg recently introduced a plan-based AI coding workflow in his article, "My current AI coding workflow", where the LLM first runs a planner phase to create a task plan for the feature you are developing, and then an executor phase works through the plan, generating code task by task. I used the method for a while and liked it a lot.

Now, there is a new kid on the block, Kiro.dev from AWS, that goes even further by allowing the planner mode to first create the requirements spec, then the design, and only after that, the tasks list. As I'm still on Kiro's waitlist, I applied the methodology as a unified workflow for all the coding assistants at my disposal: Cursor, Claude, and Gemini.

I've put the framework up on GitHub at https://github.com/andreskull/spec-driven-ai-coding

A Practical Guide to Deploying Dagster Hybrid on GCP

In my previous post, "How I Built a Modern Data Orchestration Layer for Finfluencers.trade", I explained why I chose Dagster for my project — a platform that transforms financial content into structured, auditable data. The core challenge I wanted to solve was creating complete traceability and lineage for every piece of data, from raw podcast transcripts and article text to final predictions on the website.

This post dives into the how — a detailed, technical guide for setting up a Dagster Hybrid deployment on Google Cloud Platform (GCP).

How I Built a Traceable Data Pipeline for Finfluencers.trade

The core of my project, Finfluencers.trade, is turning a chaotic stream of public content—podcasts, videos, and tweets—into a structured, auditable database of financial predictions. To do this reliably, I needed a rock-solid system for managing how data moves and transforms.

This post explains how I set up that system—my data pipeline's orchestration—from day one.
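To show what that traceability looks like in practice, here is a minimal Dagster sketch using software-defined assets. The asset names and return values are illustrative placeholders, not the actual Finfluencers.trade pipeline; the point is that declaring one asset as the input of another is what gives the orchestrator a lineage edge from raw transcript to final prediction.

```python
from dagster import Definitions, asset

@asset
def raw_transcript() -> str:
    """Ingest and store the raw podcast transcript (illustrative stub)."""
    return "...transcript text..."

@asset
def predictions(raw_transcript: str) -> list[dict]:
    """Extract structured predictions from the transcript. Taking
    raw_transcript as a parameter records the upstream dependency,
    so lineage from source text to prediction is tracked automatically."""
    return [{"ticker": "EXAMPLE", "claim": "placeholder"}]

defs = Definitions(assets=[raw_transcript, predictions])
```

With every transformation modelled as an asset, "which transcript produced this prediction?" is answered by the asset graph rather than by convention.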

Securing a Newsletter Subscription Form on a Static MkDocs Site

Adding a newsletter subscription form to a static site seems straightforward—until you consider security implications. In this post, I'll share how I implemented a secure newsletter subscription for my MkDocs-based blog using Cloudflare Turnstile and a serverless API.

For the Finfluencers Trade blog, built with MkDocs (a static site generator), I wanted a simple way for readers to subscribe to a newsletter for updates. Static sites can't run server-side code directly, so handling form submissions requires a different approach than on dynamic sites (like WordPress or Node.js apps). I needed a solution that was secure and reliable, respected user privacy, and didn't require me to manage a backend just for email signups.
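As a rough illustration of the server-side half of that setup, here is a minimal sketch of verifying a Cloudflare Turnstile token before accepting a signup. It assumes a Python serverless function and the requests library; the TURNSTILE_SECRET_KEY variable name and the surrounding handler are placeholders, though the siteverify endpoint is Cloudflare's documented one.

```python
import os

import requests

TURNSTILE_VERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"

def verify_turnstile(token: str, remote_ip: str | None = None) -> bool:
    """Check the Turnstile token server-side; the secret key never
    reaches the browser, only the serverless function sees it."""
    payload = {"secret": os.environ["TURNSTILE_SECRET_KEY"], "response": token}
    if remote_ip:
        payload["remoteip"] = remote_ip
    resp = requests.post(TURNSTILE_VERIFY_URL, data=payload, timeout=10)
    resp.raise_for_status()
    return bool(resp.json().get("success"))

# Only after this returns True does the handler forward the email address
# to the newsletter provider.
```

The token arrives with the form POST from the static page, so the only secret-bearing code runs in the serverless API, which is what lets the MkDocs site itself stay fully static.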

Why I Chose MkDocs for the Blog

Building Finfluencers Trade is a journey, and I'm committed to documenting it openly – the technical challenges, research findings, and strategic pivots. This blog is central to that "Building in Public" approach.

Documenting this journey requires a blog that's easy to manage and integrates with my development workflow. After considering different options, I chose MkDocs with the Material theme because its focus on Markdown, static site generation, and developer experience perfectly matched my needs for the Finfluencers Trade blog.