
2025

LLM Quality vs Cost Evaluation Study

When building with Large Language Models, achieving top-tier quality is often the primary goal. For any serious application, sacrificing quality for cost isn't an option. But is it always necessary to pay a premium for the 'best' and brightest model without a clear understanding of the costs?

To find a truly optimal solution, it's essential to evaluate different models to see if comparable quality can be achieved more cost-effectively. That's exactly what I did in my study. I ran a systematic experiment to find the true relationship between quality, reliability, and cost for a complex, large-context task.

Large language models are expensive to run at scale. For my speaker attribution task on 45-minute podcast episodes, I needed to know: can I get good quality for less money? I tested 7 LLMs across 4 diverse episodes to find out. What I discovered wasn't just about cost optimization: it revealed that multi-episode validation is essential for avoiding production disasters.

Key Findings

  • Multi-episode testing is critical: Models that performed perfectly on one episode failed completely on others (same settings, different content)
  • Best for reliability: Claude Sonnet 4.5 (92.2% average accuracy, zero variance across episodes, $0.23/episode)
  • Best for peak quality: GPT-5 (91.0% average accuracy, reaches 93-97% at high token limits, $0.27/episode)
  • Best cost-performance: Grok 4 Fast Reasoning (89.0% average accuracy, 97% cheaper than premium models, $0.007/episode)
  • Avoid for production: Gemini 2.5 Flash (35.0% average accuracy with wildly unpredictable 0-93% range despite low cost)
  • The max_output_tokens parameter matters: Each model has a distinct behavioral pattern. Some plateau early, others need room to "breathe"
  • Cost optimization is possible: Up to 97% cost reduction while maintaining acceptable quality, but only with careful multi-episode validation (sketched below)
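
To make the multi-episode validation idea concrete, here is a minimal sketch of how per-episode scoring can expose unreliable models. The episode list, model names, and the `attribute_speakers` helper are hypothetical placeholders, not the actual evaluation harness used in the study.

```python
# Sketch: score every candidate model on every episode and look at the spread,
# not just the mean, before trusting a cheap model. All names are illustrative.
from statistics import mean

EPISODES = ["ep_01", "ep_02", "ep_03", "ep_04"]  # 4 diverse test episodes
MODELS = ["claude-sonnet-4.5", "gpt-5", "grok-4-fast", "gemini-2.5-flash"]

def attribute_speakers(model: str, episode: str, max_output_tokens: int) -> float:
    """Placeholder: run speaker attribution and return accuracy in [0, 1]."""
    raise NotImplementedError

def evaluate(models, episodes, max_output_tokens=8192):
    report = {}
    for model in models:
        scores = [attribute_speakers(model, ep, max_output_tokens) for ep in episodes]
        report[model] = {
            "mean_accuracy": mean(scores),
            "worst_episode": min(scores),         # one bad episode can disqualify a model
            "spread": max(scores) - min(scores),  # high spread = unreliable despite a good mean
        }
    return report
```

Looking at the worst episode and the spread, rather than the mean alone, is what separates a reliably cheap model from one that merely got lucky on a single test episode.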

Finfluencers Directory Got an Upgrade

When I first launched the Finfluencers Directory, it was just a simple list. I've just rolled out the first major upgrade to make it a much better, more user-friendly tool.

Now, you can filter, search, and quickly navigate the 30+ sources to find exactly what you're looking for.

How I Apply Spec-Driven AI Coding

LLMs code better when they focus on a single task at a time instead of trying to solve multiple issues in your codebase at once. Carl Rannaberg recently introduced a plan-based AI coding workflow in his article, "My current AI coding workflow", where LLMs first use a planner phase to create a task plan for the feature you are developing, and an executor phase then works through the plan, generating code task by task. I used the method for a while and liked it a lot.

Now, there is a new kid on the block, Kiro.dev from AWS, that goes even further by allowing the planner mode to first create the requirements spec, then the design, and only after that, the tasks list. As I'm still on Kiro's waitlist, I applied the methodology as a unified workflow for all the coding assistants at my disposal: Cursor, Claude, and Gemini.

I've put the framework up on GitHub at https://github.com/andreskull/spec-driven-ai-coding
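
To illustrate the executor phase described above, here is a minimal sketch that feeds a model one task at a time from a pre-written plan. The task-file format and the `call_coding_assistant` helper are hypothetical, and the actual framework in the repo may organize things differently.

```python
# Sketch: read a Markdown task list and send the model one focused task at a time.
from pathlib import Path

def load_tasks(plan_path: str) -> list[str]:
    """Read a Markdown task list, one '- [ ] ...' checkbox item per task."""
    lines = Path(plan_path).read_text().splitlines()
    return [line.removeprefix("- [ ] ").strip() for line in lines if line.startswith("- [ ]")]

def call_coding_assistant(prompt: str) -> str:
    """Placeholder for a call to Cursor, Claude, or Gemini."""
    raise NotImplementedError

def execute_plan(plan_path: str) -> None:
    for i, task in enumerate(load_tasks(plan_path), start=1):
        # Keep the model focused on a single task instead of the whole feature.
        prompt = f"Task {i}: {task}\nImplement only this task; do not touch unrelated code."
        print(call_coding_assistant(prompt))
```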

Critique of the 2024 paper "Highly Regarded Investors? Mining Predictive Value from the Collective Intelligence of Reddit's WallStreetBets" by Buz et al.

Introduction

Can the collective chatter of online forums like Reddit's r/WallStreetBets (WSB) actually predict stock market moves? This question challenges the long-held Efficient Market Hypothesis (EMH), which states that all public information is already baked into stock prices. The explosion of communities like WSB, capable of shaping investor sentiment at lightning speed, suggests that behavior-driven opportunities might exist, at least temporarily. The real challenge is scientifically separating a true predictive signal from all the noise.

This blog post critically reviews the 2024 paper, "Highly Regarded Investors? Mining Predictive Value from the Collective Intelligence of Reddit's WallStreetBets" by Buz et al. While the paper builds a profitable trading model, I argue its methodology is fundamentally unsuited to prove that WSB data alone has predictive power. My analysis will show that the model's impressive results are overwhelmingly driven by traditional financial data—specifically investment bank ratings and price history—not the unique "collective intelligence" of Reddit. The paper doesn't isolate a new signal; it validates a hybrid strategy where social media acts as a trigger within a classic quant framework.

Highly Regarded Investors? Mining Predictive Value from the Collective Intelligence of Reddit's WallStreetBets, Buz, T., Schneider, M., Kaffee, L. A., & de Melo, G. (2024)

⭐⭐⭐

Paper: Buz, T., Schneider, M., Kaffee, L. A., & de Melo, G. (2024). Highly Regarded Investors? Mining Predictive Value from the Collective Intelligence of Reddit's WallStreetBets. ACM Web Science Conference. Link: https://doi.org/10.1145/3614419.3643993

A detailed study that analyzes 1.6 million WallStreetBets posts to see if there's real predictive value hidden in the memes and "YOLO" trades. The authors use machine learning to determine if the "collective intelligence" of this retail investor army can actually beat the market.

The Trolls of Wall Street: How the Outcasts and Insurgents Are Hacking the Markets by Nathaniel Popper (2024)

⭐⭐⭐⭐⭐

Book: The Trolls of Wall Street: How the Outcasts and Insurgents Are Hacking the Markets by Nathaniel Popper
Link: Goodreads

Popper traces the complete history of the WallStreetBets subreddit from its creation through all the major events and sagas that shaped it. While the GameStop saga is the most famous episode, he walks you through the entire evolution of this chaotic community and the various market-moving events it spawned. This isn't just another retelling of meme stock madness - it's a detailed look at the people and culture behind one of the most disruptive financial movements in recent history.

A Practical Guide to Deploying Dagster Hybrid on GCP

In my previous post, "How I Built a Modern Data Orchestration Layer for Finfluencers.trade", I explained why I chose Dagster for my project — a platform that transforms financial content into structured, auditable data. The core challenge I wanted to solve was creating complete traceability and lineage for every piece of data, from raw podcast transcripts and article text to final predictions on the website.

This post dives into the how — a detailed, technical guide for setting up a Dagster Hybrid deployment on Google Cloud Platform (GCP).

How I Built a Traceable Data Pipeline for Finfluencers.trade

The core of my project, Finfluencers.trade, is turning a chaotic stream of public content—podcasts, videos, and tweets—into a structured, auditable database of financial predictions. To do this reliably, I needed a rock-solid system for managing how data moves and transforms.

This post explains how I set up that system—my data pipeline's orchestration—from day one.
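
As a flavor of what that looks like in Dagster, here is a minimal sketch where each transformation step is declared as an asset that depends on its upstream asset, so lineage from raw transcript to final prediction is tracked by the orchestrator. The asset names and bodies are illustrative placeholders, not the production pipeline.

```python
# Sketch: two Dagster assets where the downstream asset declares its upstream
# dependency by parameter name, giving traceable lineage for free.
from dagster import asset

@asset
def raw_transcript() -> str:
    """Raw podcast transcript pulled from storage (placeholder)."""
    return "...transcript text..."

@asset
def predictions(raw_transcript: str) -> list[dict]:
    """Structured predictions extracted from the transcript (placeholder)."""
    return [{"ticker": "XYZ", "direction": "up", "source": "raw_transcript"}]
```

Because `predictions` declares `raw_transcript` as an input, the dependency is recorded automatically, which is exactly the traceability property the post is about.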

The Finfluencer Mirage

Navigating the world of online financial advice can feel like searching for a life raft in a stormy sea. Everywhere you turn, "finfluencers" promise quick tips and pathways to prosperity to their vast digital followings. But what if the loudest voices aren't the wisest? What if popularity is a misleading beacon, potentially luring unsuspecting investors towards financial reefs instead of safe harbors?