Magpie

Comment intelligence for YouTube creators

Magpie tells YouTube creators what their audience is actually asking for by reading the comment sections of their own videos and of a handful of competitor channels in the same niche. The output is a ranked list of video ideas, each with a demand-strength score, a rationale, and the supporting quotes. Ideas are split into two groups: "what your audience wants" (drawn from your own videos) and "what the niche is asking for" (drawn from your competitors).

The problem and why it mattered

Creators decide what to make next from a blurry mix of YouTube Studio metrics, gut feel, and whichever comments happened to be loud the day they checked. The result is predictable: high-effort videos that miss demand sitting in plain sight in their own comment sections.

The signal is there. The comments under a creator's videos contain explicit requests, repeated questions, and unmet curiosity. So do the comments under their competitors' videos — often more visibly, because viewers in the same niche are watching multiple creators and asking each of them slightly different versions of the same question. But the signal is buried in thousands of comments per video, and almost nobody reads them systematically. The largest creators have community managers. Everyone else is flying blind.

Magpie is built for the long middle of YouTube: creators with enough audience that comment volume matters but not enough team to read every comment. The product question is whether AI can do for a single creator what a community manager does for a large one: surface what the audience is actually asking for, with evidence, in time to act on it. The hypothesis Magpie tests is that narrow AI tools, calibrated to a specific niche, can produce better recommendations than general-purpose tools that try to serve every creator at once.

That distinction matters more than it looks. A general chat tool given the same comments would produce reasonable-sounding ideas that read well but ignore what's specific to the niche: a woodworking creator and a calisthenics creator would get back recommendations that could be swapped between them without anyone noticing. Magpie's bet is that niche-specific calibration is what separates an AI tool that produces output a creator can act on from one that produces output a creator has to filter. Every downstream architectural decision in this case study (the niche-creation agent, the calibrated vocabulary, the per-niche demand thresholds, the held-job workflow) is in service of that bet.

A note on competition. There are tools in this space, most notably Ask Studio. They analyze a creator's own comments, broadly, for any creator. Magpie's position is different on two specific axes: it analyzes the creator's videos alongside their competitors' in the same niche — so the report distinguishes "what your audience wants from you" from "what the niche is asking for that you could uniquely deliver" — and it tunes the analysis per niche, with calibrated vocabulary, demand thresholds, and signal patterns specific to woodworking versus calisthenics versus mechanical keyboards. The question Magpie is trying to answer isn't "can we build a comment analyzer?" — that exists. It's "can a small, niche-tuned, dual-source tool produce output a working creator would actually use?"

What I built

Magpie is a deployed, production system — not a prototype. The components:

Backend (FastAPI on Render, Supabase Postgres): job orchestration, the analysis pipeline, the niche-creation agent, and the audit-log infrastructure. ~3,200 lines of Python across the engine.
Frontend (TanStack Start on Lovable): the form, the report renderer, the PDF export. Fully wired to the backend.
The analysis pipeline: pre-filter (rules, no LLM) → classify (Haiku 4.5, intent labeling) → ideate (Opus 4.6, ranked brief generation), with bounded-concurrent fan-out under per-provider semaphores. The model split is deliberate: cheap models for the high-volume labeling work, the expensive reasoning model only for the small number of calls where quality drives output.
The niche-creation agent (Sonnet 4.6): a synchronous LLM-driven workflow that takes free-text niche input from users, matches against a canonical set, and calibrates new niches when no match exists. Persists into a two-table schema — dynamic_niches for the operational profile, niche_calibrations for the append-only audit log.
Production observability: phase-level and per-batch [PERF] markers throughout the pipeline, plus the niche_calibrations audit table that captures every calibration attempt (success or failure) with cost, duration, model, and full raw agent output.

The three-tier model routing (Haiku for labeling, Opus for ideation, Sonnet for the agent) is one of several decisions where the architecture choice and the product choice are the same choice. Decisions 4.2 through 4.5 each unpack one of those.
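The stage-to-model split described above can be pictured as a small routing table sitting next to a rule-based pre-filter. The sketch below is illustrative only and is not Magpie's code: the `MODEL_BY_STAGE` mapping, the stage names, the `MIN_CHARS` threshold, and the two pre-filter rules are all assumptions, and the model labels are the informal names used in this case study rather than API identifiers.

```python
import re

# Hypothetical routing table: pipeline stage -> model label from this case study.
MODEL_BY_STAGE = {
    "classify": "Haiku 4.5",    # high-volume intent labeling
    "ideate": "Opus 4.6",       # few calls, quality-critical
    "calibrate": "Sonnet 4.6",  # niche-creation agent
}

MIN_CHARS = 15  # assumed threshold, not Magpie's real value


def pre_filter(comments: list[str]) -> list[str]:
    """Rule-based pass (no LLM): drop comments that are too short or wordless."""
    kept = []
    for text in comments:
        stripped = text.strip()
        if len(stripped) < MIN_CHARS:
            continue  # too short to carry a request
        if not re.search(r"[A-Za-z]", stripped):
            continue  # emoji-only or punctuation-only
        kept.append(stripped)
    return kept


def model_for(stage: str) -> str:
    """Look up which model a pipeline stage should call."""
    return MODEL_BY_STAGE[stage]
```

The point of keeping the pre-filter free of model calls is that every comment it drops is a comment the labeling model never has to be paid to read.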

Figure: Magpie system architecture. Three-lane view showing the TanStack Start frontend, the bounded-concurrency FastAPI backend pipeline on Render, and the niche-creation agent with its two-table persistence. The dashed line is the calibration loop.

`niche_calibrations` audit row — mechanical keyboards

An actual row from the production audit table. Captured 2026-05-25 07:58:42 UTC.

id: c7865fb2-9565-438e-8d27-b11eb95a2c91
source_input: mechanical keyboards (what the user typed)
normalized_input: mechanical keyboards
resolved_slug: mechanical_keyboards
status: success
failure_reason: null
agent_model: claude-sonnet-4-6
cost_usd: $0.0332
duration_ms: 27,023 (≈27 seconds, single Sonnet pass — no critic-loop retry)
critic_loop_enabled: false

agent_output.technical_vocabulary — 138 items. First twenty:

linear, tactile, clicky, actuation force, pre-travel, total travel,
bottom out, spring swap, lubing, lube, 205g0, krytox, tribosys 3203,
tribosys 3204, dielectric grease, switch film, filming, stem, housing,
top housing, ...

agent_output.prompt_context — the niche-specific reasoning passed to downstream ideation:

This is mechanical keyboard content. Viewers range from beginners buying their first board to enthusiasts deep in the hobby who participate in group buys, mod their switches, and tune stabilizers. Demand signals typically fall into: (1) product review or comparison requests naming specific boards, switches, or keycap sets ('can you review the Zoom65?', 'GMK vs PBT comparison'), (2) build or mod tutorial requests ('how do you lube switches?', 'walk me through the tape mod'), (3) budget or beginner guide requests ('best board under $100', 'beginner guide to lubing'), and (4) sound/feel comparison requests between specific switches or layouts. Be aware that this niche generates a high volume of aesthetic and vibe comments ('so satisfying', 'that thock though', 'ASMR') — these are noise, not demand signals, even when they dominate a comment section. Real demand signals name specific products, techniques, or comparisons.

agent_output.aliases — emitted by the agent, used by the matcher to short-circuit future near-duplicate submissions:

mechanical keyboards · mech keys · custom keyboards · keyboard hobby · keeb

Why this artifact matters: this row is the system's memory of one agent run, captured at the moment it ran, queryable forever. The same row pattern exists for every calibration attempt — successful or not — with the failure mode, the partial output, and the cost recorded. That's the audit table doing its job. The next time I sit down to iterate the calibration prompt, the question "what does my agent actually produce?" is answered by SELECT agent_output FROM niche_calibrations, not by guesswork.
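As a rough illustration of that query, the sketch below uses an in-memory SQLite database as a stand-in for Supabase Postgres. The column names come from the row shown above. The column types, the integer `id`, the `record_attempt` and `outputs_for` helpers, and the sample failure row are assumptions made for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE niche_calibrations (
        id INTEGER PRIMARY KEY,      -- the real table uses a UUID
        source_input TEXT NOT NULL,
        resolved_slug TEXT,
        status TEXT NOT NULL,
        failure_reason TEXT,
        agent_model TEXT NOT NULL,
        cost_usd REAL,
        duration_ms INTEGER,
        agent_output TEXT            -- raw agent output, stored as JSON text
    )
    """
)


def record_attempt(**row) -> None:
    """Append one row per calibration attempt. Rows are never updated."""
    row["agent_output"] = json.dumps(row.get("agent_output"))
    cols = ", ".join(row)  # keys are trusted, hard-coded column names
    marks = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO niche_calibrations ({cols}) VALUES ({marks})",
        tuple(row.values()),
    )


def outputs_for(slug: str) -> list:
    """Answer 'what does my agent actually produce?' for one niche."""
    rows = conn.execute(
        "SELECT agent_output FROM niche_calibrations WHERE resolved_slug = ?",
        (slug,),
    ).fetchall()
    return [json.loads(r[0]) for r in rows]


record_attempt(
    source_input="mechanical keyboards",
    resolved_slug="mechanical_keyboards",
    status="success",
    agent_model="claude-sonnet-4-6",
    cost_usd=0.0332,
    duration_ms=27023,
    agent_output={"aliases": ["mech keys", "keeb"]},
)
record_attempt(
    source_input="asdf",                 # hypothetical failed attempt
    status="failure",
    failure_reason="validation_failed",  # hypothetical reason code
    agent_model="claude-sonnet-4-6",
)
```

Calling `outputs_for("mechanical_keyboards")` returns the stored agent output for that niche, and a `WHERE status = 'failure'` filter on the same table surfaces every failed attempt alongside its recorded reason.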

The technical surface is interesting, but the decisions behind it are what this case study is actually about. Six of those decisions follow.

Key decisions and tradeoffs

The six decisions below were chosen because each answers a different question about the kind of builder I am. They sit inside a longer decision log I've kept since the project started; the link to the full log is at the end. These are the ones I'd want a hiring manager to read.

Each decision has a one-paragraph summary; expand any thread to see the full reasoning, the tradeoffs considered, and the principle it codified.

Designing the niche-creation agent end-to-end

A three-layer system (canonical profiles → matching layer → Sonnet calibration agent) that turns free-text niche input into structured business knowledge. Four production-AI patterns inside the agent: synchronous in-job execution, A/B-ready critic loop, harvest-from-loaded-context aliases, and specific-feedback retry on validation failures.
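The match-then-calibrate flow can be sketched in a few lines. This is a simplified illustration under stated assumptions: the real matching layer is likely fuzzier than the exact alias lookup used here, and `ALIAS_INDEX`, `normalize`, `resolve_niche`, and the `calibrate` callback are hypothetical names. The aliases themselves are the ones from the audit row shown earlier.

```python
import re
from typing import Callable

# Alias index: normalized alias -> canonical slug.
ALIAS_INDEX = {
    "mechanical keyboards": "mechanical_keyboards",
    "mech keys": "mechanical_keyboards",
    "custom keyboards": "mechanical_keyboards",
    "keyboard hobby": "mechanical_keyboards",
    "keeb": "mechanical_keyboards",
}


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def resolve_niche(
    raw: str,
    calibrate: Callable[[str], tuple[str, list[str]]],
) -> tuple[str, bool]:
    """Return (slug, was_calibrated). The agent is only called on a miss."""
    key = normalize(raw)
    if key in ALIAS_INDEX:
        return ALIAS_INDEX[key], False
    # No match: run the (expensive) calibration agent once.
    slug, aliases = calibrate(key)
    # Index the new aliases so near-duplicate inputs short-circuit next time.
    for alias in [key, *aliases]:
        ALIAS_INDEX[normalize(alias)] = slug
    return slug, True
```

In this sketch `resolve_niche("  Mech   Keys ", agent)` resolves from the index without calling `agent`, while an unseen niche triggers exactly one calibration and then becomes a cheap lookup.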

The append-only audit log as a design choice, not a retrofit

Two tables separated by *purpose*: one the system uses, one I use. Every calibration attempt — success or failure — writes a row with model, cost, duration, and the full raw agent output. The cost is one INSERT; the payoff is that "how is my agent actually behaving?" becomes a SQL query instead of a guess.

Empirical validation bounds, and the third time I asked the right question

A schema field had its upper bound tightened or loosened three times before I asked the structural question — *why is there a hard ceiling at all?* The schema was doing editorial work it shouldn't have been doing. Removing the cap and moving quality enforcement into the prompt unlocked the case study's hero artifact.
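One way to picture the result is a validator with a floor but no ceiling, paired with the specific-feedback retry mentioned earlier. The sketch below is an assumption-heavy illustration: it supposes, for the example only, that the field in question is `technical_vocabulary`, and `MIN_VOCAB`, `validate`, and `calibrate_with_retry` are hypothetical names and values.

```python
from typing import Callable

MIN_VOCAB = 20  # assumed floor; there is deliberately no upper bound


def validate(output: dict) -> list[str]:
    """Return specific, human-readable problems. An empty list means valid."""
    problems = []
    vocab = output.get("technical_vocabulary", [])
    if len(vocab) < MIN_VOCAB:
        problems.append(
            f"technical_vocabulary has {len(vocab)} items; need at least {MIN_VOCAB}"
        )
    if len(set(vocab)) != len(vocab):
        problems.append("technical_vocabulary contains duplicates")
    if not output.get("prompt_context", "").strip():
        problems.append("prompt_context is empty")
    return problems


def calibrate_with_retry(
    agent: Callable[[str, list[str]], dict],
    niche: str,
    max_attempts: int = 2,
) -> dict:
    """Call the agent, feeding the exact validation failures into the retry."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        output = agent(niche, feedback)
        feedback = validate(output)
        if not feedback:
            return output
    raise ValueError("calibration failed: " + "; ".join(feedback))
```

With no ceiling in `validate`, a 138-item vocabulary passes untouched, and judgments about which terms are worth keeping stay in the prompt instead of the schema.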

Bounded-concurrency fan-out, and knowing when to stop optimizing

The Anthropic classification fan-out became `asyncio.gather` under a `Semaphore(6)`, a designed pattern any future third-party integration can land on. The two Opus ideation calls also run in parallel via `asyncio.gather`. Per-batch instrumentation later revealed silent Anthropic ITPM (input tokens per minute) throttling on the lower API tier. The right move was to ship at the cap, document the ceiling, and stop optimizing rather than re-architect around a third-party constraint.
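The bounded fan-out pattern looks roughly like this. It is a minimal sketch, not the production code: `classify_batch` is a stand-in that sleeps instead of calling an API, and `classify_all`, its `worker` parameter, and `MAX_IN_FLIGHT` are names chosen for the example. Only `asyncio.gather` and the limit of 6 come from the description above.

```python
import asyncio
from typing import Awaitable, Callable

MAX_IN_FLIGHT = 6  # matches the Semaphore(6) described above


async def classify_batch(batch: list[str]) -> list[str]:
    """Stand-in for one classification call; a real one would hit the API."""
    await asyncio.sleep(0.01)
    return ["request" if "?" in comment else "other" for comment in batch]


async def classify_all(
    batches: list[list[str]],
    worker: Callable[[list[str]], Awaitable[list]] = classify_batch,
    limit: int = MAX_IN_FLIGHT,
) -> list[list]:
    """Run every batch concurrently, with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def bounded(batch: list[str]) -> list:
        async with sem:  # waits here once `limit` calls are already running
            return await worker(batch)

    # gather returns results in the same order as the input batches.
    return await asyncio.gather(*(bounded(b) for b in batches))
```

Running `asyncio.run(classify_all(batches))` schedules every batch at once but lets only six proceed at a time, so the concurrency ceiling is a single number to tune when a provider's rate limit changes.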

Surfacing a silent failure I caught in my own production system

While auditing the hero report, I noticed that five competitor channels had been submitted but only three were covered, with no explanation anywhere. The pipeline had a bare `except: print; continue` swallowing failures silently. The fix has two layers: pre-flight validation rejects bad handles at submit time, and in-job status tracking captures runtime skips. Trust is a one-way ratchet.
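The two layers can be sketched as follows. Everything here is illustrative: the `HANDLE_RE` pattern is only an assumed handle shape (real pre-flight validation would more plausibly confirm that the channel exists), and `preflight`, `Coverage`, and `fetch_all` are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

HANDLE_RE = re.compile(r"^@[A-Za-z0-9._-]{3,30}$")  # assumed handle shape


def preflight(handles: list[str]) -> None:
    """Layer 1: reject malformed handles at submit time, before any work runs."""
    bad = [h for h in handles if not HANDLE_RE.match(h)]
    if bad:
        raise ValueError(f"invalid channel handles: {bad}")


@dataclass
class Coverage:
    covered: list[str] = field(default_factory=list)
    skipped: dict[str, str] = field(default_factory=dict)  # handle -> reason


def fetch_all(handles: list[str], fetch: Callable[[str], object]) -> Coverage:
    """Layer 2: record every runtime skip instead of swallowing it."""
    report = Coverage()
    for handle in handles:
        try:
            fetch(handle)
        except Exception as exc:
            # The reason travels with the job so the report can show it.
            report.skipped[handle] = f"{type(exc).__name__}: {exc}"
            continue
        report.covered.append(handle)
    return report
```

The difference from the bare `except` is small in code and large in effect: the loop still continues past a failing channel, but the report can now say which channels were skipped and why.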

The held-job workflow: honest failure over silent miscalibration

When niche calibration fails, the choice is "ship a fallback report from a generic profile" or "hold the job and tell the user." Fallback is friendlier on the surface and silently wrong — the user can't see they got a miscalibrated report. Held-with-explanation looks like a UX choice but is really a trust choice.
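In code, the held-job choice amounts to refusing to fall through to a generic profile. The sketch below is a minimal illustration with hypothetical names throughout (`JobStatus`, `CalibrationError`, `Job`, `start_job`); the message wording is invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class JobStatus(str, Enum):
    RUNNING = "running"
    HELD = "held"


class CalibrationError(Exception):
    """Raised when the niche-creation agent cannot produce a valid profile."""


@dataclass
class Job:
    niche_input: str
    status: JobStatus = JobStatus.RUNNING
    profile: Optional[dict] = None
    user_message: Optional[str] = None


def start_job(job: Job, calibrate: Callable[[str], dict]) -> Job:
    """Calibrate first; on failure, hold the job rather than use a fallback."""
    try:
        job.profile = calibrate(job.niche_input)
    except CalibrationError as exc:
        # No generic profile is substituted: the job stops and says why.
        job.status = JobStatus.HELD
        job.user_message = (
            f"We couldn't calibrate '{job.niche_input}' ({exc}). "
            "Your job is on hold and no report was generated."
        )
    return job
```

A held job never reaches ideation, so there is no path by which a miscalibrated report can be delivered looking like a normal one.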

What I learned

Six durable principles, drawn from the work above and from a longer set of decisions in the full log:

Observability is a design choice, not a retrofit. Build the audit log on day one. The cost is one extra database insert per significant action; the payoff is that every "why did the system do that?" question becomes a SQL query rather than a debugging expedition.

Independent I/O is concurrent until proven otherwise. It is sequential fan-out that has to be justified, not parallelism. And every scope increase on a fan-out is a re-check moment for whether the previous concurrency choice still holds.

When you iterate a parameter in the same direction for the third time, the parameter isn't the bug. The primitive containing it probably is. The third iteration is the signal to step up a level of abstraction.

LLM validation bounds are empirical, not theoretical — when they exist at all. You discover the right shape by running real calibrations against real inputs and observing what the model produces, not by reasoning about it in a spec. Sometimes the right answer is to remove the bound entirely and move enforcement to a different layer.

Audit the system from the user's seat, not the developer's. Read your own output the way a user would. The pipeline can be working correctly in every internal log and still produce a report that lies about what it covered. The only check that catches that gap is the audit from outside.

Fail loudly when the system can't do its job well. The choice between "ship a degraded response" and "surface the failure honestly" is a trust choice. Silent miscalibration in AI products is the worst failure mode because the user can't see it.

What I'd build into the next project from day one: the audit log, the bounded-concurrency pattern, the visible-failure infrastructure, and the schema-design-before-API discipline. Each of these is now a default for me rather than something I reach for when a specific problem demands it. The project paid the cost of learning them; the next system gets the benefit for free.

Links to deeper artifacts

For visitors who want to go deeper than this case study:

Try Magpie live — the working product. Free on any of the three calibrated niches; will calibrate a new niche on first attempt.
See an example report — the mechanical keyboards run referenced throughout this case study.
Source code — the backend (FastAPI) and frontend (TanStack Start) live in a private repository while the product is in active development. Available on request for interview contexts.