We Built MacParakeet With AI Agents. Here's What Kept Them on Track.
AI coding agents are fast but they drift. After building a real Mac app with Claude Code, here's the method that kept our agents writing correct code.
AI coding agents are productive and confident. Unfortunately, these traits are uncorrelated with correctness.
If you’ve used Claude Code, Cursor, or Copilot for real work — not a tutorial, not a toy, but software you ship — you’ve watched an agent make a change that looks right, passes review, and quietly breaks something three files away. The agent had no reason to suspect a problem. It made the most plausible edit given what it could see.
That’s the core issue. Not capability. Context.
We built MacParakeet — a Mac voice app with SwiftUI, SQLite persistence, a CLI, and 360+ tests — primarily with AI coding agents. The method that kept them in bounds is straightforward and transferable.
Drift
Hand an agent a task like “add YouTube URL transcription” and it faces hundreds of micro-decisions. Where does download logic live? What errors does it throw? How does progress reporting work? What happens on interruption? Does it reuse the existing data model or create a new one?
Each decision is sampled from a distribution over the agent’s context window. Rich, specific context produces a tight distribution — most probability mass lands on correct choices. Vague context produces a wide one, and the agent fills gaps with plausible guesses drawn from training data rather than your codebase.
These guesses aren’t random. They look like sensible defaults that happen not to match your architecture. A new error type that breaks from your existing hierarchy. A database column in a slightly different format. A state machine that’s almost right but skips a transition the rest of the system expects.
Any single guess might be harmless. Dozens of them, across weeks of development, produce a codebase that works but doesn’t cohere. That’s drift — the central failure mode of AI-assisted development.
Context as probability control
The fix isn’t a better model or a cleverer prompt. It’s tighter context.
Research supports this. LLM performance degrades when relevant information is buried in long, unstructured input — the “lost in the middle” effect. Prompt structure and placement materially affect output quality (Anthropic’s long-context guidance). For software tasks, simpler structured workflows outperform complex autonomous pipelines.
The principle: give the agent exactly the information it needs — requirements, contracts, invariants — and nothing that dilutes attention.
We call this a context zone: the bounded set of behaviors a change is allowed to affect. Before any behavior-changing task, the zone defines:
- What’s in scope — specific requirements with testable acceptance criteria
- What must not change — invariants existing code depends on
- What’s out of scope — behaviors this change should not touch, even if related
- How to verify — tests that prove the change stayed in bounds
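As a concrete sketch, a context zone for the YouTube feature might be written down like this. The field names and file references here are illustrative, not MacParakeet's actual schema:

```yaml
# Hypothetical context zone — field names are illustrative
zone: youtube-url-transcription
in_scope:
  - REQ-F11-001: transcribe a valid YouTube URL to a completed transcription
must_not_change:
  - the existing file-based transcription flow
  - the transcription database schema
out_of_scope:
  - playlist support, even though the downloader could handle it
verify:
  - focused URL-transcription tests, then the full suite
```

A dozen lines like these, pasted at the top of the task, are what narrows the distribution before the agent writes any code.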
The spec kernel
A small set of structured artifacts — a few YAML files and a markdown table — serves as the implementation authority for agents. Fits in one directory. Takes an afternoon to set up.
Requirements — what the system must do, in terms an agent can verify:
- id: REQ-F11-001
  title: YouTube URL transcription succeeds for valid YouTube URLs
  priority: p0
  status: active
  acceptance:
    - Given a valid YouTube URL, when transcribeURL is called,
      then a completed transcription is returned
    - Progress emits download and transcription percentages
Contracts — what a function accepts, returns, and guarantees:
name: transcribe_url
input:
  url: string
  onProgress: optional_callback
output:
  transcription: { id: uuid, status: completed }
errors: [invalid_url, video_not_found, download_failed, timed_out]
invariants:
  - sourceURL must equal request url
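A contract like this is mechanically checkable. Here's a minimal sketch in Python of what "verify against the contract" can mean; the `TranscriptionResult` type and field names are assumptions for illustration, not the app's actual API:

```python
# Sketch: check a result against the transcribe_url contract.
# TranscriptionResult and its fields are illustrative, not MacParakeet's types.
from dataclasses import dataclass

ALLOWED_ERRORS = {"invalid_url", "video_not_found", "download_failed", "timed_out"}

@dataclass
class TranscriptionResult:
    id: str
    status: str
    source_url: str

def check_contract(request_url: str, result: TranscriptionResult) -> list[str]:
    """Return a list of contract violations (empty means the result conforms)."""
    violations = []
    if result.status != "completed":
        violations.append(f"status is {result.status!r}; contract requires 'completed'")
    if result.source_url != request_url:
        # invariant from the contract: sourceURL must equal request url
        violations.append("sourceURL does not equal the request url")
    return violations
```

A check like this can run inside the test suite, so the agent finds out immediately when an edit violates an invariant it never read.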
State machines — legal transitions. The agent can’t invent states or skip steps:
name: dictation_flow
initial: idle
states: [idle, recording, processing, success, error]
transitions:
  - { from: idle, event: start_recording, to: recording }
  - { from: recording, event: stop_recording, to: processing }
  - { from: processing, event: stt_ok, to: success }
  - { from: processing, event: stt_fail, to: error }
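The transition table above is enforceable with a few lines of code. A minimal sketch of such a checker (ours is in Swift; this Python version is illustrative):

```python
# Sketch: enforce the dictation_flow state machine.
# Any event not listed in the spec raises instead of silently inventing a state.
TRANSITIONS = {
    ("idle", "start_recording"): "recording",
    ("recording", "stop_recording"): "processing",
    ("processing", "stt_ok"): "success",
    ("processing", "stt_fail"): "error",
}

def run(events, state="idle"):
    """Apply events in order; raise on any transition the spec doesn't allow."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {event!r} from state {state!r}")
        state = TRANSITIONS[key]
    return state
```

With this in the test suite, an agent that skips `processing` and jumps straight from `recording` to `success` gets an immediate, specific failure rather than a subtle runtime bug.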
Traceability — a table mapping each requirement to its implementation file and tests.
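The useful property of that table is that it's queryable: a script can flag any requirement with no mapped test. A sketch, with hypothetical requirement IDs and file names:

```python
# Sketch: flag requirements that have no mapped test.
# The second requirement ID and both file names are invented for illustration.
TRACEABILITY = {
    "REQ-F11-001": {"impl": "URLTranscriber.swift", "tests": ["TranscribeURLTests"]},
    "REQ-F12-004": {"impl": "Exporter.swift", "tests": []},  # gap: no test mapped
}

def untested(trace):
    """Return requirement IDs with an empty test list."""
    return [req for req, row in trace.items() if not row["tests"]]
```

Running a check like this in CI turns "every behavior change gets a test" from a convention into an enforced rule.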
The difference between “write a YouTube transcription feature” and “implement REQ-F11-001 per the transcribe_url contract” is the difference between a wide distribution and a narrow one. The artifacts are small. The effect on code quality is not.
Bounded discretion
Not every edit needs this. That would be bureaucracy.
The rule: behavior changes go through the kernel. Everything else doesn’t.
Renaming a variable — just do it. Formatting code — just do it. Fixing a comment — just do it. Adding a feature that changes what the app does — define the context zone first, then implement.
Maybe 20-30% of edits touch the kernel. The rest are quick changes where the agent’s judgment is fine. This preserves the speed of AI-assisted development while protecting the changes that matter.
Fast feedback
Context tells the agent what to do. Feedback tells it whether it worked.
We designed the codebase for agent self-verification. The test suite runs in seconds. A CLI smoke-tests core services without launching the GUI. Build errors surface immediately. Database operations test against in-memory SQLite.
The principle: if an agent can’t confirm its change works, the change is incomplete. Every behavior change gets a mapped test. The agent runs focused tests during development and the full suite before finishing.
This is the ReAct pattern applied to engineering: act, observe, adjust. An agent that generates code without testing is guessing. An agent that generates, tests, sees a failure, and fixes it is engineering.
What transfers
The YAML schemas are ours. The ideas aren’t.
Scope before code. Write down what’s in scope, what’s out, and what must not change. Three sentences in the prompt make a measurable difference.
Structured over prose. “Add user authentication” is a wide distribution. “Implement email/password login, return a JWT, reject invalid credentials with a 401, don’t modify the session middleware” is a narrow one.
Test-to-requirement mapping. Every behavior change gets at least one test. If the agent runs it, the agent catches its own drift.
Small kernel. Minimum structured context that keeps behavior changes on track. If setup takes more than an afternoon, it’s over-engineered.
Speed where safe. Let agents fly on non-behavioral edits. Reserve structure for changes that affect what the system does.
Results
MacParakeet shipped with a clean architecture — dictation, transcription, CLI, database, and export — all built by agents working within context zones.
The defect rate is lower than you’d expect from AI-generated code, because the agents aren’t guessing about architecture or error handling. They implement against contracts, verify against tests, and stay within bounds defined before they write a line.
The method is simple. Tight context, clear boundaries, fast feedback. If you’re building with AI agents and watching quality degrade — more drift, more “works but doesn’t fit” — the fix is probably not a better model. It’s better context.
MacParakeet is a fast, private voice app for Mac — built with AI coding agents. Free and open-source.