
Whisper, Parakeet, and the Race to Run Speech Recognition on Apple Silicon

In three years, Mac speech recognition went from slow Python scripts to native code on a dedicated chip — and for many users, made cloud APIs unnecessary. Here's how.

Your Mac has three computers inside it. A CPU for general work. A GPU for graphics and parallel computation. And a Neural Engine — a dedicated chip designed to run machine learning models as fast and efficiently as possible.

This is the story of how Mac speech recognition evolved from slow Python scripts running Whisper on the GPU to native Swift code running Parakeet on the Neural Engine — and why, for a growing number of users, sending voice to a cloud server no longer makes sense.

The Whisper moment

In September 2022, OpenAI released Whisper. Open source. Free. A speech-to-text model that was genuinely good — not research-demo good, but useful-in-production good. Multiple sizes from tiny (39 million parameters) to large (1.5 billion), supporting 99 languages.

Before Whisper, local speech recognition on a Mac meant Apple’s built-in dictation, which was limited, or expensive commercial software. After Whisper, anyone could run a capable STT model on their own hardware. The model was open. The only question was how to run it well.

The first wave of Mac transcription apps appeared within months. MacWhisper. Superwhisper. Aiko. Most ran Whisper through Python — the model’s native environment. You’d install Python, download the model, and run inference through PyTorch. It worked. It was slow. Whisper large-v3 — the most accurate variant — processed audio at roughly 1x realtime on an M1. A 60-minute file took about 60 minutes. And it consumed nearly 4 GB of RAM doing it.

The smaller models were faster but less accurate. Whisper tiny ran at 32x realtime but produced noticeably worse transcripts. You could have speed or accuracy, not both.

WhisperKit brings Whisper home

In early 2024, a team called Argmax — founded by former Apple machine learning engineers — released WhisperKit. It was the bridge Whisper needed.

WhisperKit compiled Whisper models into CoreML format, Apple’s on-device ML framework. CoreML knows how to route computations to the right chip. Neural network layers go to the Neural Engine. Signal processing stays on the CPU. The result is faster inference, lower memory, and no Python.

The improvement was real. In their ICML 2025 paper, Argmax reported WhisperKit achieving 2.2% word error rate using their optimized Large v3 Turbo model on the Neural Engine — while running at real-time streaming latency. Apple featured WhisperKit in WWDC sample projects and ML starter materials.

MacWhisper, VoiceInk, and other apps adopted WhisperKit or similar CoreML pipelines. Transcription speed improved substantially. But the underlying model was still Whisper, and Whisper had a ceiling. Even with CoreML optimization, large-v3 achieved roughly 15-30x realtime on Apple Silicon. Good enough for a transcription app. Not fast enough to feel invisible for live dictation, where every hundred milliseconds between speaking and seeing text feels like lag.

A different architecture

While the Whisper ecosystem was maturing on Mac, NVIDIA released something new. Parakeet TDT — Token-and-Duration Transducer — was a fundamentally different design built for speed without sacrificing accuracy.

The key innovation is in the name. Traditional transducer models process audio frame by frame, running the decoder at every time step regardless of whether new content has appeared. Parakeet’s Token-and-Duration Transducer predicts both a token and how many frames that token spans in a single step. Instead of stepping through every frame sequentially, the model advances by the predicted duration — skipping frames it has already accounted for. The result is roughly 2.8x faster decoding than a conventional transducer, according to NVIDIA’s measurements, because most frames in a typical audio stream don’t produce new tokens.
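The frame-skipping idea can be sketched with a toy decoder loop. This is purely illustrative: the "model" is a scripted list of (token, duration) predictions standing in for a real neural decoder, and the frame counts are made up.

```python
# Toy illustration of Token-and-Duration Transducer (TDT) decoding
# versus a conventional frame-by-frame transducer. The predictions
# list is a hypothetical stand-in for a neural decoder's output.

def conventional_steps(num_frames):
    """A conventional transducer runs the decoder once per audio frame."""
    return num_frames

def tdt_steps(predictions):
    """A TDT decoder predicts (token, duration) pairs and advances by
    the predicted duration, skipping frames it has already covered."""
    frame, steps = 0, 0
    for token, duration in predictions:
        frame += duration  # jump past every frame this token spans
        steps += 1
    return steps

# Scripted predictions for a 20-frame utterance; None marks silence.
predictions = [("h", 3), ("i", 2), (None, 5), ("y", 4), ("o", 3), ("u", 3)]

print(conventional_steps(20))   # 20 decoder calls, one per frame
print(tdt_steps(predictions))   # 6 decoder calls for the same audio
```

Because most frames carry no new token, the TDT loop runs the decoder far fewer times — the source of the roughly 2.8x decoding speedup NVIDIA reports.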

NVIDIA’s benchmark results reflected this. The English-only Parakeet TDT 0.6B-v2 achieved 1.69% word error rate on LibriSpeech test-clean with 600 million parameters — less than half the size of Whisper large-v3’s 1.55 billion.

The multilingual v3 variant followed, covering 25 European languages with automatic detection. This is the model MacParakeet ships — it trades some English accuracy for broad European language support.

The catch: Parakeet was built for NVIDIA’s NeMo framework. Python. CUDA GPUs. Not exactly a native Mac experience.

A library called parakeet-mlx ported the model to Apple Silicon through MLX, Apple’s machine learning framework for Metal GPUs. This was MacParakeet’s first STT backend — a Python daemon running Parakeet through MLX, communicating with the Swift app over JSON-RPC. Fast transcription, but with Python’s complexity hiding behind the scenes: a managed virtual environment, subprocess lifecycle, serialization overhead, and roughly 2 GB of RAM for the daemon alone.

It worked well enough to ship. But it wasn’t the right long-term architecture, for reasons that become clear when you add a second model.

The GPU problem

Running Parakeet on the GPU through MLX worked. But it monopolized the GPU — any other ML workload had to wait. On a Mac with three compute units (CPU, GPU, Neural Engine), using only one of them for speech recognition is a waste of silicon.

The fix was clear: move STT off the GPU entirely.

Two parallel paths

By early 2026, two clear paths had formed for running local STT natively on Mac:

Whisper path: OpenAI’s model → WhisperKit (Argmax) → CoreML → Neural Engine

Parakeet path: NVIDIA’s model → parakeet-mlx → MLX → GPU

Same destination — fast local transcription on Apple Silicon — but different models, different runtimes, different chips. The Whisper ecosystem had the polish of native Swift and CoreML. The Parakeet ecosystem had the raw speed and accuracy but was still running through Python on the GPU.

Then FluidAudio closed the gap.

FluidAudio

FluidAudio is an open-source Swift SDK by FluidInference that does for Parakeet what WhisperKit did for Whisper: compiles it into CoreML and runs it natively on the Neural Engine.

The project has shipped 37 releases in nine months, earned 1,500 GitHub stars, and already powers over 20 production apps including VoiceInk and Spokenly. The SDK is Apache 2.0 licensed and provides a native Swift API: initialize a model, call transcribe(), get back words with timestamps and confidence scores. No Python. No subprocess. No serialization layer.

What makes FluidAudio significant isn’t just convenience. It’s hardware allocation.

When Parakeet ran through MLX, it occupied the GPU. When it runs through FluidAudio’s CoreML backend, it moves to the Neural Engine. The GPU is free for other workloads — reduced contention, each task on silicon designed for it:

CPU   →  App logic, UI, database
GPU   →  Available for other ML workloads
ANE   →  Parakeet TDT STT (speech recognition)

The Neural Engine

The Neural Engine deserves its own explanation, because it’s the reason local speech recognition became practical as a background process.

Every Apple Silicon chip contains a dedicated Neural Engine with 16 cores optimized for tensor operations — the matrix multiplications that neural networks run on. It operates on a separate execution pipeline from the CPU and GPU, and Apple has steadily increased its throughput:

Chip   Neural Engine   Year
M1     11 TOPS         2020
M2     15.8 TOPS       2022
M3     18 TOPS         2023
M4     38 TOPS         2024

The M4’s Neural Engine delivers 38 trillion operations per second while consuming a few watts — a fraction of what a discrete GPU draws for the same workload. The Neural Engine isn’t faster in absolute terms than a datacenter GPU, but it’s dramatically more efficient for on-device inference. That efficiency is why your MacBook can transcribe an hour of audio in seconds without the fans spinning up or the battery draining.

When FluidAudio runs Parakeet on the Neural Engine, the working memory footprint is about 66 MB. The same model on GPU through MLX used roughly 2 GB. CoreML’s optimized scheduling and the Neural Engine’s inference-oriented design account for most of that reduction.
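The scale of that reduction is worth a quick back-of-envelope check, using the two working-set figures quoted above:

```python
# Memory reduction from moving Parakeet off the GPU (MLX, ~2 GB)
# onto the Neural Engine (FluidAudio/CoreML, ~66 MB).
mlx_mb = 2048   # approximate working set via parakeet-mlx on GPU
ane_mb = 66     # approximate working set via CoreML on the ANE

reduction = 1 - ane_mb / mlx_mb
print(f"{reduction:.0%} less RAM")  # roughly 97% less
```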

The numbers

With both models now running natively on the Neural Engine, here’s how they compare. Published vendor benchmarks except where noted.

                       WhisperKit (large-v3 turbo)   Parakeet TDT 0.6B-v2 (English)   Parakeet TDT 0.6B-v3 (multilingual)
Parameters             ~1 billion                    600 million                      600 million
English WER            2.2%                          1.69%                            ~5.4% (internal)
Speed (Apple Silicon)  Real-time streaming           ~110-300x realtime               ~155-237x realtime
RAM (ANE)              N/A                           ~66 MB                           ~66 MB
Languages              99                            English only                     25 European
Architecture           Encoder-decoder (optimized)   Token-Duration Transducer        Token-Duration Transducer
Native SDK             WhisperKit (Argmax)           FluidAudio (FluidInference)      FluidAudio (FluidInference)

Parakeet speed measured via FluidAudio on Apple Silicon. WhisperKit WER from ICML 2025 paper. Parakeet v2 WER from NVIDIA’s published LibriSpeech test-clean results.

A few things stand out. The English-only Parakeet v2 achieves the lowest word error rate with the fewest parameters. WhisperKit achieves the best streaming latency and supports 99 languages. Parakeet v3 trades English accuracy for 25-language support with auto-detection.

The language trade-off is the most important factor. For Mandarin, Arabic, Hindi, Japanese, or other non-European languages, Whisper through WhisperKit is the right choice. For European languages, Parakeet is faster, more accurate, and uses dramatically less memory.

What cloud transcription costs

The comparison that matters most isn’t Whisper vs. Parakeet. It’s local vs. cloud.

Cloud speech-to-text APIs charge per minute of audio processed:

Provider             Per minute   Per hour   100 hours/month
OpenAI Whisper API   $0.006       $0.36      $36
Deepgram             $0.008       $0.48      $48
AssemblyAI           $0.006       $0.36      $36
Google Cloud STT     $0.024       $1.44      $144
AWS Transcribe       $0.024       $1.44      $144

Prices as of early 2026. Standard tiers; volume discounts and free tiers vary.

A freelance journalist transcribing interviews. A podcaster processing back-catalog episodes. A researcher working through recorded lectures. At 100 hours a month, you’re paying $36-144 per month for cloud transcription — or $0 locally, on hardware you already own.
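The arithmetic behind those figures is simple enough to sketch, using the per-minute rates from the table above (standard tiers as of early 2026; volume discounts would lower them):

```python
# Rough monthly cost of cloud STT at the published per-minute rates,
# versus $0 for local transcription on hardware you already own.
RATES_PER_MIN = {
    "OpenAI Whisper API": 0.006,
    "Deepgram": 0.008,
    "AssemblyAI": 0.006,
    "Google Cloud STT": 0.024,
    "AWS Transcribe": 0.024,
}

def monthly_cost(provider, hours):
    """Dollars per month for a given number of transcribed hours."""
    return RATES_PER_MIN[provider] * 60 * hours

for provider in RATES_PER_MIN:
    print(f"{provider}: ${monthly_cost(provider, 100):.2f}/month at 100 h")
```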

The cost isn’t just financial. Every cloud transcription sends your audio to someone else’s infrastructure. Data retention policies vary by provider and tier, and they can change — what’s deleted today might be retained tomorrow under updated terms. Local transcription sidesteps the question entirely. Your audio goes from your microphone into memory, becomes text, and the audio is discarded. Nothing touches a network interface.

What changed in MacParakeet

We migrated our STT backend from parakeet-mlx (Python, MLX, GPU) to FluidAudio (Swift, CoreML, Neural Engine). The rest of the app — dictation service, transcription service, text pipeline, CLI — calls the same interface it always did. The implementation behind that interface is entirely new.

What got better:

Memory dropped from roughly 2 GB to 66 MB. The Python daemon loaded the entire model into GPU memory. FluidAudio’s CoreML runtime uses the Neural Engine’s dedicated memory path, which is dramatically more efficient for a background app.

No more Python. No virtual environment bootstrapping on first launch. No subprocess to monitor, restart on crash, or debug when it hangs. MacParakeet is a single Swift process.

Accuracy improved. FluidAudio’s CoreML implementation achieves approximately 2.5% word error rate on our internal benchmarks — better than the 6.3% we measured with parakeet-mlx. The improvement likely reflects differences in decoding implementation (beam search configuration, tokenizer handling) rather than the model weights themselves, which are identical.

App Store became possible. Apple’s sandbox rules prohibit spawning arbitrary subprocesses. A Python daemon was a permanent distribution blocker. A native Swift framework is not.

Speaker diarization came free. FluidAudio bundles pyannote Community-1 — a speaker diarization pipeline that identifies who said what. It ships in the same SDK we already use for transcription, at roughly 15% diarization error rate and 122x realtime processing speed.

And the honest trade-offs:

Speed varies by chip generation. On an M1, FluidAudio achieves approximately 155x realtime — still fast, but slower than the 300x we measured on GPU via MLX. On newer chips (M2 Pro and later), higher Neural Engine throughput closes that gap. A 60-minute file processes in about 23 seconds on M1, faster on newer hardware.
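A realtime multiplier translates to wall-clock time as audio duration divided by the factor, which is where the ~23-second figure comes from:

```python
# Wall-clock processing time implied by a realtime multiplier:
# processing time = audio duration / realtime factor.
def processing_seconds(audio_minutes, realtime_factor):
    return audio_minutes * 60 / realtime_factor

print(round(processing_seconds(60, 155)))  # ~23 s for a 60-minute file on M1
print(round(processing_seconds(60, 300)))  # ~12 s at the old GPU/MLX speed
```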

Larger initial download. CoreML models are pre-compiled for specific hardware targets, which increases the first-launch download from roughly 2.5 GB to about 6 GB. After that, everything runs offline.

Both trade-offs are worth it. A base M1 is still transcribing an hour of audio in under 30 seconds — fast enough that you’ll never wait for it. And the system-wide benefits — zero GPU contention, 97% less RAM, a simpler and more reliable architecture — outweigh a throughput difference on one chip variant.

Where this leaves Mac speech recognition

Three years ago, running a good speech model locally on a Mac meant Python, pip, virtual environments, and patience. The apps that shipped on early Whisper ports deserve credit for proving the market existed.

Today, two production-quality STT models run natively on Apple Silicon through Swift SDKs. Both run on the Neural Engine. Both are open source. Both have active ecosystems shipping real products.

For European language transcription, Parakeet through FluidAudio is the fastest and most accurate option available on a Mac. For non-European languages, Whisper through WhisperKit remains the right choice. The path forward likely includes both — and Apple’s Neural Engine, which has grown from 11 to 38 TOPS in four generations, has headroom for whatever comes next.

The case for cloud speech-to-text has narrowed to a specific set of needs: broad multilingual coverage, zero-setup convenience, or AI-powered text rewriting that transforms what you said into what you meant to write. Those are real needs for some users. But for those who speak European languages and want fast, private, accurate transcription — the answer is already running on a chip inside their Mac, waiting to be used.


MacParakeet is a fast, private voice app for Mac — system-wide dictation and file transcription, powered by Parakeet TDT on Apple’s Neural Engine. Free and open-source.