Speech-to-text model inference in pure C.

This is a C implementation of the inference pipeline for Mistral AI’s Voxtral Realtime 4B model. It has zero external dependencies beyond the C standard library. MPS inference is decently fast, while BLAS acceleration is usable but slow (it continuously converts the bf16 weights to fp32).
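To illustrate why the BLAS path pays a conversion cost: bf16 is the upper 16 bits of an IEEE-754 float32, so widening is a shift per element, but it has to be repeated for every weight that is handed to an fp32 routine. Below is a minimal sketch of that widening step. The function names `bf16_to_f32` and `bf16_row_to_f32` are hypothetical and are not taken from this project's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one bf16 value to float32. bf16 keeps the sign, the 8 exponent
 * bits and the top 7 mantissa bits of a float32, so the conversion is a
 * 16-bit left shift with the low mantissa bits set to zero. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* bit copy, avoids aliasing issues */
    return f;
}

/* Hypothetical helper: widen n bf16 weights into a caller-provided fp32
 * buffer, for example before passing them to an fp32 matrix multiply. */
static void bf16_row_to_f32(const uint16_t *src, float *dst, size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = bf16_to_f32(src[i]);
}
```

The per-element work is small, but doing it on every use means the weights are read and rewritten each time, which is consistent with the slowdown described above.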

Audio processing uses a chunked encoder with overlapping windows, bounding memory usage regardless of input length. Audio can also be piped from stdin (--stdin), which makes it easy to transcode and transcribe any format via ffmpeg, or captured live from the microphone (--from-mic, macOS). A streaming C API (vox_stream_t) lets you feed audio incrementally and receive token strings as they become available.
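The idea behind overlapping windows can be sketched as follows. This is only an illustration of the general technique, not the project's actual encoder: `chunk_t`, `plan_chunks`, and the window and overlap sizes are all hypothetical. Each chunk holds at most `window` samples, so the working buffer stays the same size however long the input is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one chunk: where it starts and how long it is. */
typedef struct {
    size_t start;
    size_t len;
} chunk_t;

/* Split n_samples into windows of at most `window` samples, where each
 * window starts `window - overlap` samples after the previous one.
 * Writes up to max_out chunks into out and returns the total number of
 * chunks needed to cover the input. */
static size_t plan_chunks(size_t n_samples, size_t window, size_t overlap,
                          chunk_t *out, size_t max_out) {
    assert(window > 0 && overlap < window);
    size_t hop = window - overlap;
    size_t count = 0;
    for (size_t start = 0; start < n_samples; start += hop) {
        size_t remaining = n_samples - start;
        size_t len = remaining < window ? remaining : window;
        if (count < max_out) {
            out[count].start = start;
            out[count].len = len;
        }
        count++;
        if (start + len >= n_samples)
            break;                 /* this chunk reached the end of the input */
    }
    return count;
}
```

For example, 10 samples with a window of 4 and an overlap of 1 give three chunks starting at samples 0, 3, and 6, each sharing one sample with its neighbor.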

Similar projects: Whisper.cpp