- cross-posted to:
- artificial_intel@lemmy.ml
- lobsters@lemmy.bestiver.se
Speech-to-text model inference in pure C.
This is a C implementation of the inference pipeline for Mistral AI’s Voxtral Realtime 4B model. It has zero external dependencies beyond the C standard library. MPS inference is decently fast, while the BLAS-accelerated path is usable but slow, because it continuously converts the bf16 weights to fp32.
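To illustrate why that conversion costs time on every pass, here is a minimal sketch of a bf16 to fp32 widening loop. It relies only on the fact that a bf16 value is the upper 16 bits of an IEEE-754 float32; the function names `bf16_to_f32` and `bf16_row_to_f32` are hypothetical and are not taken from the project's source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen one bf16 value to fp32.
 * bf16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
 * of a float32, so widening is a 16-bit left shift. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit copy, avoids aliasing problems */
    return f;
}

/* Hypothetical helper: widen a row of n bf16 weights into an fp32
 * scratch buffer, as would be needed before an fp32 BLAS call. */
static void bf16_row_to_f32(const uint16_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = bf16_to_f32(src[i]);
}
```

Each element is cheap to convert, but repeating the loop over the weights every time they are used adds memory traffic on top of the matrix multiply itself, which is consistent with the slowdown described above.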
Audio processing uses a chunked encoder with overlapping windows, which bounds memory usage regardless of input length. Audio can also be piped from stdin (--stdin) or captured live from the microphone (--from-mic, macOS only), so any format can be transcoded with ffmpeg and transcribed. A streaming C API (vox_stream_t) lets you feed audio incrementally and receive token strings as they become available.
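The idea of overlapping windows can be sketched as below. This is not the project's code: the `chunk_span` type, the `plan_chunks` function and the window and overlap sizes are all assumptions made for illustration. The point is that only one window of samples needs to be resident at a time, however long the input is.

```c
#include <stddef.h>

/* Hypothetical description of one window over the input samples. */
typedef struct {
    size_t start; /* index of the first sample in the window */
    size_t len;   /* number of samples in the window */
} chunk_span;

/* Split `total` samples into windows of at most `window` samples, where
 * consecutive windows share `overlap` samples. Writes at most `max_spans`
 * entries to `out` and returns how many were written. Returns 0 for
 * invalid parameters (window == 0 or overlap >= window). */
size_t plan_chunks(size_t total, size_t window, size_t overlap,
                   chunk_span *out, size_t max_spans)
{
    if (window == 0 || overlap >= window)
        return 0;

    size_t step = window - overlap; /* how far each window advances */
    size_t n = 0;

    for (size_t start = 0; start < total && n < max_spans; start += step) {
        size_t remaining = total - start;
        size_t len = remaining < window ? remaining : window;

        out[n].start = start;
        out[n].len = len;
        n++;

        if (start + len >= total) /* this window reached the end */
            break;
    }
    return n;
}
```

With 10 samples, a window of 4 and an overlap of 1, this yields three spans starting at samples 0, 3 and 6. A caller would process each span in turn and reuse the same buffer for every window.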
Similar projects: Whisper.cpp

