DIY streaming for text-to-speech models

Overview

When you chat with an LLM, the model starts responding as soon as the first characters of output are ready rather than making you wait for it to write the entire reply. You can do the same for text-to-speech models: instead of waiting for the whole audio clip, the client starts playing the first chunk while the rest is still being generated. With a time to first chunk of roughly 200 ms, streamed speech is fast enough for real-time conversational user interfaces.
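
To make "time to first chunk" concrete, here is a minimal sketch of how you might measure it. Nothing here is XTTS-specific: `fake_tts_stream` is a hypothetical stand-in for a streaming model that yields bytes one chunk at a time, and the chunk size and delay are made-up values for illustration.

```python
import time
from typing import Iterator, List, Tuple


def fake_tts_stream(text: str, chunk_words: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS model.

    Yields one "audio" chunk per few words. A real model would yield
    audio samples; here the bytes are just the encoded words.
    """
    words = text.split()
    for i in range(0, len(words), chunk_words):
        time.sleep(delay_s)  # pretend this is the inference time for one chunk
        yield " ".join(words[i:i + chunk_words]).encode("utf-8")


def consume(stream: Iterator[bytes]) -> Tuple[float, float, List[bytes]]:
    """Drain a chunk stream and return (time to first chunk, total time, chunks)."""
    start = time.perf_counter()
    first = None
    chunks: List[bytes] = []
    for chunk in stream:
        if first is None:
            # The listener could start playback at this point.
            first = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, chunks


if __name__ == "__main__":
    ttfc, total, chunks = consume(fake_tts_stream("streaming lets playback start before synthesis ends"))
    print(f"first chunk after {ttfc:.3f}s, all {len(chunks)} chunks after {total:.3f}s")
```

The gap between the two timings is what streaming buys you: without it, the listener waits for the total time before hearing anything.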

In this demo, I’ll walk through the code for implementing a streaming endpoint for XTTS V2 and calling the endpoint in production. We’ll use the model to generate real-time speech from audience suggestions.

Tech stack