Voice System

The Voice System provides speech-to-text (STT) and text-to-speech (TTS) capabilities for Swisper, enabling fully voice-driven conversations. It integrates with Azure Speech Services and uses WebSocket streaming for real-time audio processing.

On the input side, voice audio is streamed from the browser, transcribed to text, and fed into the Global Supervisor. On the output side, the generated text response is converted to speech and streamed back to the user.
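The input and output paths described above can be sketched as a single conversational turn. This is a minimal illustration, not Swisper's actual code: `handle_voice_turn`, the three callables it takes, and the stub implementations are all hypothetical names, and the stubs stand in for Azure Speech Services and the Global Supervisor so that the example runs without any external service.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

# Hypothetical type aliases; none of these names come from the Swisper codebase.
Transcriber = Callable[[AsyncIterator[bytes]], Awaitable[str]]
Supervisor = Callable[[str], Awaitable[str]]
Synthesizer = Callable[[str], AsyncIterator[bytes]]


async def handle_voice_turn(
    audio_in: AsyncIterator[bytes],
    transcribe: Transcriber,
    supervise: Supervisor,
    synthesize: Synthesizer,
) -> AsyncIterator[bytes]:
    """One turn: audio in -> text -> supervisor reply -> audio out."""
    user_text = await transcribe(audio_in)       # STT (input side)
    reply_text = await supervise(user_text)      # Global Supervisor (stubbed here)
    async for chunk in synthesize(reply_text):   # TTS (output side), streamed
        yield chunk


# --- Stubs so the sketch runs offline (assumptions, not real integrations) ---

async def fake_transcribe(chunks: AsyncIterator[bytes]) -> str:
    total = 0
    async for chunk in chunks:
        total += len(chunk)
    return f"heard {total} bytes"


async def fake_supervisor(text: str) -> str:
    return f"reply to: {text}"


async def fake_synthesize(text: str, chunk_size: int = 8) -> AsyncIterator[bytes]:
    data = text.encode("utf-8")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


async def microphone() -> AsyncIterator[bytes]:
    # Two silent frames standing in for audio streamed from the browser.
    for frame in (b"\x00" * 320, b"\x00" * 320):
        yield frame


async def run_demo() -> bytes:
    out = b""
    async for chunk in handle_voice_turn(
        microphone(), fake_transcribe, fake_supervisor, fake_synthesize
    ):
        out += chunk
    return out


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

Passing the three stages in as callables keeps the turn logic independent of any particular speech provider, which also makes it straightforward to test without network access.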

Key Components

| Component | Purpose |
| --- | --- |
| STT Integration | Speech-to-text via Azure Speech Services with streaming transcription |
| TTS Integration | Text-to-speech with configurable voice and language settings |
| WebSocket Handler | Real-time audio streaming between browser and backend |
| Voice Session Manager | Manages voice session lifecycle and audio format negotiation |
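To make "session lifecycle and audio format negotiation" more concrete, the sketch below shows one way such a manager could behave. Everything in it is an assumption for illustration: the `VoiceSession` class, the `SessionState` values, the `AudioFormat` fields, and the list of server-supported formats are hypothetical and are not taken from the Swisper implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class SessionState(Enum):
    NEGOTIATING = "negotiating"
    ACTIVE = "active"
    CLOSED = "closed"


@dataclass(frozen=True)
class AudioFormat:
    codec: str
    sample_rate_hz: int


# Hypothetical server-side preference order (most preferred first).
SERVER_FORMATS = (
    AudioFormat("pcm16", 16000),
    AudioFormat("opus", 48000),
)


class VoiceSession:
    """Tracks one voice session from format negotiation to close."""

    def __init__(self) -> None:
        self.state = SessionState.NEGOTIATING
        self.audio_format: Optional[AudioFormat] = None

    def negotiate(self, client_formats: Sequence[AudioFormat]) -> Optional[AudioFormat]:
        """Pick the first server-preferred format the client also offers."""
        if self.state is not SessionState.NEGOTIATING:
            raise RuntimeError("negotiation is only valid for a new session")
        for fmt in SERVER_FORMATS:
            if fmt in client_formats:
                self.audio_format = fmt
                self.state = SessionState.ACTIVE
                return fmt
        # No common format: the session cannot proceed.
        self.state = SessionState.CLOSED
        return None

    def close(self) -> None:
        self.state = SessionState.CLOSED


if __name__ == "__main__":
    session = VoiceSession()
    chosen = session.negotiate([AudioFormat("opus", 48000)])
    print(chosen, session.state)
    session.close()
    print(session.state)
```

In this sketch a failed negotiation closes the session immediately; a real system might instead fall back to a default format or ask the client to retry.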

Documentation Sections

Content Status

Audience section content (Overview, Architecture, Operations) will be populated during content migration (PR-6, SP-62). Section placeholders exist for navigation purposes.