OpenAI Realtime API Rust Implementation
For the past decade, I have spent most of my time in Python; then I discovered the wonderful world of Rust. I understand that Python is still king for everything AI, and AI labs have little incentive to create Rust SDKs for their APIs and services. However, I hope that will change as the Rust community continues to grow.
For anyone wanting to try OpenAI's Realtime API, I created a repository that provides a minimal Rust implementation, enabling bidirectional voice conversations through WebSocket connections. The implementation handles real-time audio streaming, automatic speech detection, and live transcription display.
Technical Overview
The project implements a WebSocket client that connects to OpenAI’s Realtime API endpoint and manages concurrent audio streams. It captures microphone input at 24kHz mono PCM16 format, streams it to the API, and plays back AI-generated audio responses through the system’s default output device.
The architecture consists of:
- WebSocket connection handler using tokio-tungstenite
- Audio input/output streams via the cpal library
- Automatic sample rate conversion for devices not supporting 24kHz
- Unbounded audio queue to handle responses of any duration (a sketch of the playback side follows this list)
- Server-side Voice Activity Detection (VAD) for natural conversation flow
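To illustrate the queue-based playback, here is a minimal sketch of how the output side can be wired up with cpal 0.15. The function name and the assumption that the default output device uses f32 samples are mine, not necessarily what the repository does; the WebSocket receive task is expected to push decoded PCM16 samples into the shared queue.

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use cpal::traits::{DeviceTrait, HostTrait};

// Hypothetical playback helper: the output callback drains a shared queue of
// PCM16 samples filled by the WebSocket receive task.
fn build_output_stream(playback: Arc<Mutex<VecDeque<i16>>>) -> cpal::Stream {
    let host = cpal::default_host();
    let device = host.default_output_device().expect("no output device");
    // Assumes the default output format is f32; a full implementation would
    // match on the reported sample format instead.
    let config = device.default_output_config().expect("no output config").config();
    device
        .build_output_stream(
            &config,
            move |data: &mut [f32], _| {
                let mut queue = playback.lock().unwrap();
                for sample in data.iter_mut() {
                    // Play queued samples; emit silence when the queue runs dry
                    *sample = queue
                        .pop_front()
                        .map(|s| s as f32 / i16::MAX as f32)
                        .unwrap_or(0.0);
                }
            },
            |err| eprintln!("output stream error: {err}"),
            None,
        )
        .expect("failed to build output stream")
}

The caller keeps the returned stream alive and calls play() on it (via cpal::traits::StreamTrait); dropping it stops playback.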
Setup and Installation
Prerequisites:
- Rust 1.70 or later
- OpenAI API key with Realtime API access
- Working audio input/output devices
Clone the repository and set up your environment:
git clone https://github.com/yourusername/openai-realtime-rust.git
cd openai-realtime-rust
Create a .env file with your API key:
echo "OPENAI_API_KEY=sk-your-api-key-here" > .env
Running the Application
Build and run with cargo:
cargo run --release
The application will:
- Connect to wss://api.openai.com/v1/realtime with your API key (a connection sketch follows this list)
- Initialize audio streams using your default microphone and speakers
- Configure the session with VAD parameters
- Begin listening for voice input
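For reference, a connection with the required headers can be sketched like this with tokio-tungstenite 0.24. The helper name, the WsStream alias, and the model query parameter are illustrative assumptions; the Authorization and OpenAI-Beta headers follow OpenAI's Realtime API documentation at the time of writing.

use tokio::net::TcpStream;
use tokio_tungstenite::tungstenite::client::IntoClientRequest;
use tokio_tungstenite::{connect_async, MaybeTlsStream, WebSocketStream};

type WsStream = WebSocketStream<MaybeTlsStream<TcpStream>>;

// Hypothetical connection helper; adjust the model parameter to one you have access to.
async fn connect_realtime(api_key: &str) -> Result<WsStream, Box<dyn std::error::Error>> {
    let url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";
    let mut request = url.into_client_request()?;
    request
        .headers_mut()
        .insert("Authorization", format!("Bearer {api_key}").parse()?);
    request
        .headers_mut()
        .insert("OpenAI-Beta", "realtime=v1".parse()?);

    let (ws_stream, _response) = connect_async(request).await?;
    Ok(ws_stream)
}

The returned stream is typically split into a sink for outgoing events and a stream for incoming ones, so the capture and playback tasks can run concurrently.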
Once you see “You can now speak into your microphone…”, start talking. The server will detect when you’ve finished speaking and respond with synthesized audio.
Key Implementation Details
The main components are defined in src/main.rs:
Audio Configuration
const TARGET_SAMPLE_RATE: u32 = 24000; // API requirement
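Not every device offers 24kHz natively, so the stream configuration has to fall back to the device default and resample later. A possible shape for that check with cpal, using the constant above (the function name and fallback strategy are my assumptions):

use cpal::traits::DeviceTrait;

// Prefer 24 kHz mono capture; otherwise keep the device default and resample later.
fn input_config(device: &cpal::Device) -> cpal::StreamConfig {
    let default = device.default_input_config().expect("no default input config");
    let supports_target = device
        .supported_input_configs()
        .map(|mut configs| {
            configs.any(|c| {
                c.channels() == 1
                    && c.min_sample_rate().0 <= TARGET_SAMPLE_RATE
                    && c.max_sample_rate().0 >= TARGET_SAMPLE_RATE
            })
        })
        .unwrap_or(false);
    if supports_target {
        cpal::StreamConfig {
            channels: 1,
            sample_rate: cpal::SampleRate(TARGET_SAMPLE_RATE),
            buffer_size: cpal::BufferSize::Default,
        }
    } else {
        default.into()
    }
}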
Session Setup
The session is configured with these parameters (a JSON sketch follows the list):
- Modalities: text and audio
- Voice: “alloy” (configurable)
- Audio format: PCM16 at 24kHz
- VAD threshold: 0.5
- Silence duration: 200ms
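Serialized with serde_json, the configuration above maps onto a session.update event roughly like this. The field names follow the Realtime API's documented event shape at the time of writing, so double-check them against the current docs:

use serde_json::json;

// Session configuration sent once the WebSocket is open
fn session_update_event() -> String {
    json!({
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "silence_duration_ms": 200
            }
        }
    })
    .to_string()
}

The resulting string is sent as a text frame right after connecting, before any audio is streamed.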
WebSocket Message Handling
Events are processed in the main loop (src/main.rs:369-425); a dispatch sketch follows the list:
- response.audio.delta: Audio chunks from AI
- response.audio_transcript.done: AI response transcription
- conversation.item.input_audio_transcription.completed: User speech transcription
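A minimal dispatch over these events might look like the following. The playback queue mirrors the one used by the output stream, and the JSON field names ("delta", "transcript") are taken from the Realtime API event shapes at the time of writing:

use std::collections::VecDeque;
use std::sync::Mutex;

use base64::Engine;
use serde_json::Value;

// Hypothetical event handler: decode audio deltas into the playback queue,
// print transcripts as they complete.
fn handle_event(event: &Value, playback: &Mutex<VecDeque<i16>>) {
    match event["type"].as_str().unwrap_or("") {
        "response.audio.delta" => {
            // Base64-encoded PCM16 chunk from the model
            if let Some(b64) = event["delta"].as_str() {
                if let Ok(bytes) = base64::engine::general_purpose::STANDARD.decode(b64) {
                    let mut queue = playback.lock().unwrap();
                    for pair in bytes.chunks_exact(2) {
                        queue.push_back(i16::from_le_bytes([pair[0], pair[1]]));
                    }
                }
            }
        }
        "response.audio_transcript.done" => {
            if let Some(text) = event["transcript"].as_str() {
                println!("AI: {text}");
            }
        }
        "conversation.item.input_audio_transcription.completed" => {
            if let Some(text) = event["transcript"].as_str() {
                println!("You: {text}");
            }
        }
        _ => {}
    }
}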
Audio Processing Pipeline
Input stream captures audio, converts to PCM16, and sends via WebSocket. Output stream reads from a shared queue and plays audio chunks. Sample rate conversion handles devices that don’t support 24kHz natively.
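The upstream half of that pipeline can be sketched as follows: a naive linear resampler to 24kHz, f32-to-PCM16 conversion, and base64 encoding into an input_audio_buffer.append event. The function is illustrative and assumes mono input; a dedicated resampler crate such as rubato would give better quality.

use base64::Engine;
use serde_json::json;

const TARGET_SAMPLE_RATE: u32 = 24000; // matches the constant above

// Convert one capture callback's worth of f32 samples into an append event
fn encode_input_chunk(samples: &[f32], device_rate: u32) -> String {
    let ratio = device_rate as f32 / TARGET_SAMPLE_RATE as f32;
    let out_len = (samples.len() as f32 / ratio) as usize;
    let mut pcm = Vec::with_capacity(out_len * 2);
    for i in 0..out_len {
        // Linear interpolation between the two nearest source samples
        let pos = i as f32 * ratio;
        let idx = pos as usize;
        let frac = pos - idx as f32;
        let a = samples[idx];
        let b = *samples.get(idx + 1).unwrap_or(&a);
        let value = a + (b - a) * frac;
        let pcm16 = (value.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        pcm.extend_from_slice(&pcm16.to_le_bytes());
    }
    let audio = base64::engine::general_purpose::STANDARD.encode(&pcm);
    json!({ "type": "input_audio_buffer.append", "audio": audio }).to_string()
}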
Dependencies
Core dependencies from Cargo.toml (a sample [dependencies] section follows the list):
- tokio 1.41 - Async runtime
- tokio-tungstenite 0.24 - WebSocket client
- cpal 0.15 - Cross-platform audio
- serde / serde_json - JSON serialization
- base64 0.22 - Audio encoding
- dotenv 0.15 - Environment variables
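As a Cargo.toml section this might look like the following. The feature flags are my assumptions (a TLS feature on tokio-tungstenite is needed for wss://, and the full tokio feature set is the simplest choice), not necessarily what the repository uses:

[dependencies]
tokio = { version = "1.41", features = ["full"] }
tokio-tungstenite = { version = "0.24", features = ["native-tls"] }
cpal = "0.15"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
base64 = "0.22"
dotenv = "0.15"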
Error Handling
The implementation includes handling for:
- Missing API keys
- WebSocket connection failures
- Audio device initialization errors
- Sample rate mismatches
- Buffer underruns/overruns
Common issues and solutions:
- 403 errors indicate that the API key lacks Realtime API access
- Audio feedback can be resolved by using headphones
- Garbled audio is handled automatically through the queue-based buffering
Limitations
This is a minimal implementation focused on demonstrating core functionality. Production deployments should add:
- Reconnection logic for network failures (a minimal retry sketch follows this list)
- Audio device selection UI
- Error recovery mechanisms
- Conversation history persistence
- Multiple voice model selection
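As a starting point for the first item, reconnection can be as simple as a retry loop with capped exponential backoff; this reuses the hypothetical connect_realtime helper and WsStream alias sketched earlier:

use std::time::Duration;

// Retry the connection indefinitely, doubling the delay up to 30 seconds
async fn connect_with_retry(api_key: &str) -> Result<WsStream, Box<dyn std::error::Error>> {
    let mut delay = Duration::from_secs(1);
    loop {
        match connect_realtime(api_key).await {
            Ok(ws) => return Ok(ws),
            Err(err) => {
                eprintln!("connection failed: {err}; retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}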
The repository serves as a foundation for building voice-enabled applications with OpenAI’s Realtime API in Rust, providing the essential WebSocket and audio handling infrastructure needed for real-time AI conversations. I hope you enjoy tinkering with it.