OpenAI Realtime API Rust Implementation
For the past decade, I have spent most of my time in Python; then I discovered the wonderful world of Rust. I understand that Python is still king for everything AI, and AI labs have little incentive to create Rust SDKs for their APIs and services. However, I hope that will change as the Rust community continues to grow.
For anyone wanting to try OpenAI's Realtime API, I created a repository that provides a minimal Rust implementation, enabling bidirectional voice conversations through WebSocket connections. The implementation handles real-time audio streaming, automatic speech detection, and live transcription display.
Technical Overview
The project implements a WebSocket client that connects to OpenAI’s Realtime API endpoint and manages concurrent audio streams. It captures microphone input at 24kHz mono PCM16 format, streams it to the API, and plays back AI-generated audio responses through the system’s default output device.
The architecture consists of:
- WebSocket connection handler using tokio-tungstenite
- Audio input/output streams via the cpal library
- Automatic sample rate conversion for devices not supporting 24kHz
- Unbounded audio queue to handle responses of any duration (a sketch of the playback side follows this list)
- Server-side Voice Activity Detection (VAD) for natural conversation flow
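To illustrate the queue-based playback, here is a minimal sketch of how the output side can be wired up with cpal 0.15. The function name and the assumption that the default output device uses f32 samples are mine, not necessarily what the repository does; the WebSocket receive task is expected to push decoded PCM16 samples into the shared queue.

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use cpal::traits::{DeviceTrait, HostTrait};

// Hypothetical playback helper: the output callback drains a shared queue of
// PCM16 samples filled by the WebSocket receive task.
fn build_output_stream(playback: Arc<Mutex<VecDeque<i16>>>) -> cpal::Stream {
    let host = cpal::default_host();
    let device = host.default_output_device().expect("no output device");
    // Assumes the default output format is f32; a full implementation would
    // match on the reported sample format instead.
    let config = device.default_output_config().expect("no output config").config();
    device
        .build_output_stream(
            &config,
            move |data: &mut [f32], _| {
                let mut queue = playback.lock().unwrap();
                for sample in data.iter_mut() {
                    // Play queued samples; emit silence when the queue runs dry
                    *sample = queue
                        .pop_front()
                        .map(|s| s as f32 / i16::MAX as f32)
                        .unwrap_or(0.0);
                }
            },
            |err| eprintln!("output stream error: {err}"),
            None,
        )
        .expect("failed to build output stream")
}

The caller keeps the returned stream alive and calls play() on it (via cpal::traits::StreamTrait); dropping it stops playback.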
Setup and Installation
Prerequisites:
- Rust 1.70 or later
- OpenAI API key with Realtime API access
- Working audio input/output devices
Clone the repository and set up your environment:
git clone https://github.com/yourusername/openai-realtime-rust.git
cd openai-realtime-rust
Create a .env file with your API key:
echo "OPENAI_API_KEY=sk-your-api-key-here" > .env
Running the Application
Build and run with cargo:
cargo run --release
The application will:
- Connect to wss://api.openai.com/v1/realtime with your API key (a connection sketch follows this list)
- Initialize audio streams using your default microphone and speakers
- Configure the session with VAD parameters
- Begin listening for voice input
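For reference, a connection with the required headers can be sketched like this with tokio-tungstenite 0.24. The helper name, the WsStream alias, and the model query parameter are illustrative assumptions; the Authorization and OpenAI-Beta headers follow OpenAI's Realtime API documentation at the time of writing.

use tokio::net::TcpStream;
use tokio_tungstenite::tungstenite::client::IntoClientRequest;
use tokio_tungstenite::{connect_async, MaybeTlsStream, WebSocketStream};

type WsStream = WebSocketStream<MaybeTlsStream<TcpStream>>;

// Hypothetical connection helper; adjust the model parameter to one you have access to.
async fn connect_realtime(api_key: &str) -> Result<WsStream, Box<dyn std::error::Error>> {
    let url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";
    let mut request = url.into_client_request()?;
    request
        .headers_mut()
        .insert("Authorization", format!("Bearer {api_key}").parse()?);
    request
        .headers_mut()
        .insert("OpenAI-Beta", "realtime=v1".parse()?);

    let (ws_stream, _response) = connect_async(request).await?;
    Ok(ws_stream)
}

The returned stream is typically split into a sink for outgoing events and a stream for incoming ones, so the capture and playback tasks can run concurrently.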
Once you see “You can now speak into your microphone…”, start talking. The server will detect when you’ve finished speaking and respond with synthesized audio.
Key Implementation Details
The main components are defined in src/main.rs:
Audio Configuration
const TARGET_SAMPLE_RATE: u32 = 24000; // API requirement
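Not every device offers 24kHz natively, so the stream configuration has to fall back to the device default and resample later. A possible shape for that check with cpal, using the constant above (the function name and fallback strategy are my assumptions):

use cpal::traits::DeviceTrait;

// Prefer 24 kHz mono capture; otherwise keep the device default and resample later.
fn input_config(device: &cpal::Device) -> cpal::StreamConfig {
    let default = device.default_input_config().expect("no default input config");
    let supports_target = device
        .supported_input_configs()
        .map(|mut configs| {
            configs.any(|c| {
                c.channels() == 1
                    && c.min_sample_rate().0 <= TARGET_SAMPLE_RATE
                    && c.max_sample_rate().0 >= TARGET_SAMPLE_RATE
            })
        })
        .unwrap_or(false);
    if supports_target {
        cpal::StreamConfig {
            channels: 1,
            sample_rate: cpal::SampleRate(TARGET_SAMPLE_RATE),
            buffer_size: cpal::BufferSize::Default,
        }
    } else {
        default.into()
    }
}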
Session Setup
The session is configured with these parameters (a JSON sketch follows the list):
- Modalities: text and audio
- Voice: “alloy” (configurable)
- Audio format: PCM16 at 24kHz
- VAD threshold: 0.5
- Silence duration: 200ms
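Serialized with serde_json, the configuration above maps onto a session.update event roughly like this. The field names follow the Realtime API's documented event shape at the time of writing, so double-check them against the current docs:

use serde_json::json;

// Session configuration sent once the WebSocket is open
fn session_update_event() -> String {
    json!({
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "silence_duration_ms": 200
            }
        }
    })
    .to_string()
}

The resulting string is sent as a text frame right after connecting, before any audio is streamed.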
WebSocket Message Handling
Events are processed in the main loop (src/main.rs:369-425); a dispatch sketch follows the list:
- response.audio.delta: Audio chunks from AI
- response.audio_transcript.done: AI response transcription
- conversation.item.input_audio_transcription.completed: User speech transcription
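A minimal dispatch over these events might look like the following. The playback queue mirrors the one used by the output stream, and the JSON field names ("delta", "transcript") are taken from the Realtime API event shapes at the time of writing:

use std::collections::VecDeque;
use std::sync::Mutex;

use base64::Engine;
use serde_json::Value;

// Hypothetical event handler: decode audio deltas into the playback queue,
// print transcripts as they complete.
fn handle_event(event: &Value, playback: &Mutex<VecDeque<i16>>) {
    match event["type"].as_str().unwrap_or("") {
        "response.audio.delta" => {
            // Base64-encoded PCM16 chunk from the model
            if let Some(b64) = event["delta"].as_str() {
                if let Ok(bytes) = base64::engine::general_purpose::STANDARD.decode(b64) {
                    let mut queue = playback.lock().unwrap();
                    for pair in bytes.chunks_exact(2) {
                        queue.push_back(i16::from_le_bytes([pair[0], pair[1]]));
                    }
                }
            }
        }
        "response.audio_transcript.done" => {
            if let Some(text) = event["transcript"].as_str() {
                println!("AI: {text}");
            }
        }
        "conversation.item.input_audio_transcription.completed" => {
            if let Some(text) = event["transcript"].as_str() {
                println!("You: {text}");
            }
        }
        _ => {}
    }
}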
Audio Processing Pipeline
Input stream captures audio, converts to PCM16, and sends via WebSocket. Output stream reads from a shared queue and plays audio chunks. Sample rate conversion handles devices that don’t support 24kHz natively.
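The upstream half of that pipeline can be sketched as follows: a naive linear resampler to 24kHz, f32-to-PCM16 conversion, and base64 encoding into an input_audio_buffer.append event. The function is illustrative and assumes mono input; a dedicated resampler crate such as rubato would give better quality.

use base64::Engine;
use serde_json::json;

const TARGET_SAMPLE_RATE: u32 = 24000; // matches the constant above

// Convert one capture callback's worth of f32 samples into an append event
fn encode_input_chunk(samples: &[f32], device_rate: u32) -> String {
    let ratio = device_rate as f32 / TARGET_SAMPLE_RATE as f32;
    let out_len = (samples.len() as f32 / ratio) as usize;
    let mut pcm = Vec::with_capacity(out_len * 2);
    for i in 0..out_len {
        // Linear interpolation between the two nearest source samples
        let pos = i as f32 * ratio;
        let idx = pos as usize;
        let frac = pos - idx as f32;
        let a = samples[idx];
        let b = *samples.get(idx + 1).unwrap_or(&a);
        let value = a + (b - a) * frac;
        let pcm16 = (value.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        pcm.extend_from_slice(&pcm16.to_le_bytes());
    }
    let audio = base64::engine::general_purpose::STANDARD.encode(&pcm);
    json!({ "type": "input_audio_buffer.append", "audio": audio }).to_string()
}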
Dependencies
Core dependencies from Cargo.toml (a sample [dependencies] section follows the list):
- tokio 1.41 - Async runtime
- tokio-tungstenite 0.24 - WebSocket client
- cpal 0.15 - Cross-platform audio
- serde / serde_json - JSON serialization
- base64 0.22 - Audio encoding
- dotenv 0.15 - Environment variables
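As a Cargo.toml section this might look like the following. The feature flags are my assumptions (a TLS feature on tokio-tungstenite is needed for wss://, and the full tokio feature set is the simplest choice), not necessarily what the repository uses:

[dependencies]
tokio = { version = "1.41", features = ["full"] }
tokio-tungstenite = { version = "0.24", features = ["native-tls"] }
cpal = "0.15"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
base64 = "0.22"
dotenv = "0.15"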
Error Handling
The implementation includes handling for:
- Missing API keys
- WebSocket connection failures
- Audio device initialization errors
- Sample rate mismatches
- Buffer underruns/overruns
Common issues and solutions:
- 403 errors indicate that the API key lacks Realtime API access
- Audio feedback can be resolved by using headphones
- Garbled audio is handled automatically through the queue-based buffering
Limitations
This is a minimal implementation focused on demonstrating core functionality. Production deployments should add:
- Reconnection logic for network failures (a minimal retry sketch follows this list)
- Audio device selection UI
- Error recovery mechanisms
- Conversation history persistence
- Multiple voice model selection
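As a starting point for the first item, reconnection can be as simple as a retry loop with capped exponential backoff; this reuses the hypothetical connect_realtime helper and WsStream alias sketched earlier:

use std::time::Duration;

// Retry the connection indefinitely, doubling the delay up to 30 seconds
async fn connect_with_retry(api_key: &str) -> Result<WsStream, Box<dyn std::error::Error>> {
    let mut delay = Duration::from_secs(1);
    loop {
        match connect_realtime(api_key).await {
            Ok(ws) => return Ok(ws),
            Err(err) => {
                eprintln!("connection failed: {err}; retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(30));
            }
        }
    }
}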
The repository serves as a foundation for building voice-enabled applications with OpenAI’s Realtime API in Rust, providing the essential WebSocket and audio handling infrastructure needed for real-time AI conversations. I hope you enjoy tinkering with it.