
OpenAI Realtime API Rust Implementation

By Raja Patnaik

For the past decade I spent most of my time in Python, until I discovered the wonderful world of Rust. I understand that Python is still king for everything AI, and AI labs have little incentive to create Rust SDKs for their APIs and services. However, I hope that will change as the Rust community grows.

For anyone wanting to try OpenAI’s Realtime API, I created a repository with a minimal Rust implementation that enables bidirectional voice conversations over a WebSocket connection. It handles real-time audio streaming, automatic speech detection, and live transcription display.

Technical Overview

The project implements a WebSocket client that connects to OpenAI’s Realtime API endpoint and manages concurrent audio streams. It captures microphone input as 24kHz mono PCM16, streams it to the API, and plays AI-generated audio responses back through the system’s default output device.

The architecture consists of:

  • WebSocket connection handler using tokio-tungstenite
  • Audio input/output streams via the cpal library
  • Automatic sample rate conversion for devices not supporting 24kHz
  • Unbounded audio queue to handle responses of any duration (sketched after this list)
  • Server-side Voice Activity Detection (VAD) for natural conversation flow
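Here is a minimal sketch of that playback queue, assuming a plain Arc<Mutex<VecDeque<i16>>> shared between the WebSocket reader and the audio output callback; the repository may structure this differently, and the function names are made up:

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Shared playback buffer: the WebSocket task pushes decoded PCM16 samples,
/// the audio output callback pops them. VecDeque grows as needed, so long
/// responses never overflow a fixed-size ring buffer.
type AudioQueue = Arc<Mutex<VecDeque<i16>>>;

/// Called by the WebSocket task whenever an audio chunk arrives.
fn enqueue_samples(queue: &AudioQueue, samples: &[i16]) {
    queue.lock().unwrap().extend(samples.iter().copied());
}

/// Called by the output callback; fills the hardware buffer,
/// padding with silence if we are momentarily out of data.
fn dequeue_samples(queue: &AudioQueue, out: &mut [i16]) {
    let mut q = queue.lock().unwrap();
    for slot in out.iter_mut() {
        *slot = q.pop_front().unwrap_or(0);
    }
}

Padding with silence on underrun keeps the output stream running even when the network is momentarily slower than playback.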

Setup and Installation

Prerequisites:

  • Rust 1.70 or later
  • OpenAI API key with Realtime API access
  • Working audio input/output devices

Clone the repository and set up your environment:

git clone https://github.com/yourusername/openai-realtime-rust.git
cd openai-realtime-rust

Create a .env file with your API key:

echo "OPENAI_API_KEY=sk-your-api-key-here" > .env

Running the Application

Build and run with cargo:

cargo run --release

The application will:

  1. Connect to wss://api.openai.com/v1/realtime with your API key (connection sketch below)
  2. Initialize audio streams using your default microphone and speakers
  3. Configure the session with VAD parameters
  4. Begin listening for voice input
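Step 1 boils down to building a request with the right headers and handing it to tokio-tungstenite. A minimal sketch; the model query parameter and the OpenAI-Beta header are my assumptions about the hosted Realtime API, so check the current docs for the exact values:

use tokio::net::TcpStream;
use tokio_tungstenite::{
    connect_async, tungstenite::client::IntoClientRequest, MaybeTlsStream, WebSocketStream,
};

type WsStream = WebSocketStream<MaybeTlsStream<TcpStream>>;

async fn connect(api_key: &str) -> Result<WsStream, Box<dyn std::error::Error>> {
    // The model query parameter and the OpenAI-Beta header are assumptions;
    // adjust them to match the current API documentation.
    let url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";
    let mut request = url.into_client_request()?;
    let headers = request.headers_mut();
    headers.insert("Authorization", format!("Bearer {api_key}").parse()?);
    headers.insert("OpenAI-Beta", "realtime=v1".parse()?);

    // connect_async performs the TLS and WebSocket handshakes and returns the stream.
    let (ws_stream, _response) = connect_async(request).await?;
    Ok(ws_stream)
}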

Once you see “You can now speak into your microphone…”, start talking. The server will detect when you’ve finished speaking and respond with synthesized audio.

Key Implementation Details

The main components are defined in src/main.rs:

Audio Configuration

const TARGET_SAMPLE_RATE: u32 = 24000;  // API requirement

Session Setup

The session is configured with these parameters, sent to the API as a session.update event (sketched after the list):

  • Modalities: text and audio
  • Voice: “alloy” (configurable)
  • Audio format: PCM16 at 24kHz
  • VAD threshold: 0.5
  • Silence duration: 200ms
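A sketch of that session.update payload built with serde_json; the field names follow the Realtime API’s server-VAD configuration, but verify them against the current docs:

use serde_json::json;

// Session configuration mirroring the parameters above.
fn session_update() -> serde_json::Value {
    json!({
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "voice": "alloy",
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "silence_duration_ms": 200
            }
        }
    })
}

The value is serialized with to_string() and sent as a single WebSocket text message right after the connection opens.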

WebSocket Message Handling

Events are processed in the main loop (src/main.rs:369-425); a simplified version of the dispatch is sketched after this list. The key event types are:

  • response.audio.delta: Audio chunks from AI
  • response.audio_transcript.done: AI response transcription
  • conversation.item.input_audio_transcription.completed: User speech transcription
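A stripped-down version of that dispatch, assuming each incoming text frame is a JSON event with a type field (as in the Realtime API); handle_audio_delta is a placeholder for the real decode-and-enqueue logic:

use serde_json::Value;

// Dispatch one incoming WebSocket text frame. The "delta" and "transcript"
// field names match the Realtime API events listed above.
fn handle_event(raw: &str) {
    let event: Value = match serde_json::from_str(raw) {
        Ok(v) => v,
        Err(_) => return, // ignore frames that are not valid JSON
    };

    match event["type"].as_str() {
        Some("response.audio.delta") => {
            // Base64-encoded PCM16 audio chunk from the model.
            if let Some(delta) = event["delta"].as_str() {
                handle_audio_delta(delta);
            }
        }
        Some("response.audio_transcript.done") => {
            // Full transcript of the model's spoken response.
            if let Some(text) = event["transcript"].as_str() {
                println!("AI: {text}");
            }
        }
        Some("conversation.item.input_audio_transcription.completed") => {
            // Transcript of what the user just said.
            if let Some(text) = event["transcript"].as_str() {
                println!("You: {text}");
            }
        }
        _ => {} // other event types are ignored in this sketch
    }
}

fn handle_audio_delta(_base64_pcm16: &str) { /* decode and enqueue for playback */ }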

Audio Processing Pipeline

The input stream captures audio, converts it to PCM16, and sends it over the WebSocket. The output stream reads from the shared queue and plays audio chunks. Sample rate conversion handles devices that don’t support 24kHz natively.
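To make the input half concrete, here is a sketch of turning captured f32 samples into the base64-encoded PCM16 payload of an input_audio_buffer.append event. It assumes the samples are already resampled to 24kHz mono; the event name is the Realtime API’s, the function is mine:

use base64::{engine::general_purpose::STANDARD, Engine};
use serde_json::json;

// Convert f32 samples (already mono, 24 kHz) to little-endian PCM16,
// base64-encode them, and wrap them in an input_audio_buffer.append event
// ready to send as a WebSocket text message.
fn audio_append_event(samples: &[f32]) -> serde_json::Value {
    let pcm16: Vec<u8> = samples
        .iter()
        .flat_map(|s| {
            let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
            v.to_le_bytes()
        })
        .collect();

    json!({
        "type": "input_audio_buffer.append",
        "audio": STANDARD.encode(&pcm16),
    })
}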

Dependencies

Core dependencies from Cargo.toml:

  • tokio 1.41 - Async runtime
  • tokio-tungstenite 0.24 - WebSocket client
  • cpal 0.15 - Cross-platform audio
  • serde/serde_json - JSON serialization
  • base64 0.22 - Audio encoding
  • dotenv 0.15 - Environment variables
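For convenience, a matching [dependencies] section might look like the following; the feature flags are my assumptions, not copied from the repository:

[dependencies]
tokio = { version = "1.41", features = ["full"] }
tokio-tungstenite = { version = "0.24", features = ["native-tls"] }
cpal = "0.15"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
base64 = "0.22"
dotenv = "0.15"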

Error Handling

The implementation includes handling for:

  • Missing API keys
  • WebSocket connection failures
  • Audio device initialization errors
  • Sample rate mismatches
  • Buffer underruns/overruns
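As one example from the list above, audio device initialization can fail when no device is available; cpal makes this explicit with Option return values, so a helper like this (my own sketch) can surface a readable error instead of panicking deep inside the setup code:

use cpal::traits::HostTrait;

// Pick the default input and output devices, or return a clear error.
fn default_devices() -> Result<(cpal::Device, cpal::Device), String> {
    let host = cpal::default_host();
    let input = host
        .default_input_device()
        .ok_or("no default input device (is a microphone connected?)")?;
    let output = host
        .default_output_device()
        .ok_or("no default output device")?;
    Ok((input, output))
}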

Common issues and solutions:

  • A 403 error indicates that your API key lacks Realtime API access
  • Audio feedback is best avoided by using headphones
  • Garbled audio is handled automatically by the buffering logic

Limitations

This is a minimal implementation focused on demonstrating core functionality. Production deployments should add:

  • Reconnection logic for network failures
  • Audio device selection UI
  • Error recovery mechanisms
  • Conversation history persistence
  • Multiple voice model selection

The repository serves as a foundation for building voice-enabled applications with OpenAI’s Realtime API in Rust, providing the essential WebSocket and audio handling infrastructure needed for real-time AI conversations. I hope you enjoy tinkering with it.