Add Voice to Your Agent - New SDK Feature Released

Cloudflare's new voice pipeline for its Agents SDK adds real-time voice interaction to AI agents, so users can speak with an agent instead of only typing. Developers can add the capability to existing agents with a small amount of server-side code.

AI & Security · MEDIUM · Published: Apr 15, 2026 · Updated: Apr 15, 2026


Original reporting: Cloudflare Blog · Sunil Pai

AI summary: CyberPings AI · reviewed by Rohit Rana

🎯 Basically, you can now talk to your AI agents instead of just typing.

What Happened

Cloudflare has introduced an experimental voice pipeline for its Agents SDK that lets developers add real-time voice interaction to AI agents. The pipeline provides continuous speech-to-text (STT) and text-to-speech (TTS), so users can talk to an agent and hear its replies. According to Cloudflare, a basic voice agent needs only about 30 lines of server-side code.

How It Works

The voice pipeline builds on the existing Agents SDK architecture, in which each agent is a Durable Object that maintains its own state and can handle WebSocket connections. A voice interaction proceeds through these stages:

Audio Transport: The browser captures audio from the microphone and streams it over a WebSocket connection.
STT Session Setup: A continuous transcriber session is created when the voice call starts.
STT Input: Audio streams are sent continuously to the transcriber.
STT Turn Detection: The speech-to-text model detects when the user finishes speaking and generates a transcript.
LLM/Application Logic: The transcript is passed to the agent’s logic for processing.
TTS Output: The agent's response is converted to audio and sent back to the user.
Persistence: All messages are stored in an SQLite database, ensuring conversation history is maintained.
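The stages above can be pictured as a small event-driven loop on the server. This is a minimal sketch, not the Agents SDK's API: `Transcriber`, `SilenceTurnDetector`, and `VoiceSession` are hypothetical names, the real pipeline relies on the STT model for turn detection (the sketch swaps in a simple silence-timeout heuristic), and an in-memory array stands in for the SQLite database.

```typescript
// Hypothetical sketch of the voice pipeline stages; this is NOT the Agents SDK API.

interface Message {
  role: "user" | "assistant";
  text: string;
}

// One chunk of microphone audio, as it might arrive over the WebSocket.
interface AudioChunk {
  timestampMs: number;
  samples: Float32Array;
}

// Continuous transcriber: returns a transcript when a turn ends, otherwise null.
interface Transcriber {
  push(chunk: AudioChunk): string | null;
}

type AgentLogic = (transcript: string, history: Message[]) => string;
type Synthesizer = (text: string) => Uint8Array;

// Stand-in for model-based turn detection (an assumption): a turn ends once
// `silenceMs` of quiet audio follows speech. It emits placeholder text, not real STT.
class SilenceTurnDetector implements Transcriber {
  private readonly silenceMs: number;
  private readonly threshold: number;
  private heardSpeech = false;
  private lastSpeechMs = 0;
  private turns = 0;

  constructor(silenceMs = 700, threshold = 0.02) {
    this.silenceMs = silenceMs;
    this.threshold = threshold;
  }

  push(chunk: AudioChunk): string | null {
    // Root-mean-square energy as a crude voice-activity signal.
    let sum = 0;
    for (let i = 0; i < chunk.samples.length; i++) sum += chunk.samples[i] * chunk.samples[i];
    const rms = Math.sqrt(sum / Math.max(chunk.samples.length, 1));

    if (rms >= this.threshold) {
      this.heardSpeech = true;
      this.lastSpeechMs = chunk.timestampMs;
      return null;
    }
    if (this.heardSpeech && chunk.timestampMs - this.lastSpeechMs >= this.silenceMs) {
      this.heardSpeech = false;
      this.turns += 1;
      return `<transcript of turn ${this.turns}>`;
    }
    return null;
  }
}

// Wires the stages together: audio in, turn detection, agent logic, history, TTS out.
class VoiceSession {
  readonly history: Message[] = []; // in-memory stand-in for SQLite persistence
  private readonly stt: Transcriber;
  private readonly logic: AgentLogic;
  private readonly tts: Synthesizer;
  private readonly send: (audio: Uint8Array) => void;

  constructor(stt: Transcriber, logic: AgentLogic, tts: Synthesizer, send: (audio: Uint8Array) => void) {
    this.stt = stt;
    this.logic = logic;
    this.tts = tts;
    this.send = send;
  }

  // Called for every audio chunk streamed from the browser.
  onAudio(chunk: AudioChunk): void {
    const transcript = this.stt.push(chunk);
    if (transcript === null) return; // turn not finished yet

    this.history.push({ role: "user", text: transcript });
    const reply = this.logic(transcript, this.history);
    this.history.push({ role: "assistant", text: reply });
    this.send(this.tts(reply)); // synthesized audio goes back to the user
  }
}
```

Keeping STT, application logic, and TTS behind separate interfaces mirrors the stage list and lets each stand-in be replaced by a real provider independently.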

Key Features

The voice pipeline includes several components:

withVoice(Agent): For full conversational voice agents.
withVoiceInput(Agent): For speech-to-text only, suitable for dictation or voice search.
VoiceClient: A framework-agnostic client for non-React applications.
Built-in AI providers: Built-in STT and TTS providers, so developers can start without external API keys.
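The `withVoice(Agent)` and `withVoiceInput(Agent)` forms resemble TypeScript's class-mixin pattern, in which a function takes a base class and returns a subclass with extra behavior. The sketch below illustrates that general pattern only; `BaseAgent`, `withVoiceInputSketch`, `withVoiceSketch`, and their methods are hypothetical and do not reproduce the SDK's actual types or implementation.

```typescript
// Generic TypeScript class-mixin pattern; every name here is hypothetical.
type Constructor<T = {}> = new (...args: any[]) => T;

class BaseAgent {
  readonly messages: string[] = [];

  // The agent's ordinary text handler.
  onMessage(text: string): string {
    this.messages.push(text);
    return `You said: ${text}`;
  }
}

// Input-only mixin: a transcript becomes a stored message, with no spoken reply.
function withVoiceInputSketch<TBase extends Constructor<BaseAgent>>(Base: TBase) {
  return class extends Base {
    onTranscript(transcript: string): void {
      this.messages.push(transcript);
    }
  };
}

// Full-conversation mixin: transcript in, existing text logic, synthesized audio out.
function withVoiceSketch<TBase extends Constructor<BaseAgent>>(Base: TBase) {
  return class extends Base {
    onTranscript(transcript: string): Uint8Array {
      const reply = this.onMessage(transcript); // reuse the agent's text handler
      return this.synthesize(reply);
    }

    synthesize(text: string): Uint8Array {
      // Placeholder TTS: one byte per character; a real provider returns audio.
      return new Uint8Array(text.length);
    }
  };
}

class DictationAgent extends withVoiceInputSketch(BaseAgent) {}
class ConversationalAgent extends withVoiceSketch(BaseAgent) {}
```

The design choice the pattern captures is that voice is layered onto an existing agent class rather than requiring a separate agent type.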

Why This Matters

Voice makes interactions with AI agents more natural. Because text and voice share the same conversation history, users can switch between typing and speaking without losing context. This is especially useful where typing is inconvenient, such as while commuting or multitasking.
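One way to picture a single history shared by text and voice is a message log in which each entry records how it was entered, while the model's context ignores that distinction. This is a minimal in-memory sketch; `StoredMessage` and `ConversationStore` are hypothetical, and the SDK persists messages in SQLite rather than in an array.

```typescript
// Hypothetical shared conversation log; the SDK stores history in SQLite, not an array.
type Modality = "text" | "voice";

interface StoredMessage {
  id: number;
  role: "user" | "assistant";
  modality: Modality;
  text: string; // voice turns are stored as their transcript or reply text
}

class ConversationStore {
  private messages: StoredMessage[] = [];
  private nextId = 1;

  append(role: StoredMessage["role"], modality: Modality, text: string): StoredMessage {
    const msg: StoredMessage = { id: this.nextId++, role, modality, text };
    this.messages.push(msg);
    return msg;
  }

  // Context for the model: every turn, regardless of the channel it came from.
  context(): string[] {
    return this.messages.map((m) => `${m.role}: ${m.text}`);
  }

  countBy(modality: Modality): number {
    return this.messages.filter((m) => m.modality === modality).length;
  }
}
```

Because the model receives the same context whichever channel a turn arrived on, a user can type a question and ask a follow-up aloud within one conversation.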

What You Should Do

Developers interested in the feature can start by adding the voice pipeline to an existing agent, using Cloudflare's minimal server-side code as a starting point. The available hooks and components, such as withVoice for full conversations and withVoiceInput for dictation, can then tailor the voice experience to a specific use case. Because transcripts flow into the agent's existing logic, the agent keeps its ability to handle complex interactions while gaining a voice interface.

🔒 Pro Insight

This voice integration could significantly increase user engagement, but developers must build robust error handling into real-time interactions, such as recovering from dropped connections.
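As one example of such error handling, a client can reopen a dropped audio connection using capped exponential backoff with jitter. This is a generic sketch, not part of the Agents SDK: `backoffDelayMs` and `reconnectWithBackoff` are hypothetical helpers, and the `connect` callback stands in for whatever actually opens the WebSocket.

```typescript
// Generic reconnect helper with capped exponential backoff and "full jitter".
// Not part of the Agents SDK; `connect` is any function that opens a connection.

function backoffDelayMs(
  attempt: number,
  baseMs = 250,
  capMs = 8000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt)); // 250, 500, 1000, ... up to capMs
  return Math.floor(random() * ceiling); // uniform in [0, ceiling)
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function reconnectWithBackoff<T>(
  connect: () => Promise<T>,
  maxAttempts = 6,
  delay: (attempt: number) => number = (a) => backoffDelayMs(a),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxAttempts - 1) await sleep(delay(attempt));
    }
  }
  throw lastError;
}
```

Jitter spreads reconnect attempts from many clients over time, which helps avoid every client retrying at the same instant after an outage.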