Voice is the most natural interface for human communication. The ability to talk, listen, and respond in real time distinguishes human interactions and represents the next evolution in how telecom providers engage with subscribers, from troubleshooting connectivity issues to handling billing queries. As AI capabilities rapidly advance, the frontier lies in enabling intelligent, real-time voice conversations within applications and services.
This article explains the integration of LiveKit, an open-source WebRTC platform for real-time communications, with Amazon Bedrock’s Nova Sonic, a streaming-capable speech-to-speech large language model designed for voice-based interaction. The goal was to develop a plugin that powers real-time AI conversations, including voice input, smart interruption handling, and natural voice responses that mimic a human agent.
The integration represents a significant advancement in conversational AI technology, particularly for telecom customer service applications. By combining the real-time media capabilities of LiveKit with the advanced language understanding and speech generation of Nova Sonic, this solution creates a seamless bridge between human callers and AI systems. For telecom providers, this means more efficient customer service, reduced wait times, and improved customer satisfaction through natural, intelligent voice support.
The Motivation
Developing a voice assistant required streaming live user audio, processing it with a conversational model, and sending a voice reply, all with minimal latency. Amazon Connect offers telephone-based communication capabilities and is working towards supporting the Amazon Nova Sonic speech-to-speech model. LiveKit provided robust, scalable infrastructure for audio communication, but lacked native support for intelligent voice interaction with the Amazon Nova Sonic model.
Amazon recently introduced Nova Sonic, a powerful large language model designed for bidirectional audio streaming. Nova Sonic can ingest audio input in real time, transcribe speech, generate intelligent responses, and synthesize high-quality voice output. This makes it an ideal engine for dynamic, voice-based agents capable of seamless conversations.
However, no direct bridge existed between LiveKit and Nova Sonic, two powerful but separate systems. Since LiveKit already provides plugins for OpenAI and Google’s multimodal models, a similar plugin could be built for the Amazon Nova Sonic model.
In telecom, where millions of customer service calls are handled daily, intelligent voice agents can reduce operational costs, shorten wait times, and improve customer satisfaction through personalized, real-time support. Additionally, these AI agents can operate 24/7, eliminating wait times during peak periods and providing consistent service quality regardless of call volume or time of day.
The Challenge
Connecting LiveKit to Nova Sonic involved more than simply linking a microphone to a speaker. These systems operate on different protocols and data formats: LiveKit streams raw PCM audio over WebRTC, while Nova Sonic requires structured audio chunks sent via Amazon Bedrock’s asynchronous streaming API.
Bridging these required not only audio transformation, but also intelligent stream management, event handling, and voice activity detection (VAD) — all executed with minimal latency.
Additionally, while LiveKit supports multimodal media streaming through an official plugin, integrating Nova Sonic’s advanced audio streaming necessitated creating a custom plugin to handle bidirectional audio streams within LiveKit’s multimodal architecture.
Nova Sonic Custom Plugin
The solution involved building a Python plugin acting as a bridge between LiveKit and Nova Sonic within the multimodal framework. This bridge operates through a seamless flow of data:
- LiveKit captures user audio in real time.
- The plugin converts LiveKit’s raw audio stream into chunks compatible with Bedrock’s streaming API.
- Audio chunks stream into Nova Sonic while the plugin simultaneously listens for the model’s streaming output.
- Nova Sonic’s synthesized voice responses stream back through the plugin.
- The plugin forwards the audio response to LiveKit, playing it back to the user.
This architecture creates a natural, fluid conversation loop, enabling interactions that feel closer to speaking with a human assistant than typing into a chatbot.
The plugin implementation follows LiveKit’s multimodal architecture pattern but is specifically tailored for bidirectional audio streaming. The core component manages real-time session initialization with Amazon Bedrock, audio processing workflows, event management between the two platforms, and tool integration for external API access. The bidirectional streaming system is implemented through three coordinated asynchronous tasks that send outgoing audio chunks and control events, process incoming model responses and audio, and manage audio buffering and preprocessing.
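The sketch below illustrates this three-task structure with plain asyncio. The `bedrock_stream` and `livekit_audio_out` objects, along with their `send_audio_event()`, `receive()`, and `play()` methods, are hypothetical placeholders standing in for the real Bedrock streaming wrapper and LiveKit playback sink.

```python
import asyncio

class NovaSonicBridge:
    """Minimal sketch of the three coordinated tasks (component names are illustrative)."""

    def __init__(self, bedrock_stream, livekit_audio_out):
        self._stream = bedrock_stream          # hypothetical wrapper around Bedrock's bidirectional stream
        self._audio_out = livekit_audio_out    # hypothetical LiveKit playback sink
        self._mic_queue: asyncio.Queue = asyncio.Queue()   # raw PCM captured from LiveKit
        self._send_queue: asyncio.Queue = asyncio.Queue()  # preprocessed chunks ready for Bedrock

    async def run(self):
        # The three tasks run concurrently until the session ends or one of them fails.
        await asyncio.gather(
            self._preprocess_task(),
            self._send_task(),
            self._receive_task(),
        )

    async def _preprocess_task(self):
        # Buffer and preprocess microphone audio (e.g., resample and re-chunk to a fixed size).
        while True:
            pcm = await self._mic_queue.get()
            await self._send_queue.put(pcm)    # real code would resample / re-chunk here

    async def _send_task(self):
        # Push outgoing audio chunks and control events to the model.
        while True:
            chunk = await self._send_queue.get()
            await self._stream.send_audio_event(chunk)      # assumed method on the stream wrapper

    async def _receive_task(self):
        # Process incoming model responses: transcripts, tool calls, and synthesized audio.
        async for event in self._stream.receive():          # assumed async iterator of model events
            if event.type == "audio":
                await self._audio_out.play(event.data)      # forward synthesized speech to LiveKit
```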
Interruption detection allows the agent to recognize when a user starts speaking and immediately stop its current response, creating a more natural conversation flow by avoiding talking over the user. The Nova Sonic model’s near-immediate responses make this behavior feel even more natural and user-friendly.
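A minimal sketch of that barge-in behavior is shown below, assuming a hypothetical playback coroutine per response and a callback triggered when voice activity detection (VAD) reports user speech:

```python
import asyncio

class InterruptionHandler:
    """Cancels the assistant's in-progress reply when the user starts speaking (illustrative)."""

    def __init__(self):
        self._playback_task: asyncio.Task | None = None

    def start_response(self, play_coro):
        # Called when Nova Sonic begins streaming a new reply; play_coro plays it back via LiveKit.
        self._playback_task = asyncio.create_task(play_coro)

    def on_user_speech_detected(self):
        # Called by VAD (or a barge-in event from the model) while the assistant is talking.
        if self._playback_task and not self._playback_task.done():
            self._playback_task.cancel()   # stop the current response immediately
            self._playback_task = None     # the next model response starts a fresh turn
```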
The tool integration framework supports function calling through a JSON schema conversion system that transforms Python function definitions into compatible schemas for Nova Sonic, enabling the voice agent to access external systems, databases, or APIs during conversations.
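The following simplified sketch shows the general idea, using Python introspection to turn a typed function into a JSON-schema-style tool description; the exact field names Nova Sonic expects are an assumption here and may differ from the production schema:

```python
import inspect
from typing import get_type_hints

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def function_to_tool_schema(fn) -> dict:
    """Build a JSON-schema-like tool spec from a Python function (field names are illustrative)."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    properties = {
        name: {"type": _JSON_TYPES.get(tp, "string")} for name, tp in hints.items()
    }
    required = [
        name for name, param in inspect.signature(fn).parameters.items()
        if param.default is inspect.Parameter.empty
    ]
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }

def get_data_usage(msisdn: str, month: int) -> dict:
    """Return the subscriber's data usage for a given month (hypothetical tool)."""
    ...

# Example: print(function_to_tool_schema(get_data_usage))
```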
This technical architecture ensures that the plugin can handle the complex requirements of real-time, bidirectional audio streaming while maintaining the flexibility needed for natural conversations and external system integration.
To ensure compatibility with the Nova Sonic model via Amazon Bedrock, the environment was configured with Python 3.12, which is required for the aws_sdk_bedrock_runtime package to function correctly with streaming support. The plugin uses the aws_sdk_bedrock_runtime version 0.0.1 for interacting with the Bedrock streaming API. For real-time media transport and plugin integration, LiveKit version 0.20.4 and livekit-agents version 0.12.17 were used, providing stable support for multimodal streaming and low-latency audio handling. A plugin for the latest version of LiveKit has also been developed, ensuring compatibility with its new features.
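For reference, the environment described above corresponds roughly to the following pinned dependencies (Python 3.12 runtime assumed; exact pins may vary by deployment):

```
# requirements.txt (illustrative pins, Python 3.12)
aws_sdk_bedrock_runtime==0.0.1
livekit==0.20.4
livekit-agents==0.12.17
```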
Agent Implementation Highlights
Leveraging the existing multimodal architecture in LiveKit, which supports concurrent streaming and processing of diverse media, a similar architecture was developed specifically for integrating the Nova Sonic model. This integration supports bidirectional audio streaming and dynamic agent responses within the LiveKit ecosystem.
The custom plugin hooks into LiveKit’s media stream, placing audio chunks into an asynchronous queue. This design allows smooth buffering and decouples input capture from downstream processing.
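A minimal sketch of that capture path is shown below. It assumes the livekit Python SDK’s `rtc.AudioStream` yields audio frame events for a subscribed track; exact API details may vary across livekit versions.

```python
import asyncio
from livekit import rtc

async def capture_audio(track: rtc.Track, mic_queue: asyncio.Queue) -> None:
    """Read raw PCM frames from a subscribed LiveKit track and buffer them for the plugin."""
    audio_stream = rtc.AudioStream(track)
    async for event in audio_stream:            # each event carries one decoded audio frame
        frame = event.frame                     # 16-bit PCM samples plus sample-rate/channel info
        await mic_queue.put(bytes(frame.data))  # decouple capture from downstream processing
```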
Using Amazon Bedrock’s streaming API, the plugin pushes audio chunks to Nova Sonic and listens for its output in real time. This bidirectional streaming enables the assistant to begin speaking even while continuing to receive input — facilitating a truly conversational experience.
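Before being sent over the bidirectional stream, each outgoing chunk is wrapped in a structured JSON event containing base64-encoded PCM. The field names in this sketch follow Amazon’s published Nova Sonic samples and should be treated as assumptions:

```python
import base64
import json

def make_audio_input_event(pcm_chunk: bytes, prompt_name: str, content_name: str) -> str:
    """Wrap a raw 16 kHz, 16-bit mono PCM chunk in the JSON event expected by
    Nova Sonic's bidirectional stream (field names assumed from AWS sample code)."""
    return json.dumps({
        "event": {
            "audioInput": {
                "promptName": prompt_name,    # identifies the active prompt/session
                "contentName": content_name,  # identifies the audio content block
                "content": base64.b64encode(pcm_chunk).decode("utf-8"),
            }
        }
    })
```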
When Nova Sonic generates a response (in text or audio), the plugin streams the audio back to the user via LiveKit. If only text is returned, Amazon Polly is used for high-quality text-to-speech synthesis. This closed-loop architecture delivers natural, human-like replies directly within the LiveKit call.
The agent can also be enhanced by leveraging external tools and accessing external knowledge bases (KBs). By integrating APIs and retrieval-augmented generation (RAG) systems, the plugin enables the voice agent to fetch real-time information, perform actions, or answer complex queries, making interactions more intelligent and task-oriented.
To deliver responses such as greetings or onboarding instructions, the plugin uses Amazon Polly for high-quality text-to-speech (TTS) synthesis. Polly ensures the assistant maintains responsive, human-like voice output whenever it needs to speak directly from text, such as for initial context or scripted prompts.
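A minimal sketch of that TTS path using boto3 is shown below; the voice ID and sample rate are illustrative choices rather than the project’s actual configuration:

```python
import boto3

polly = boto3.client("polly")

def synthesize_greeting(text: str) -> bytes:
    """Convert a text prompt (e.g., the initial greeting) to raw PCM audio with Amazon Polly."""
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="pcm",      # raw 16-bit PCM, convenient for streaming into LiveKit
        SampleRate="16000",
        VoiceId="Joanna",        # illustrative voice choice
    )
    return response["AudioStream"].read()

# audio = synthesize_greeting("Hello, I am an AI Agent. How can I help you?")
```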
Plugin-level error monitoring with Amazon CloudWatch is also included. To ensure stable and reliable operation, the custom plugin performs basic error detection and reporting via Amazon CloudWatch, capturing and logging issues such as streaming failures, unexpected model responses (e.g., timeouts or malformed transcripts), and audio encoding/decoding errors. CloudWatch logs allow engineers to diagnose issues in production and set alerts for critical failures, helping the voice assistant remain operational through partial system faults.
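A hedged sketch of pushing such plugin-level errors to CloudWatch Logs with boto3 is shown below; the log group and stream names are illustrative, and both are assumed to already exist:

```python
import time
import boto3

logs = boto3.client("logs")
LOG_GROUP = "/voice-agent/nova-sonic-plugin"   # illustrative names
LOG_STREAM = "plugin-errors"

def report_plugin_error(kind: str, detail: str) -> None:
    """Send a structured error record (streaming failure, malformed transcript, codec error)
    to CloudWatch Logs so alarms can be configured on it."""
    logs.put_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=LOG_STREAM,
        logEvents=[{
            "timestamp": int(time.time() * 1000),   # CloudWatch expects epoch milliseconds
            "message": f"{kind}: {detail}",
        }],
    )

# report_plugin_error("streaming_failure", "Bedrock bidirectional stream closed unexpectedly")
```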
Integrating Telephony
To enable real-world voice interactions with customers over the public switched telephone network (PSTN), the system integrates with two telephony platforms:
- Inbound Calls with Twilio: For incoming customer calls, Twilio receives calls via SIP and connects them into a LiveKit audio session. This allows the plugin to handle real-time conversations from regular phone lines, seamlessly converting PSTN audio to WebRTC for processing. Call transferring can also be enabled in Twilio if needed.
- Outbound Calls with Amazon Chime SDK: For initiating outbound calls, the system uses Amazon Chime integration. Calls are placed using the Chime SDK and bridged into a LiveKit session where Nova Sonic handles real-time responses; a minimal outbound-call sketch follows this list.
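As referenced above, placing an outbound call through the Chime SDK Voice API with boto3 might look roughly as follows; the phone numbers and SIP media application ID are placeholders:

```python
import boto3

chime = boto3.client("chime-sdk-voice")

def start_outbound_call(to_number: str) -> str:
    """Place an outbound PSTN call through a Chime SIP media application;
    the application then bridges the call into a LiveKit session."""
    response = chime.create_sip_media_application_call(
        FromPhoneNumber="+15550100000",              # placeholder provisioned number
        ToPhoneNumber=to_number,
        SipMediaApplicationId="sma-app-id-placeholder",
    )
    return response["SipMediaApplicationCall"]["TransactionId"]
```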
This telephony bridge makes it possible to deploy the real-time AI assistant into production environments, supporting both incoming and outgoing customer communications.
Use Cases
This integrated system unlocks a variety of real-world applications:
- Voice-based customer support that feels like real human agents helps users resolve issues like billing disputes or slow internet. These agents can listen actively, respond to interruptions, and handle multi-turn conversations. They can also escalate to human support or trigger automated backend actions (e.g., transferring to IVR, getting subscriber-specific data) when necessary. This enhances customer satisfaction and reduces the load on live agents.
- Voice-driven technical assistants for internal tools or automation that guide users through modem resets or network diagnostics. For telecom technicians, field agents, or call center staff, these assistants can walk users through troubleshooting steps, validate device statuses via APIs, or even execute remote commands. This eliminates the need for detailed manuals or extensive training and reduces resolution time.
- Conversational knowledge access over retrieval-augmented generation (RAG) knowledge systems (e.g., FAQs, internet plans, device manuals). Users can ask natural language questions like “How do I change my Wi-Fi password?” or “What plans are available?” The system fetches accurate, personalized answers by combining semantic search with generative AI. This improves accessibility to support content without requiring users to browse complex help sites.
Thanks to LiveKit’s scalable infrastructure, this architecture can support many concurrent users interacting in real time. By combining LiveKit’s real-time communication engine with the intelligence and streaming capabilities of Amazon Bedrock’s Nova Sonic, it’s now possible to build voice agents that feel natural, smart, and human-aware.
Creating a custom plugin to bridge these technologies within LiveKit’s multimodal framework enabled rich, dynamic conversations that push the boundaries of real-time voice AI.
Telecom Use Case Example
Imagine this interaction:
Assistant: “Hello, I am an AI Agent. How can I help you?”
User: “I’m having issues with my internet connection.”
(LiveKit streams audio to Nova Sonic)
Assistant: “I’m really sorry to hear that you’re experiencing problems with your internet connection. Let’s go ahead and troubleshoot this together.”
Assistant: “Can you tell me if any lights are blinking on your router? If not ..”
(User interrupts mid-sentence)
User: “Yeah, the Internet light is red.”
(model detects interruption, assistant stops response, starts a new one)
Assistant: “Thanks for letting me know. That usually indicates a service disruption.
Can you try restarting your router by unplugging it, waiting 10 seconds, and plugging it back in?”
User: “Okay, done.”
Assistant: “Great. Let’s wait a minute for the router to fully reboot.
If the light turns green or white, your connection should be restored.”
(Pause for a moment)
User: “The light is green now.”
Assistant: “Perfect! Looks like the connection is back.
If it drops again, feel free to reach out. Thanks for your patience!”
General Use Case Example
For a generic use case, we have selected an airline booking scenario where the virtual agent can seamlessly engage in a natural conversation with the user.
Agent: Thank you for calling SkyJet Airlines. This is Maria, your virtual assistant. How can I help you today?
Customer: Hi, I’d like to book a flight for next week. Can you help me with that?
Agent: Of course! I’d be happy to help. May I know your departure city, destination, and preferred travel dates?
Customer: I’m flying from New York to Los Angeles. I’d like to leave on June 12 and return on June 18.
Agent: Got it! Do you have a preferred time of day for departure and return? Morning, afternoon, or evening?
Customer: Morning flights would be great, if possible.
Agent: Perfect. Let me check the available options for morning flights on those dates. One moment please…
Okay, I found a few options:
Delta Airlines – Departing June 12 at 8:05 AM, returning June 18 at 9:10 AM.
American Airlines – Departing June 12 at 7:45 AM, returning June 18 at 10:30 AM.
JetBlue – Departing June 12 at 9:20 AM, returning June 18 at 8:55 AM.
Would you like me to hold a seat on one of these?
Customer: The Delta one sounds good. What’s the price?
Agent: The Delta round-trip fare is currently $385, including taxes and fees. This is for economy class with one checked bag included.
Customer: Okay, go ahead and book that for me.
Agent: Sure! May I have your full name as it appears on your ID, date of birth, and contact email?
Customer: Yes, it’s Alex Turner, born March 14, 1985. My email is user@example.com
Agent: Thank you, Alex. Your flight is now booked. You’ll receive a confirmation email shortly with your itinerary and e-ticket number. Is there anything else I can help you with?
Customer: No, that’s all. Thanks for your help!
Agent: You’re very welcome. Have a great trip to Los Angeles!
This seamless flow is only possible because the system can listen, process, and speak simultaneously, with awareness of when to pause or adapt based on user interruptions.
Conclusion
This article is part of a series dedicated to exploring persona-driven AI Agent solutions with chat and voice skills using Amazon Nova Sonic. Upcoming articles will provide detailed implementation insights, share best practices, and explore strategies to enhance the performance and scalability of these AI Agents across various industries.
The solution was developed by the team at Innovation Incubator — Riya George, Vishal Raju, Stalin Pereira, Kailash Prem, Albin Varun and Anil Augustine — under the guidance of Sandeep Kumar Siddhanti, Thomas George, Archana Sekhar, and Antony Satyadas, with support from Tony Stout (CDG) and AWS.
For more information, contact thomas.george@innovationincubator.com, vishal.r@iinerds.com, antony@innovationincubator.com, tony.stout@cdg.us