Building an AI Call Assistant: Chapter 2 - The Intelligence Layer
This post is the second in a three-part series on building an AI call assistant, detailing how to build the "Intelligence layer" that processes transcripts, maintains conversation context, and generates intelligent responses by integrating pluggable LLM providers with a knowledge base and session orchestration.
Series Overview: We're building an AI-powered call assistant that acts as your personal representative handling incoming calls, answering questions from your knowledge base, and managing interactions while you stay in control. This three-part series breaks down the challenge into core components:
Chapter 1: The Listening Layer ✅ : Accepting inbound calls and performing real-time Speech-to-Text transcription
Chapter 2: The Intelligence Layer (this post) : Understanding context with LLMs, managing conversation state, and generating intelligent responses
Chapter 3: The Voice Layer 🚧 : Speaking back to callers with low-latency Text-to-Speech
In Chapter 1, we built the foundation for receiving and understanding phone calls. We set up Twilio to receive calls and stream audio via WebSockets to our FastAPI server. We also implemented a pluggable STT (Speech-to-Text) architecture that allowed us to listen to the caller using ElevenLabs or Mistral, printing their words to the terminal in real-time.
The Current State: We can hear what callers say, but our assistant can't "think" or respond.
In this chapter, we build the "Intelligence Layer."
To give our assistant a mind of its own, we need to introduce several new abstractions into our clean architecture. These all sit in the app/services/ directory:
llm/ (The Pre-frontal Cortex): A pluggable interface for LLMs (Mistral, Gemini) mirroring our STT factory approach.
knowledge/ (Long-term Memory): A simple YAML-backed knowledge base (we're keeping it simple) so the AI knows facts about you or your business.
conversation/ (Short-term Memory): A manager that tracks the back-and-forth history of a specific call.
orchestrator/ (The Conductor): The logic that ties everything together. It handles the CallSession, decides when to respond, and delegates tasks to the LLM.
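On disk, these pieces sit alongside the STT services from Chapter 1. A rough layout (file names beyond the modules listed above are assumptions):

```
app/services/
├── llm/
│   ├── base.py       # BaseLlmClient interface
│   └── ...           # provider implementations + factory
├── knowledge/        # SimpleKnowledgeProvider (YAML-backed)
├── conversation/     # ConversationManager
└── orchestrator/     # CallSession + ConversationAgent
```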
Just like we did with STT providers, we want the ability to swap the underlying large language model without rewriting our application logic. We define a BaseLlmClient interface:
app/services/llm/base.py
```python
from abc import ABC, abstractmethod
from typing import AsyncIterator, Optional

from app.schemas.llm import ChatMessage, LlmResponse


class BaseLlmClient(ABC):
    @abstractmethod
    async def generate_stream(
        self,
        messages: list[ChatMessage],
        system_prompt: Optional[str] = None,
    ) -> AsyncIterator[str]:
        """Stream response tokens. Crucial for low-latency voice later."""
        ...
```
We chose to implement two providers out of the gate: Mistral and Gemini. Both implementations follow this interface, and we use a Factory (LlmFactory) to instantiate the desired client based on the LLM_PROVIDER in our .env file.
Switching models is a one-line config change:
LLM_PROVIDER=gemini # Or mistral
GEMINI_LLM_MODEL=gemini-2.5-flash
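The factory itself can stay tiny: a registry keyed on the provider name. Here's a minimal sketch of how such a dispatch might look (the class bodies and registry shape are illustrative assumptions, not the repo's exact code):

```python
# Hypothetical sketch of a provider factory keyed on the LLM_PROVIDER
# setting. MistralLlmClient / GeminiLlmClient stand in for the real clients.

class MistralLlmClient:
    provider = "mistral"

class GeminiLlmClient:
    provider = "gemini"

class LlmFactory:
    _registry = {
        "mistral": MistralLlmClient,
        "gemini": GeminiLlmClient,
    }

    @classmethod
    def create(cls, provider: str):
        try:
            return cls._registry[provider]()
        except KeyError:
            raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}") from None

print(LlmFactory.create("gemini").provider)  # gemini
```

Because both clients satisfy `BaseLlmClient`, nothing downstream needs to know which one is active.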
An AI assistant isn't much use answering calls on your behalf if it doesn't know anything about you. We implemented a SimpleKnowledgeProvider that loads facts from a YAML file (data/knowledge.yaml).
data/knowledge.yaml
```yaml
business_name: "Smooth Operator Consulting"
operating_hours: "Monday through Friday, 9:00 AM to 5:00 PM Eastern Time."
services:
  - name: "Software Architecture Review"
    price: "$500 per session"
    description: "A comprehensive 2-hour review of your system architecture."
faq:
  - question: "Are you accepting new clients?"
    answer: "Yes, currently accepting new clients for Q3."
```
When a call starts, the knowledge provider reads this file and formats it into a dense string. This context is injected into the System Prompt of the LLM, giving it the ground-truth facts it needs to successfully answer caller inquiries like "How much does a review cost?"
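Flattening the YAML into prompt text can be as simple as walking the parsed dict. A rough sketch (the exact formatting is an assumption; in the real app the dict would come from `yaml.safe_load()` on data/knowledge.yaml):

```python
# Sketch: flatten the parsed knowledge.yaml (already a Python dict here)
# into a dense context string for injection into the system prompt.

def format_knowledge(kb: dict) -> str:
    lines = [
        f"Business: {kb['business_name']}",
        f"Hours: {kb['operating_hours']}",
    ]
    for svc in kb.get("services", []):
        lines.append(f"Service: {svc['name']}, {svc['price']}. {svc['description']}")
    for item in kb.get("faq", []):
        lines.append(f"Q: {item['question']} A: {item['answer']}")
    return "\n".join(lines)

kb = {
    "business_name": "Smooth Operator Consulting",
    "operating_hours": "Monday through Friday, 9:00 AM to 5:00 PM Eastern Time.",
    "services": [{
        "name": "Software Architecture Review",
        "price": "$500 per session",
        "description": "A comprehensive 2-hour review of your system architecture.",
    }],
    "faq": [{
        "question": "Are you accepting new clients?",
        "answer": "Yes, currently accepting new clients for Q3.",
    }],
}

print(format_knowledge(kb))
```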
The most complex part of a real-time voice AI system is orchestration. When does the AI speak? How does it remember what was said?
We solved this through the CallSession class. Think of this as the lifecycle manager for a single phone call. It bridges the Twilio WebSocket (audio bytes) with the ConversationAgent (AI logic).
Let's look at a significantly simplified excerpt of the Twilio WebSocket handler:
app/services/twilio/stream.py
```python
async def handle_twilio_stream(ws: WebSocket) -> None:
    await ws.accept()
    session = None
    # ... message loop ...
    if event.event == "start":
        call_sid = event.start.callSid
        session = CallSession(call_sid)
        await session.start()
    elif event.event == "media":
        # Fast path: send raw audio directly to the session's STT processor
        await session.process_audio(raw["media"]["payload"])
```
Notice how clean the stream handler is now? All the heavy lifting (managing the STT receive loop, maintaining conversational state) is delegated to CallSession. Let's peek inside that session:
app/services/orchestrator/session.py
```python
class CallSession:
    async def _on_transcript(self, kind: str, text: str) -> None:
        if not text.strip():
            return
        print(f"[{self.call_sid}] {kind.upper()}: {text}")
        # If we have a committed transcript, ask the LLM what to say
        if self.agent:
            response_stream = await self.agent.process_transcript(
                call_sid=self.call_sid,
                kind=kind,
                text=text,
            )
            # Stream the generated text chunk by chunk to the terminal
            if response_stream:
                print(f"[{self.call_sid}] ASSISTANT: ", end="", flush=True)
                async for chunk in response_stream:
                    print(chunk, end="", flush=True)
                print()
```
The ConversationAgent is where transcriptions turn into intelligent responses. It uses a ConversationManager to maintain a running buffer of messages (user and assistant turns), ensuring the LLM always has the full context of the current call.
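The ConversationManager can be little more than a capped, per-call list of role-tagged messages. A minimal sketch (the class shape is an assumption based on the description above):

```python
from collections import defaultdict

# Minimal sketch of per-call short-term memory. The append/get interface
# and the turn cap are assumptions, not the repo's exact code.

class ConversationManager:
    """Per-call history: an ordered, bounded list of role-tagged messages."""

    def __init__(self, max_turns: int = 20):
        self.max_turns = max_turns
        self._history: dict[str, list[dict]] = defaultdict(list)

    def append(self, call_sid: str, role: str, text: str) -> None:
        history = self._history[call_sid]
        history.append({"role": role, "content": text})
        # Drop the oldest turns so the prompt never grows without bound
        del history[:-self.max_turns]

    def get(self, call_sid: str) -> list[dict]:
        return list(self._history[call_sid])

mgr = ConversationManager(max_turns=2)
mgr.append("CA123", "user", "Hi, are you open?")
mgr.append("CA123", "assistant", "Yes, Monday through Friday, 9 to 5 Eastern.")
mgr.append("CA123", "user", "Great, how much is a review?")
print(len(mgr.get("CA123")))  # 2: the oldest turn was evicted
```

Bounding the buffer matters for voice: every extra turn in the prompt adds tokens, and tokens add latency.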
When a transcript (specifically a committed transcript, meaning the caller has paused) enters process_transcript(), the Agent does the following:
Check Decision Logic: Does this transcript warrant a response? We use should_respond(kind) to ignore partial transcripts so we don't trip over the caller mid-sentence.
Update Memory: Appends the caller's text to the conversation history.
Assemble Prompt: Fetches the persistent System Prompt (which includes the loaded Knowledge Base YAML).
Generate: Dispatches to the LlmFactory's active client and passes the prompt and history.
Update Memory: Saves the generated response back to history as an assistant message.
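Put together, the five steps above can be sketched roughly like this. FakeLlm and ListManager are test stand-ins, and the method names are illustrative assumptions about the real agent:

```python
import asyncio

class FakeLlm:
    """Stand-in for the factory's client; yields canned tokens."""
    async def generate_stream(self, messages, system_prompt=None):
        for token in ["Hello! ", "How can I help?"]:
            yield token

class ListManager:
    """Stand-in for the ConversationManager."""
    def __init__(self):
        self.turns = []
    def append(self, call_sid, role, text):
        self.turns.append((role, text))
    def get(self, call_sid):
        return [{"role": r, "content": t} for r, t in self.turns]

class ConversationAgent:
    def __init__(self, llm, manager, system_prompt):
        self.llm = llm
        self.manager = manager
        self.system_prompt = system_prompt  # already includes the knowledge base

    @staticmethod
    def should_respond(kind: str) -> bool:
        # Step 1: only committed transcripts trigger a reply
        return kind == "committed"

    async def process_transcript(self, call_sid, kind, text):
        if not self.should_respond(kind):
            return None
        self.manager.append(call_sid, "user", text)   # Step 2: update memory
        history = self.manager.get(call_sid)          # Step 3: assemble prompt

        async def _stream():
            parts = []
            # Step 4: generate from system prompt + history, streaming out
            async for chunk in self.llm.generate_stream(history, self.system_prompt):
                parts.append(chunk)
                yield chunk
            # Step 5: save the full reply back into history
            self.manager.append(call_sid, "assistant", "".join(parts))

        return _stream()

async def demo():
    agent = ConversationAgent(FakeLlm(), ListManager(), "You are helpful.")
    stream = await agent.process_transcript("CA1", "committed", "Hi there.")
    print("".join([c async for c in stream]))  # Hello! How can I help?

asyncio.run(demo())
```

Note that process_transcript() returns an async iterator rather than a finished string, matching how CallSession awaits it and then iterates; the assistant's reply is only written back to memory once the stream is fully consumed.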
Testing voice AI is notoriously annoying if you have to call a real phone number every time you change a line of code.
To fix this, we created a fantastic developer tool: scripts/simulate_brain.py
This script allows you to feed a pre-recorded .wav file into the full AI pipeline (STT -> Intelligence Layer(LLM) -> output) directly from your terminal, bypassing Twilio entirely.
```shell
# Record an audio clip
uv run python -m scripts.record_audio "scripts/test_call.wav"

# Send it through the LLM
uv run python -m scripts.simulate_brain "scripts/test_call.wav"
```
Output:
```
Using STT Provider: mistral
Using LLM Provider: gemini

Partial: Hi, I'm calling
Committed: Hi, I'm calling to ask about your services.
[SIMULATE_BRAIN] ASSISTANT: Hello! I'm an AI assistant for Smooth Operator Consulting. How can I help you today? Are you interested in our Software Architecture Review?
```
Boom. A fully functional, thinking AI call assistant, running locally.