Cloud & LLM Architecture¶
4Geeks AI Agents is built on a modern, cloud-native architecture designed for high availability, low latency, and enterprise-grade security. This document provides a comprehensive technical overview of the platform’s infrastructure, data flow, and system design.
High-Level Architecture¶
The 4Geeks AI Agents platform is composed of five core layers that work together to deliver intelligent, automated business processes:
graph TB
subgraph Client["Client Layer"]
WA[WhatsApp Business]
PH[Phone / PSTN]
WEB[Web / Console]
API[Custom API Clients]
end
subgraph Channel["Channel Layer"]
WAPI[WhatsApp Business API]
VAPI[VAPI Telephony]
REST[REST API Gateway]
end
subgraph Core["Core Orchestration Layer"]
AGENT[Agent Runtime Engine]
WORKFLOW[Workflow Engine]
HITL[Human-in-the-Loop Orchestrator]
PROMPT[Prompt Management System]
end
subgraph AI["AI / Intelligence Layer"]
GATEWAY[Private AI Gateway]
RAG[RAG Pipeline]
EMBED[Embedding Service]
WA[Website Analyzer]
end
subgraph Infra["Infrastructure Layer"]
SUPA[Supabase Platform]
EDGE[Edge Functions]
STORAGE[Object Storage]
DB[(PostgreSQL + pgvector)]
QUEUE[Message Queue]
CACHE[Redis Cache]
end
WA --> WAPI
PH --> VAPI
WEB --> REST
API --> REST
WAPI --> AGENT
VAPI --> AGENT
REST --> AGENT
AGENT --> WORKFLOW
AGENT --> HITL
AGENT --> PROMPT
AGENT --> GATEWAY
AGENT --> RAG
GATEWAY --> LLM1[OpenAI GPT-4o]
GATEWAY --> LLM2[Anthropic Claude]
GATEWAY --> LLM3[Google Gemini]
GATEWAY --> LLM4[Meta Llama]
GATEWAY --> LLM5[Groq LPU]
RAG --> EMBED
EMBED --> DB
RAG --> DB
WA --> GATEWAY
AGENT --> EDGE
WORKFLOW --> EDGE
EDGE --> SUPA
STORAGE --> SUPA
DB --> SUPA Layer-by-Layer Breakdown¶
1. Client Layer¶
The Client Layer represents all the external touchpoints through which end-users interact with AI Agents. This layer is protocol-agnostic, meaning the core platform does not care whether a message arrives via WhatsApp, a phone call, or a direct API call.
| Channel | Protocol | Direction | Description |
|---|---|---|---|
| WhatsApp Business | HTTPS (Meta Cloud API) | Inbound + Outbound | Text messages, media, interactive buttons, and templates delivered through the official WhatsApp Business Platform |
| Phone / PSTN | SIP / WebRTC (via VAPI) | Inbound + Outbound | Voice calls with real-time speech-to-text and text-to-speech conversion |
| Web Console | HTTPS / WebSocket | Bidirectional | The 4Geeks dashboard for agent management, monitoring, and the playground testing environment |
| Custom API | REST (OpenAI-compatible) | Bidirectional | Programmatic access for custom integrations using a standard OpenAI-compatible API format |
2. Channel Layer¶
The Channel Layer acts as a protocol adapter, normalizing incoming messages from different communication channels into a unified internal format. This abstraction allows the Core Orchestration Layer to process all messages identically, regardless of their origin.
WhatsApp Business API Adapter¶
The WhatsApp adapter connects to Meta’s Cloud API and handles:
- Webhook Management — Receives real-time message delivery events from Meta
- Message Normalization — Converts WhatsApp message formats (text, image, audio, document, interactive) into the internal unified message schema
- Template Management — Manages pre-approved WhatsApp message templates for outbound proactive messages
- Session Management — Tracks 24-hour conversation windows as required by Meta’s policies
- Media Handling — Downloads and processes media files (images, audio, documents) sent via WhatsApp
- Delivery Receipts — Tracks message delivery and read status
VAPI Telephony Adapter¶
The Voice API (VAPI) adapter manages all voice interactions:
- SIP Trunking — Establishes and terminates phone calls via Session Initiation Protocol
- Audio Streaming — Streams bidirectional audio between the caller and the AI processing pipeline
- DTMF Detection — Captures touch-tone inputs from the caller’s keypad
- Call Transfer — Routes calls to human agents when escalation is needed
- Call Recording — Records calls for quality assurance and compliance (with consent)
- Phone Number Management — Provisions and manages dedicated phone numbers across countries
3. Core Orchestration Layer¶
The Core Orchestration Layer is the brain of the platform. It receives normalized messages from the Channel Layer and coordinates all processing, decision-making, and response generation.
Agent Runtime Engine¶
The Agent Runtime Engine is responsible for executing the lifecycle of each agent interaction:
sequenceDiagram
participant C as Channel
participant A as Agent Runtime
participant P as Prompt Manager
participant R as RAG Pipeline
participant G as AI Gateway
participant W as Workflow Engine
participant H as HITL Orchestrator
C->>A: Incoming message (normalized)
A->>A: Identify agent & load configuration
A->>P: Retrieve system prompt + conversation history
P-->>A: Full prompt context
A->>R: Query knowledge base (if configured)
R-->>A: Retrieved context chunks
A->>A: Assemble final prompt (system + context + history + user message)
A->>G: Send to LLM via Private AI Gateway
G-->>A: LLM response
A->>A: Parse response for tool calls
alt Tool calls detected
A->>W: Execute workflow/tool
W-->>A: Tool execution result
A->>G: Send follow-up with tool results
G-->>A: Final LLM response
end
alt Confidence below threshold
A->>H: Escalate to human
H-->>A: Human resolution
A->>A: Learn from resolution
end
A-->>C: Send response (normalized back to channel format) Key responsibilities:
- Agent Identification — Routes incoming messages to the correct agent based on channel, phone number, or conversation context
- Context Assembly — Gathers all relevant context including system prompt, conversation history, knowledge base results, and user metadata
- Multi-Turn Conversation Management — Maintains conversation state across multiple message exchanges with configurable memory windows
- Tool Call Orchestration — Detects when the LLM wants to execute a tool (CRM lookup, calendar booking, etc.) and manages the execution pipeline
- Confidence Scoring — Evaluates the agent’s confidence in its response and triggers human escalation when confidence falls below configurable thresholds
- Rate Limiting — Enforces per-agent and per-user rate limits to prevent abuse and manage costs
- Error Handling — Implements retry logic, fallback responses, and graceful degradation when services are unavailable
Workflow Engine¶
The Workflow Engine executes predefined business logic and tool integrations:
- Trigger Evaluation — Evaluates event triggers (new message, scheduled time, external webhook) to initiate workflows
- Action Execution — Runs actions such as CRM updates, calendar bookings, email sending, and webhook calls
- Conditional Logic — Supports branching, loops, and conditional execution based on data values
- State Management — Persists workflow state across long-running processes (e.g., multi-step approval flows)
- Error Recovery — Automatically retries failed actions with exponential backoff
- Execution Logging — Records every step of workflow execution for auditing and debugging
Human-in-the-Loop (HITL) Orchestrator¶
The HITL Orchestrator manages the interaction between AI agents and human team members:
- Escalation Routing — Routes complex or low-confidence interactions to the appropriate human specialist
- Queue Management — Maintains priority queues for human review based on urgency, customer tier, and wait time
- Context Handoff — Provides human reviewers with full conversation context, AI reasoning, and suggested actions
- Learning Loop — Captures human resolutions and feeds them back into the agent’s training pipeline
- SLA Monitoring — Tracks escalation response times and alerts management when SLAs are at risk
Prompt Management System¶
The Prompt Management System handles all aspects of prompt engineering and versioning:
- Prompt Versioning — Maintains version history for all system prompts with rollback capability
- A/B Testing — Supports running multiple prompt variants simultaneously to measure performance
- Dynamic Injection — Injects runtime variables (customer name, order details, appointment times) into prompts
- Template Library — Provides pre-built prompt templates for common agent types (Sales, Support, Receptionist, etc.)
- Website Analysis Integration — Auto-generates initial prompts by analyzing business websites using Groq AI compound models
4. AI / Intelligence Layer¶
The AI / Intelligence Layer provides the machine learning capabilities that power agent intelligence.
Private AI Gateway¶
The Private AI Gateway is the central routing and governance layer for all LLM interactions. It provides:
- Unified API — A single OpenAI-compatible API endpoint that routes to 100+ models across multiple providers
- Multi-Model Orchestration — Dynamically selects the optimal model for each request based on task type, latency requirements, and cost constraints
- Automatic Fallbacks — If a provider experiences downtime, requests are automatically rerouted to an equivalent model
- Load Balancing — Distributes requests across multiple API keys and endpoints to avoid rate limits
- Real-Time Token Auditing — Tracks every token consumed with granular attribution to specific agents, workflows, and customers
- Budget Enforcement — Enforces hard spending limits with automated alerts at 50%, 80%, and 100% thresholds
- Zero Data Retention — Prioritizes providers that guarantee no data is stored or used for model training
- Context Scrutiny — Applies guardrails to sanitize sensitive information before sending data to external LLMs
RAG Pipeline¶
The Retrieval-Augmented Generation (RAG) pipeline enables agents to answer questions using proprietary business documents. See the Knowledge Base documentation for full details.
Pipeline stages:
- Ingestion — Documents are uploaded and text is extracted (supporting PDF, DOCX, TXT, CSV, MD)
- Chunking — Text is split into semantic chunks of approximately 500 tokens each, preserving paragraph and section boundaries
- Embedding — Each chunk is converted into a high-dimensional vector embedding using a dedicated embedding model
- Indexing — Embeddings are stored in a pgvector-enabled PostgreSQL database with HNSW (Hierarchical Navigable Small World) indexes for fast approximate nearest-neighbor search
- Retrieval — At query time, the user’s message is embedded and compared against stored vectors using cosine similarity
- Re-ranking — Retrieved chunks are re-ranked using a cross-encoder model for improved relevance
- Context Injection — The top-ranked chunks are injected into the LLM prompt as grounding context
Retrieval configuration:
| Parameter | Default | Description |
|---|---|---|
| Top-K | 5 | Number of chunks to retrieve |
| Similarity Threshold | 0.7 | Minimum cosine similarity score |
| Max Context Length | 3,000 tokens | Maximum total tokens from knowledge base |
| Chunk Size | ~500 tokens | Target size for each text chunk |
| Chunk Overlap | 50 tokens | Overlap between consecutive chunks |
Embedding Service¶
The Embedding Service converts text into vector representations:
- Model — Uses OpenAI
text-embedding-3-small(1,536 dimensions) as the default embedding model - Batching — Processes multiple chunks in parallel for efficient document ingestion
- Normalization — Applies L2 normalization to embeddings for consistent cosine similarity calculations
- Versioning — Supports embedding model versioning to allow re-embedding when models are upgraded
Website Analyzer¶
The Website Analyzer uses Groq AI compound models to automatically generate system prompts:
- Web Scraping — Custom scraper extracts text content from target websites
- AI Analysis — Groq’s compound model analyzes content to identify business context, services, tone, and FAQs
- Web Search — Supplementary web search gathers additional business context
- Prompt Generation — A structured system prompt is generated incorporating all extracted information
5. Infrastructure Layer¶
The Infrastructure Layer provides the foundational compute, storage, and networking services.
Supabase Platform¶
4Geeks AI Agents leverages Supabase as the primary backend platform:
| Component | Purpose |
|---|---|
| PostgreSQL | Primary relational database for agent configurations, conversation logs, user accounts, and workflow state |
| pgvector | Vector extension for PostgreSQL enabling semantic search over knowledge base embeddings |
| Supabase Storage | Object storage for uploaded knowledge base files, call recordings, and media attachments |
| Supabase Auth | Authentication and authorization for the console dashboard and API access |
| Edge Functions | Serverless Deno-based functions for real-time processing (VAPI tool handling, knowledge base processing, webhook handling) |
| Realtime | WebSocket-based real-time subscriptions for live dashboard updates |
Edge Functions¶
The platform runs several critical Deno-based edge functions:
| Function | Purpose |
|---|---|
| vapi-tool-handler | Processes tool calls from voice agents during phone calls (calendar booking, CRM lookups, call transfers) |
| process-knowledge | Handles document ingestion pipeline: text extraction, chunking, embedding, and storage |
| whatsapp-webhook | Receives and processes incoming WhatsApp messages from Meta’s Cloud API |
| workflow-trigger | Evaluates and executes workflow triggers based on incoming events |
| website-analyzer | Scrapes websites and generates system prompts using Groq AI |
Message Queue¶
An internal message queue decouples components and ensures reliable processing:
- Asynchronous Processing — Long-running tasks (document processing, campaign calls) are queued for background execution
- Retry Logic — Failed messages are automatically retried with configurable retry policies
- Dead Letter Queue — Messages that fail after maximum retries are routed to a dead letter queue for manual inspection
- Ordering Guarantees — Conversation messages are processed in order to maintain context integrity
Redis Cache¶
Redis provides low-latency caching for frequently accessed data:
- Session State — Stores active conversation sessions for fast context retrieval
- Rate Limiting — Implements sliding-window rate limits for API and agent access
- Configuration Cache — Caches agent configurations to reduce database load
- Pub/Sub — Enables real-time communication between distributed components
Data Flow¶
Text Message Flow (WhatsApp / API)¶
sequenceDiagram
participant U as End User
participant WA as WhatsApp / API
participant GW as API Gateway
participant AR as Agent Runtime
participant KB as Knowledge Base
participant LLM as LLM (via Gateway)
participant DB as Database
U->>WA: Sends text message
WA->>GW: Webhook / API request
GW->>GW: Authenticate & rate limit
GW->>AR: Forward normalized message
AR->>DB: Load conversation history
AR->>KB: Semantic search for relevant context
KB-->>AR: Return top-K chunks
AR->>AR: Assemble prompt (system + context + history + message)
AR->>LLM: Request completion
LLM-->>AR: Generate response
AR->>DB: Save interaction & update credits
AR-->>GW: Return response
GW-->>WA: Deliver response
WA-->>U: Display response Voice Call Flow (Phone)¶
sequenceDiagram
participant U as Caller
participant V as VAPI (Telephony)
participant DG as Deepgram (STT)
participant AR as Agent Runtime
participant LLM as LLM (via Gateway)
participant EL as ElevenLabs (TTS)
participant DB as Database
U->>V: Initiates phone call
V->>DG: Stream audio for transcription
DG-->>AR: Real-time transcript
AR->>AR: Assemble prompt
AR->>LLM: Request completion
LLM-->>AR: Generate response
alt Tool call needed
AR->>AR: Execute tool (calendar, CRM, etc.)
AR->>LLM: Follow-up with tool result
LLM-->>AR: Final response
end
AR->>EL: Send text for synthesis
EL-->>V: Stream audio back
V-->>U: Hear AI response
AR->>DB: Log call data & update credits Knowledge Base Ingestion Flow¶
sequenceDiagram
participant U as User
participant C as Console
participant EF as Edge Function
participant S as Storage
participant E as Embedding Service
participant DB as PostgreSQL + pgvector
U->>C: Upload document
C->>S: Store raw file
C->>EF: Trigger processing
EF->>S: Download file
EF->>EF: Extract text (PDF/DOCX/TXT/CSV/MD)
EF->>EF: Chunk text (~500 tokens per chunk)
EF->>E: Batch embed chunks
E-->>EF: Return vectors
EF->>DB: Store chunks + embeddings + metadata
EF->>C: Update status to "Ready"
C-->>U: Document available for retrieval Security Architecture¶
Network Security¶
| Layer | Protection |
|---|---|
| Transport | All data in transit is encrypted using TLS 1.3 |
| API Gateway | Rate limiting, DDoS protection, and IP allowlisting |
| VPC | Core services run within a private Virtual Private Cloud |
| Firewall | Network-level firewall rules restrict access to authorized services only |
| DNS | DNSSEC-enabled domains prevent DNS spoofing attacks |
Data Security¶
| Concern | Implementation |
|---|---|
| Encryption at Rest | All databases and storage buckets use AES-256 encryption |
| Encryption in Transit | TLS 1.3 for all internal and external communications |
| Zero Data Retention | LLM providers are contractually bound to not store or train on customer data |
| Data Isolation | Each customer’s data is logically isolated using row-level security (RLS) policies in PostgreSQL |
| PII Scrubbing | Sensitive information can be automatically redacted before sending to LLMs |
| Audit Logging | All data access is logged with full audit trails |
Authentication & Authorization¶
| Component | Method |
|---|---|
| Console Access | Supabase Auth with MFA support |
| API Access | API keys with scoped permissions |
| WhatsApp Webhooks | Meta’s webhook signature verification |
| Edge Functions | JWT-based authentication for all function invocations |
| Row-Level Security | PostgreSQL RLS policies ensure data isolation between tenants |
Compliance¶
| Standard | Status |
|---|---|
| GDPR | Compliant — data processing agreements, right to erasure, data portability |
| EU AI Act | Compliant — transparency obligations, risk assessments, human oversight |
| SOC 2 | In progress — security, availability, and confidentiality controls |
| HIPAA | Available via custom configurations for healthcare clients |
| PCI-DSS | Available via custom configurations for payment processing |
Scalability & Performance¶
Horizontal Scaling¶
The platform is designed to scale horizontally across all components:
- Stateless API Servers — API gateway instances can be added or removed based on load
- Edge Functions — Supabase Edge Functions auto-scale based on request volume
- Database — PostgreSQL read replicas handle increased query load; write operations are optimized with connection pooling via PgBouncer
- Cache — Redis Cluster provides distributed caching with automatic sharding
Performance Targets¶
| Metric | Target | Measurement |
|---|---|---|
| Text Response Latency | < 2 seconds | Time from message receipt to response delivery |
| Voice Response Latency | ~1.2 seconds | Total round-trip: STT + LLM + TTS |
| Knowledge Base Retrieval | < 200ms | Time for semantic search across 10,000+ chunks |
| API Availability | 99.9% | Monthly uptime SLA |
| Concurrent Agents | 1,000+ | Simultaneous active agent instances |
| Messages Per Second | 500+ | Throughput capacity for text interactions |
Latency Breakdown (Voice)¶
| Component | Typical Latency | Technology |
|---|---|---|
| Speech-to-Text | ~300ms | Deepgram streaming transcription |
| LLM Processing | ~500ms | GPT-4o-mini via Private AI Gateway |
| Text-to-Speech | ~400ms | ElevenLabs low-latency streaming |
| Network Overhead | ~50ms | Internal routing and protocol overhead |
| Total | ~1,250ms | End-to-end voice response time |
Monitoring & Observability¶
Logging¶
All platform components emit structured JSON logs:
- Application Logs — Agent interactions, workflow executions, error traces
- Infrastructure Logs — Edge function invocations, database queries, cache hits/misses
- Security Logs — Authentication events, authorization failures, suspicious activity
Metrics¶
Key performance indicators are tracked in real-time:
- Agent Performance — Response times, confidence scores, escalation rates
- Credit Consumption — Per-agent, per-channel, per-customer credit usage
- System Health — CPU, memory, disk, network utilization across all services
- LLM Provider Health — Latency, error rates, and availability per provider
Alerting¶
Automated alerts notify the operations team when:
- Error rates exceed thresholds
- Response latencies degrade
- Credit consumption spikes unexpectedly
- LLM providers experience outages
- Security events are detected
What’s Next¶
- LLM Models — Detailed breakdown of the language models powering the platform
- The Human Team — Meet the team behind the human-in-the-loop orchestration
- Knowledge Base (RAG) — Configure domain-specific knowledge for your agents
- Voice AI & Campaigns — Deploy voice agents with phone integration
AĂşn con dudas? Pregunta en Discord o explore tutoriales