Cloud & LLM Architecture¶

4Geeks AI Agents is built on a modern, cloud-native architecture designed for high availability, low latency, and enterprise-grade security. This document provides a comprehensive technical overview of the platform’s infrastructure, data flow, and system design.

High-Level Architecture¶

The 4Geeks AI Agents platform is composed of five core layers that work together to deliver intelligent, automated business processes:

graph TB
    subgraph Client["Client Layer"]
        WA[WhatsApp Business]
        PH[Phone / PSTN]
        WEB[Web / Console]
        API[Custom API Clients]
    end

    subgraph Channel["Channel Layer"]
        WAPI[WhatsApp Business API]
        VAPI[VAPI Telephony]
        REST[REST API Gateway]
    end

    subgraph Core["Core Orchestration Layer"]
        AGENT[Agent Runtime Engine]
        WORKFLOW[Workflow Engine]
        HITL[Human-in-the-Loop Orchestrator]
        PROMPT[Prompt Management System]
    end

    subgraph AI["AI / Intelligence Layer"]
        GATEWAY[Private AI Gateway]
        RAG[RAG Pipeline]
        EMBED[Embedding Service]
        WA[Website Analyzer]
    end

    subgraph Infra["Infrastructure Layer"]
        SUPA[Supabase Platform]
        EDGE[Edge Functions]
        STORAGE[Object Storage]
        DB[(PostgreSQL + pgvector)]
        QUEUE[Message Queue]
        CACHE[Redis Cache]
    end

    WA --> WAPI
    PH --> VAPI
    WEB --> REST
    API --> REST

    WAPI --> AGENT
    VAPI --> AGENT
    REST --> AGENT

    AGENT --> WORKFLOW
    AGENT --> HITL
    AGENT --> PROMPT
    AGENT --> GATEWAY
    AGENT --> RAG

    GATEWAY --> LLM1[OpenAI GPT-4o]
    GATEWAY --> LLM2[Anthropic Claude]
    GATEWAY --> LLM3[Google Gemini]
    GATEWAY --> LLM4[Meta Llama]
    GATEWAY --> LLM5[Groq LPU]

    RAG --> EMBED
    EMBED --> DB
    RAG --> DB
    WA --> GATEWAY

    AGENT --> EDGE
    WORKFLOW --> EDGE
    EDGE --> SUPA
    STORAGE --> SUPA
    DB --> SUPA

Layer-by-Layer Breakdown¶

1. Client Layer¶

The Client Layer represents all the external touchpoints through which end-users interact with AI Agents. This layer is protocol-agnostic, meaning the core platform does not care whether a message arrives via WhatsApp, a phone call, or a direct API call.

Channel	Protocol	Direction	Description
WhatsApp Business	HTTPS (Meta Cloud API)	Inbound + Outbound	Text messages, media, interactive buttons, and templates delivered through the official WhatsApp Business Platform
Phone / PSTN	SIP / WebRTC (via VAPI)	Inbound + Outbound	Voice calls with real-time speech-to-text and text-to-speech conversion
Web Console	HTTPS / WebSocket	Bidirectional	The 4Geeks dashboard for agent management, monitoring, and the playground testing environment
Custom API	REST (OpenAI-compatible)	Bidirectional	Programmatic access for custom integrations using a standard OpenAI-compatible API format

2. Channel Layer¶

The Channel Layer acts as a protocol adapter, normalizing incoming messages from different communication channels into a unified internal format. This abstraction allows the Core Orchestration Layer to process all messages identically, regardless of their origin.

WhatsApp Business API Adapter¶

The WhatsApp adapter connects to Meta’s Cloud API and handles:

Webhook Management — Receives real-time message delivery events from Meta
Message Normalization — Converts WhatsApp message formats (text, image, audio, document, interactive) into the internal unified message schema
Template Management — Manages pre-approved WhatsApp message templates for outbound proactive messages
Session Management — Tracks 24-hour conversation windows as required by Meta’s policies
Media Handling — Downloads and processes media files (images, audio, documents) sent via WhatsApp
Delivery Receipts — Tracks message delivery and read status

VAPI Telephony Adapter¶

The Voice API (VAPI) adapter manages all voice interactions:

SIP Trunking — Establishes and terminates phone calls via Session Initiation Protocol
Audio Streaming — Streams bidirectional audio between the caller and the AI processing pipeline
DTMF Detection — Captures touch-tone inputs from the caller’s keypad
Call Transfer — Routes calls to human agents when escalation is needed
Call Recording — Records calls for quality assurance and compliance (with consent)
Phone Number Management — Provisions and manages dedicated phone numbers across countries

3. Core Orchestration Layer¶

The Core Orchestration Layer is the brain of the platform. It receives normalized messages from the Channel Layer and coordinates all processing, decision-making, and response generation.

Agent Runtime Engine¶

The Agent Runtime Engine is responsible for executing the lifecycle of each agent interaction:

sequenceDiagram
    participant C as Channel
    participant A as Agent Runtime
    participant P as Prompt Manager
    participant R as RAG Pipeline
    participant G as AI Gateway
    participant W as Workflow Engine
    participant H as HITL Orchestrator

    C->>A: Incoming message (normalized)
    A->>A: Identify agent & load configuration
    A->>P: Retrieve system prompt + conversation history
    P-->>A: Full prompt context
    A->>R: Query knowledge base (if configured)
    R-->>A: Retrieved context chunks
    A->>A: Assemble final prompt (system + context + history + user message)
    A->>G: Send to LLM via Private AI Gateway
    G-->>A: LLM response
    A->>A: Parse response for tool calls
    alt Tool calls detected
        A->>W: Execute workflow/tool
        W-->>A: Tool execution result
        A->>G: Send follow-up with tool results
        G-->>A: Final LLM response
    end
    alt Confidence below threshold
        A->>H: Escalate to human
        H-->>A: Human resolution
        A->>A: Learn from resolution
    end
    A-->>C: Send response (normalized back to channel format)

Key responsibilities:

Agent Identification — Routes incoming messages to the correct agent based on channel, phone number, or conversation context
Context Assembly — Gathers all relevant context including system prompt, conversation history, knowledge base results, and user metadata
Multi-Turn Conversation Management — Maintains conversation state across multiple message exchanges with configurable memory windows
Tool Call Orchestration — Detects when the LLM wants to execute a tool (CRM lookup, calendar booking, etc.) and manages the execution pipeline
Confidence Scoring — Evaluates the agent’s confidence in its response and triggers human escalation when confidence falls below configurable thresholds
Rate Limiting — Enforces per-agent and per-user rate limits to prevent abuse and manage costs
Error Handling — Implements retry logic, fallback responses, and graceful degradation when services are unavailable

Workflow Engine¶

The Workflow Engine executes predefined business logic and tool integrations:

Trigger Evaluation — Evaluates event triggers (new message, scheduled time, external webhook) to initiate workflows
Action Execution — Runs actions such as CRM updates, calendar bookings, email sending, and webhook calls
Conditional Logic — Supports branching, loops, and conditional execution based on data values
State Management — Persists workflow state across long-running processes (e.g., multi-step approval flows)
Error Recovery — Automatically retries failed actions with exponential backoff
Execution Logging — Records every step of workflow execution for auditing and debugging

Human-in-the-Loop (HITL) Orchestrator¶

The HITL Orchestrator manages the interaction between AI agents and human team members:

Escalation Routing — Routes complex or low-confidence interactions to the appropriate human specialist
Queue Management — Maintains priority queues for human review based on urgency, customer tier, and wait time
Context Handoff — Provides human reviewers with full conversation context, AI reasoning, and suggested actions
Learning Loop — Captures human resolutions and feeds them back into the agent’s training pipeline
SLA Monitoring — Tracks escalation response times and alerts management when SLAs are at risk

Prompt Management System¶

The Prompt Management System handles all aspects of prompt engineering and versioning:

Prompt Versioning — Maintains version history for all system prompts with rollback capability
A/B Testing — Supports running multiple prompt variants simultaneously to measure performance
Dynamic Injection — Injects runtime variables (customer name, order details, appointment times) into prompts
Template Library — Provides pre-built prompt templates for common agent types (Sales, Support, Receptionist, etc.)
Website Analysis Integration — Auto-generates initial prompts by analyzing business websites using Groq AI compound models

4. AI / Intelligence Layer¶

The AI / Intelligence Layer provides the machine learning capabilities that power agent intelligence.

Private AI Gateway¶

The Private AI Gateway is the central routing and governance layer for all LLM interactions. It provides:

Unified API — A single OpenAI-compatible API endpoint that routes to 100+ models across multiple providers
Multi-Model Orchestration — Dynamically selects the optimal model for each request based on task type, latency requirements, and cost constraints
Automatic Fallbacks — If a provider experiences downtime, requests are automatically rerouted to an equivalent model
Load Balancing — Distributes requests across multiple API keys and endpoints to avoid rate limits
Real-Time Token Auditing — Tracks every token consumed with granular attribution to specific agents, workflows, and customers
Budget Enforcement — Enforces hard spending limits with automated alerts at 50%, 80%, and 100% thresholds
Zero Data Retention — Prioritizes providers that guarantee no data is stored or used for model training
Context Scrutiny — Applies guardrails to sanitize sensitive information before sending data to external LLMs

RAG Pipeline¶

The Retrieval-Augmented Generation (RAG) pipeline enables agents to answer questions using proprietary business documents. See the Knowledge Base documentation for full details.

Pipeline stages:

Ingestion — Documents are uploaded and text is extracted (supporting PDF, DOCX, TXT, CSV, MD)
Chunking — Text is split into semantic chunks of approximately 500 tokens each, preserving paragraph and section boundaries
Embedding — Each chunk is converted into a high-dimensional vector embedding using a dedicated embedding model
Indexing — Embeddings are stored in a pgvector-enabled PostgreSQL database with HNSW (Hierarchical Navigable Small World) indexes for fast approximate nearest-neighbor search
Retrieval — At query time, the user’s message is embedded and compared against stored vectors using cosine similarity
Re-ranking — Retrieved chunks are re-ranked using a cross-encoder model for improved relevance
Context Injection — The top-ranked chunks are injected into the LLM prompt as grounding context

Retrieval configuration:

Parameter	Default	Description
Top-K	5	Number of chunks to retrieve
Similarity Threshold	0.7	Minimum cosine similarity score
Max Context Length	3,000 tokens	Maximum total tokens from knowledge base
Chunk Size	~500 tokens	Target size for each text chunk
Chunk Overlap	50 tokens	Overlap between consecutive chunks

Embedding Service¶

The Embedding Service converts text into vector representations:

Model — Uses OpenAI text-embedding-3-small (1,536 dimensions) as the default embedding model
Batching — Processes multiple chunks in parallel for efficient document ingestion
Normalization — Applies L2 normalization to embeddings for consistent cosine similarity calculations
Versioning — Supports embedding model versioning to allow re-embedding when models are upgraded

Website Analyzer¶

The Website Analyzer uses Groq AI compound models to automatically generate system prompts:

Web Scraping — Custom scraper extracts text content from target websites
AI Analysis — Groq’s compound model analyzes content to identify business context, services, tone, and FAQs
Web Search — Supplementary web search gathers additional business context
Prompt Generation — A structured system prompt is generated incorporating all extracted information

5. Infrastructure Layer¶

The Infrastructure Layer provides the foundational compute, storage, and networking services.

Supabase Platform¶

4Geeks AI Agents leverages Supabase as the primary backend platform:

Component	Purpose
PostgreSQL	Primary relational database for agent configurations, conversation logs, user accounts, and workflow state
pgvector	Vector extension for PostgreSQL enabling semantic search over knowledge base embeddings
Supabase Storage	Object storage for uploaded knowledge base files, call recordings, and media attachments
Supabase Auth	Authentication and authorization for the console dashboard and API access
Edge Functions	Serverless Deno-based functions for real-time processing (VAPI tool handling, knowledge base processing, webhook handling)
Realtime	WebSocket-based real-time subscriptions for live dashboard updates

Edge Functions¶

The platform runs several critical Deno-based edge functions:

Function	Purpose
vapi-tool-handler	Processes tool calls from voice agents during phone calls (calendar booking, CRM lookups, call transfers)
process-knowledge	Handles document ingestion pipeline: text extraction, chunking, embedding, and storage
whatsapp-webhook	Receives and processes incoming WhatsApp messages from Meta’s Cloud API
workflow-trigger	Evaluates and executes workflow triggers based on incoming events
website-analyzer	Scrapes websites and generates system prompts using Groq AI

Message Queue¶

An internal message queue decouples components and ensures reliable processing:

Asynchronous Processing — Long-running tasks (document processing, campaign calls) are queued for background execution
Retry Logic — Failed messages are automatically retried with configurable retry policies
Dead Letter Queue — Messages that fail after maximum retries are routed to a dead letter queue for manual inspection
Ordering Guarantees — Conversation messages are processed in order to maintain context integrity

Redis Cache¶

Redis provides low-latency caching for frequently accessed data:

Session State — Stores active conversation sessions for fast context retrieval
Rate Limiting — Implements sliding-window rate limits for API and agent access
Configuration Cache — Caches agent configurations to reduce database load
Pub/Sub — Enables real-time communication between distributed components

Data Flow¶

Text Message Flow (WhatsApp / API)¶

sequenceDiagram
    participant U as End User
    participant WA as WhatsApp / API
    participant GW as API Gateway
    participant AR as Agent Runtime
    participant KB as Knowledge Base
    participant LLM as LLM (via Gateway)
    participant DB as Database

    U->>WA: Sends text message
    WA->>GW: Webhook / API request
    GW->>GW: Authenticate & rate limit
    GW->>AR: Forward normalized message
    AR->>DB: Load conversation history
    AR->>KB: Semantic search for relevant context
    KB-->>AR: Return top-K chunks
    AR->>AR: Assemble prompt (system + context + history + message)
    AR->>LLM: Request completion
    LLM-->>AR: Generate response
    AR->>DB: Save interaction & update credits
    AR-->>GW: Return response
    GW-->>WA: Deliver response
    WA-->>U: Display response

Voice Call Flow (Phone)¶

sequenceDiagram
    participant U as Caller
    participant V as VAPI (Telephony)
    participant DG as Deepgram (STT)
    participant AR as Agent Runtime
    participant LLM as LLM (via Gateway)
    participant EL as ElevenLabs (TTS)
    participant DB as Database

    U->>V: Initiates phone call
    V->>DG: Stream audio for transcription
    DG-->>AR: Real-time transcript
    AR->>AR: Assemble prompt
    AR->>LLM: Request completion
    LLM-->>AR: Generate response
    alt Tool call needed
        AR->>AR: Execute tool (calendar, CRM, etc.)
        AR->>LLM: Follow-up with tool result
        LLM-->>AR: Final response
    end
    AR->>EL: Send text for synthesis
    EL-->>V: Stream audio back
    V-->>U: Hear AI response
    AR->>DB: Log call data & update credits

Knowledge Base Ingestion Flow¶

sequenceDiagram
    participant U as User
    participant C as Console
    participant EF as Edge Function
    participant S as Storage
    participant E as Embedding Service
    participant DB as PostgreSQL + pgvector

    U->>C: Upload document
    C->>S: Store raw file
    C->>EF: Trigger processing
    EF->>S: Download file
    EF->>EF: Extract text (PDF/DOCX/TXT/CSV/MD)
    EF->>EF: Chunk text (~500 tokens per chunk)
    EF->>E: Batch embed chunks
    E-->>EF: Return vectors
    EF->>DB: Store chunks + embeddings + metadata
    EF->>C: Update status to "Ready"
    C-->>U: Document available for retrieval

Security Architecture¶

Network Security¶

Layer	Protection
Transport	All data in transit is encrypted using TLS 1.3
API Gateway	Rate limiting, DDoS protection, and IP allowlisting
VPC	Core services run within a private Virtual Private Cloud
Firewall	Network-level firewall rules restrict access to authorized services only
DNS	DNSSEC-enabled domains prevent DNS spoofing attacks

Data Security¶

Concern	Implementation
Encryption at Rest	All databases and storage buckets use AES-256 encryption
Encryption in Transit	TLS 1.3 for all internal and external communications
Zero Data Retention	LLM providers are contractually bound to not store or train on customer data
Data Isolation	Each customer’s data is logically isolated using row-level security (RLS) policies in PostgreSQL
PII Scrubbing	Sensitive information can be automatically redacted before sending to LLMs
Audit Logging	All data access is logged with full audit trails

Authentication & Authorization¶

Component	Method
Console Access	Supabase Auth with MFA support
API Access	API keys with scoped permissions
WhatsApp Webhooks	Meta’s webhook signature verification
Edge Functions	JWT-based authentication for all function invocations
Row-Level Security	PostgreSQL RLS policies ensure data isolation between tenants

Compliance¶

Standard	Status
GDPR	Compliant — data processing agreements, right to erasure, data portability
EU AI Act	Compliant — transparency obligations, risk assessments, human oversight
SOC 2	In progress — security, availability, and confidentiality controls
HIPAA	Available via custom configurations for healthcare clients
PCI-DSS	Available via custom configurations for payment processing

Scalability & Performance¶

Horizontal Scaling¶

The platform is designed to scale horizontally across all components:

Stateless API Servers — API gateway instances can be added or removed based on load
Edge Functions — Supabase Edge Functions auto-scale based on request volume
Database — PostgreSQL read replicas handle increased query load; write operations are optimized with connection pooling via PgBouncer
Cache — Redis Cluster provides distributed caching with automatic sharding

Performance Targets¶

Metric	Target	Measurement
Text Response Latency	< 2 seconds	Time from message receipt to response delivery
Voice Response Latency	~1.2 seconds	Total round-trip: STT + LLM + TTS
Knowledge Base Retrieval	< 200ms	Time for semantic search across 10,000+ chunks
API Availability	99.9%	Monthly uptime SLA
Concurrent Agents	1,000+	Simultaneous active agent instances
Messages Per Second	500+	Throughput capacity for text interactions

Latency Breakdown (Voice)¶

Component	Typical Latency	Technology
Speech-to-Text	~300ms	Deepgram streaming transcription
LLM Processing	~500ms	GPT-4o-mini via Private AI Gateway
Text-to-Speech	~400ms	ElevenLabs low-latency streaming
Network Overhead	~50ms	Internal routing and protocol overhead
Total	~1,250ms	End-to-end voice response time

Monitoring & Observability¶

Logging¶

All platform components emit structured JSON logs:

Application Logs — Agent interactions, workflow executions, error traces
Infrastructure Logs — Edge function invocations, database queries, cache hits/misses
Security Logs — Authentication events, authorization failures, suspicious activity

Metrics¶

Key performance indicators are tracked in real-time:

Agent Performance — Response times, confidence scores, escalation rates
Credit Consumption — Per-agent, per-channel, per-customer credit usage
System Health — CPU, memory, disk, network utilization across all services
LLM Provider Health — Latency, error rates, and availability per provider

Alerting¶

Automated alerts notify the operations team when:

Error rates exceed thresholds
Response latencies degrade
Credit consumption spikes unexpectedly
LLM providers experience outages
Security events are detected

What’s Next¶

LLM Models — Detailed breakdown of the language models powering the platform
The Human Team — Meet the team behind the human-in-the-loop orchestration
Knowledge Base (RAG) — Configure domain-specific knowledge for your agents
Voice AI & Campaigns — Deploy voice agents with phone integration

Aún con dudas? Pregunta en Discord o explore tutoriales