Saltar a contenido

🤖 Explicar con IA

Cloud & LLM Architecture

4Geeks AI Agents is built on a modern, cloud-native architecture designed for high availability, low latency, and enterprise-grade security. This document provides a comprehensive technical overview of the platform’s infrastructure, data flow, and system design.

High-Level Architecture

The 4Geeks AI Agents platform is composed of five core layers that work together to deliver intelligent, automated business processes:

graph TB
    subgraph Client["Client Layer"]
        WA[WhatsApp Business]
        PH[Phone / PSTN]
        WEB[Web / Console]
        API[Custom API Clients]
    end

    subgraph Channel["Channel Layer"]
        WAPI[WhatsApp Business API]
        VAPI[VAPI Telephony]
        REST[REST API Gateway]
    end

    subgraph Core["Core Orchestration Layer"]
        AGENT[Agent Runtime Engine]
        WORKFLOW[Workflow Engine]
        HITL[Human-in-the-Loop Orchestrator]
        PROMPT[Prompt Management System]
    end

    subgraph AI["AI / Intelligence Layer"]
        GATEWAY[Private AI Gateway]
        RAG[RAG Pipeline]
        EMBED[Embedding Service]
        WA[Website Analyzer]
    end

    subgraph Infra["Infrastructure Layer"]
        SUPA[Supabase Platform]
        EDGE[Edge Functions]
        STORAGE[Object Storage]
        DB[(PostgreSQL + pgvector)]
        QUEUE[Message Queue]
        CACHE[Redis Cache]
    end

    WA --> WAPI
    PH --> VAPI
    WEB --> REST
    API --> REST

    WAPI --> AGENT
    VAPI --> AGENT
    REST --> AGENT

    AGENT --> WORKFLOW
    AGENT --> HITL
    AGENT --> PROMPT
    AGENT --> GATEWAY
    AGENT --> RAG

    GATEWAY --> LLM1[OpenAI GPT-4o]
    GATEWAY --> LLM2[Anthropic Claude]
    GATEWAY --> LLM3[Google Gemini]
    GATEWAY --> LLM4[Meta Llama]
    GATEWAY --> LLM5[Groq LPU]

    RAG --> EMBED
    EMBED --> DB
    RAG --> DB
    WA --> GATEWAY

    AGENT --> EDGE
    WORKFLOW --> EDGE
    EDGE --> SUPA
    STORAGE --> SUPA
    DB --> SUPA

Layer-by-Layer Breakdown

1. Client Layer

The Client Layer represents all the external touchpoints through which end-users interact with AI Agents. This layer is protocol-agnostic, meaning the core platform does not care whether a message arrives via WhatsApp, a phone call, or a direct API call.

Channel Protocol Direction Description
WhatsApp Business HTTPS (Meta Cloud API) Inbound + Outbound Text messages, media, interactive buttons, and templates delivered through the official WhatsApp Business Platform
Phone / PSTN SIP / WebRTC (via VAPI) Inbound + Outbound Voice calls with real-time speech-to-text and text-to-speech conversion
Web Console HTTPS / WebSocket Bidirectional The 4Geeks dashboard for agent management, monitoring, and the playground testing environment
Custom API REST (OpenAI-compatible) Bidirectional Programmatic access for custom integrations using a standard OpenAI-compatible API format

2. Channel Layer

The Channel Layer acts as a protocol adapter, normalizing incoming messages from different communication channels into a unified internal format. This abstraction allows the Core Orchestration Layer to process all messages identically, regardless of their origin.

WhatsApp Business API Adapter

The WhatsApp adapter connects to Meta’s Cloud API and handles:

  • Webhook Management — Receives real-time message delivery events from Meta
  • Message Normalization — Converts WhatsApp message formats (text, image, audio, document, interactive) into the internal unified message schema
  • Template Management — Manages pre-approved WhatsApp message templates for outbound proactive messages
  • Session Management — Tracks 24-hour conversation windows as required by Meta’s policies
  • Media Handling — Downloads and processes media files (images, audio, documents) sent via WhatsApp
  • Delivery Receipts — Tracks message delivery and read status

VAPI Telephony Adapter

The Voice API (VAPI) adapter manages all voice interactions:

  • SIP Trunking — Establishes and terminates phone calls via Session Initiation Protocol
  • Audio Streaming — Streams bidirectional audio between the caller and the AI processing pipeline
  • DTMF Detection — Captures touch-tone inputs from the caller’s keypad
  • Call Transfer — Routes calls to human agents when escalation is needed
  • Call Recording — Records calls for quality assurance and compliance (with consent)
  • Phone Number Management — Provisions and manages dedicated phone numbers across countries

3. Core Orchestration Layer

The Core Orchestration Layer is the brain of the platform. It receives normalized messages from the Channel Layer and coordinates all processing, decision-making, and response generation.

Agent Runtime Engine

The Agent Runtime Engine is responsible for executing the lifecycle of each agent interaction:

sequenceDiagram
    participant C as Channel
    participant A as Agent Runtime
    participant P as Prompt Manager
    participant R as RAG Pipeline
    participant G as AI Gateway
    participant W as Workflow Engine
    participant H as HITL Orchestrator

    C->>A: Incoming message (normalized)
    A->>A: Identify agent & load configuration
    A->>P: Retrieve system prompt + conversation history
    P-->>A: Full prompt context
    A->>R: Query knowledge base (if configured)
    R-->>A: Retrieved context chunks
    A->>A: Assemble final prompt (system + context + history + user message)
    A->>G: Send to LLM via Private AI Gateway
    G-->>A: LLM response
    A->>A: Parse response for tool calls
    alt Tool calls detected
        A->>W: Execute workflow/tool
        W-->>A: Tool execution result
        A->>G: Send follow-up with tool results
        G-->>A: Final LLM response
    end
    alt Confidence below threshold
        A->>H: Escalate to human
        H-->>A: Human resolution
        A->>A: Learn from resolution
    end
    A-->>C: Send response (normalized back to channel format)

Key responsibilities:

  • Agent Identification — Routes incoming messages to the correct agent based on channel, phone number, or conversation context
  • Context Assembly — Gathers all relevant context including system prompt, conversation history, knowledge base results, and user metadata
  • Multi-Turn Conversation Management — Maintains conversation state across multiple message exchanges with configurable memory windows
  • Tool Call Orchestration — Detects when the LLM wants to execute a tool (CRM lookup, calendar booking, etc.) and manages the execution pipeline
  • Confidence Scoring — Evaluates the agent’s confidence in its response and triggers human escalation when confidence falls below configurable thresholds
  • Rate Limiting — Enforces per-agent and per-user rate limits to prevent abuse and manage costs
  • Error Handling — Implements retry logic, fallback responses, and graceful degradation when services are unavailable

Workflow Engine

The Workflow Engine executes predefined business logic and tool integrations:

  • Trigger Evaluation — Evaluates event triggers (new message, scheduled time, external webhook) to initiate workflows
  • Action Execution — Runs actions such as CRM updates, calendar bookings, email sending, and webhook calls
  • Conditional Logic — Supports branching, loops, and conditional execution based on data values
  • State Management — Persists workflow state across long-running processes (e.g., multi-step approval flows)
  • Error Recovery — Automatically retries failed actions with exponential backoff
  • Execution Logging — Records every step of workflow execution for auditing and debugging

Human-in-the-Loop (HITL) Orchestrator

The HITL Orchestrator manages the interaction between AI agents and human team members:

  • Escalation Routing — Routes complex or low-confidence interactions to the appropriate human specialist
  • Queue Management — Maintains priority queues for human review based on urgency, customer tier, and wait time
  • Context Handoff — Provides human reviewers with full conversation context, AI reasoning, and suggested actions
  • Learning Loop — Captures human resolutions and feeds them back into the agent’s training pipeline
  • SLA Monitoring — Tracks escalation response times and alerts management when SLAs are at risk

Prompt Management System

The Prompt Management System handles all aspects of prompt engineering and versioning:

  • Prompt Versioning — Maintains version history for all system prompts with rollback capability
  • A/B Testing — Supports running multiple prompt variants simultaneously to measure performance
  • Dynamic Injection — Injects runtime variables (customer name, order details, appointment times) into prompts
  • Template Library — Provides pre-built prompt templates for common agent types (Sales, Support, Receptionist, etc.)
  • Website Analysis Integration — Auto-generates initial prompts by analyzing business websites using Groq AI compound models

4. AI / Intelligence Layer

The AI / Intelligence Layer provides the machine learning capabilities that power agent intelligence.

Private AI Gateway

The Private AI Gateway is the central routing and governance layer for all LLM interactions. It provides:

  • Unified API — A single OpenAI-compatible API endpoint that routes to 100+ models across multiple providers
  • Multi-Model Orchestration — Dynamically selects the optimal model for each request based on task type, latency requirements, and cost constraints
  • Automatic Fallbacks — If a provider experiences downtime, requests are automatically rerouted to an equivalent model
  • Load Balancing — Distributes requests across multiple API keys and endpoints to avoid rate limits
  • Real-Time Token Auditing — Tracks every token consumed with granular attribution to specific agents, workflows, and customers
  • Budget Enforcement — Enforces hard spending limits with automated alerts at 50%, 80%, and 100% thresholds
  • Zero Data Retention — Prioritizes providers that guarantee no data is stored or used for model training
  • Context Scrutiny — Applies guardrails to sanitize sensitive information before sending data to external LLMs

RAG Pipeline

The Retrieval-Augmented Generation (RAG) pipeline enables agents to answer questions using proprietary business documents. See the Knowledge Base documentation for full details.

Pipeline stages:

  1. Ingestion — Documents are uploaded and text is extracted (supporting PDF, DOCX, TXT, CSV, MD)
  2. Chunking — Text is split into semantic chunks of approximately 500 tokens each, preserving paragraph and section boundaries
  3. Embedding — Each chunk is converted into a high-dimensional vector embedding using a dedicated embedding model
  4. Indexing — Embeddings are stored in a pgvector-enabled PostgreSQL database with HNSW (Hierarchical Navigable Small World) indexes for fast approximate nearest-neighbor search
  5. Retrieval — At query time, the user’s message is embedded and compared against stored vectors using cosine similarity
  6. Re-ranking — Retrieved chunks are re-ranked using a cross-encoder model for improved relevance
  7. Context Injection — The top-ranked chunks are injected into the LLM prompt as grounding context

Retrieval configuration:

Parameter Default Description
Top-K 5 Number of chunks to retrieve
Similarity Threshold 0.7 Minimum cosine similarity score
Max Context Length 3,000 tokens Maximum total tokens from knowledge base
Chunk Size ~500 tokens Target size for each text chunk
Chunk Overlap 50 tokens Overlap between consecutive chunks

Embedding Service

The Embedding Service converts text into vector representations:

  • Model — Uses OpenAI text-embedding-3-small (1,536 dimensions) as the default embedding model
  • Batching — Processes multiple chunks in parallel for efficient document ingestion
  • Normalization — Applies L2 normalization to embeddings for consistent cosine similarity calculations
  • Versioning — Supports embedding model versioning to allow re-embedding when models are upgraded

Website Analyzer

The Website Analyzer uses Groq AI compound models to automatically generate system prompts:

  • Web Scraping — Custom scraper extracts text content from target websites
  • AI Analysis — Groq’s compound model analyzes content to identify business context, services, tone, and FAQs
  • Web Search — Supplementary web search gathers additional business context
  • Prompt Generation — A structured system prompt is generated incorporating all extracted information

5. Infrastructure Layer

The Infrastructure Layer provides the foundational compute, storage, and networking services.

Supabase Platform

4Geeks AI Agents leverages Supabase as the primary backend platform:

Component Purpose
PostgreSQL Primary relational database for agent configurations, conversation logs, user accounts, and workflow state
pgvector Vector extension for PostgreSQL enabling semantic search over knowledge base embeddings
Supabase Storage Object storage for uploaded knowledge base files, call recordings, and media attachments
Supabase Auth Authentication and authorization for the console dashboard and API access
Edge Functions Serverless Deno-based functions for real-time processing (VAPI tool handling, knowledge base processing, webhook handling)
Realtime WebSocket-based real-time subscriptions for live dashboard updates

Edge Functions

The platform runs several critical Deno-based edge functions:

Function Purpose
vapi-tool-handler Processes tool calls from voice agents during phone calls (calendar booking, CRM lookups, call transfers)
process-knowledge Handles document ingestion pipeline: text extraction, chunking, embedding, and storage
whatsapp-webhook Receives and processes incoming WhatsApp messages from Meta’s Cloud API
workflow-trigger Evaluates and executes workflow triggers based on incoming events
website-analyzer Scrapes websites and generates system prompts using Groq AI

Message Queue

An internal message queue decouples components and ensures reliable processing:

  • Asynchronous Processing — Long-running tasks (document processing, campaign calls) are queued for background execution
  • Retry Logic — Failed messages are automatically retried with configurable retry policies
  • Dead Letter Queue — Messages that fail after maximum retries are routed to a dead letter queue for manual inspection
  • Ordering Guarantees — Conversation messages are processed in order to maintain context integrity

Redis Cache

Redis provides low-latency caching for frequently accessed data:

  • Session State — Stores active conversation sessions for fast context retrieval
  • Rate Limiting — Implements sliding-window rate limits for API and agent access
  • Configuration Cache — Caches agent configurations to reduce database load
  • Pub/Sub — Enables real-time communication between distributed components

Data Flow

Text Message Flow (WhatsApp / API)

sequenceDiagram
    participant U as End User
    participant WA as WhatsApp / API
    participant GW as API Gateway
    participant AR as Agent Runtime
    participant KB as Knowledge Base
    participant LLM as LLM (via Gateway)
    participant DB as Database

    U->>WA: Sends text message
    WA->>GW: Webhook / API request
    GW->>GW: Authenticate & rate limit
    GW->>AR: Forward normalized message
    AR->>DB: Load conversation history
    AR->>KB: Semantic search for relevant context
    KB-->>AR: Return top-K chunks
    AR->>AR: Assemble prompt (system + context + history + message)
    AR->>LLM: Request completion
    LLM-->>AR: Generate response
    AR->>DB: Save interaction & update credits
    AR-->>GW: Return response
    GW-->>WA: Deliver response
    WA-->>U: Display response

Voice Call Flow (Phone)

sequenceDiagram
    participant U as Caller
    participant V as VAPI (Telephony)
    participant DG as Deepgram (STT)
    participant AR as Agent Runtime
    participant LLM as LLM (via Gateway)
    participant EL as ElevenLabs (TTS)
    participant DB as Database

    U->>V: Initiates phone call
    V->>DG: Stream audio for transcription
    DG-->>AR: Real-time transcript
    AR->>AR: Assemble prompt
    AR->>LLM: Request completion
    LLM-->>AR: Generate response
    alt Tool call needed
        AR->>AR: Execute tool (calendar, CRM, etc.)
        AR->>LLM: Follow-up with tool result
        LLM-->>AR: Final response
    end
    AR->>EL: Send text for synthesis
    EL-->>V: Stream audio back
    V-->>U: Hear AI response
    AR->>DB: Log call data & update credits

Knowledge Base Ingestion Flow

sequenceDiagram
    participant U as User
    participant C as Console
    participant EF as Edge Function
    participant S as Storage
    participant E as Embedding Service
    participant DB as PostgreSQL + pgvector

    U->>C: Upload document
    C->>S: Store raw file
    C->>EF: Trigger processing
    EF->>S: Download file
    EF->>EF: Extract text (PDF/DOCX/TXT/CSV/MD)
    EF->>EF: Chunk text (~500 tokens per chunk)
    EF->>E: Batch embed chunks
    E-->>EF: Return vectors
    EF->>DB: Store chunks + embeddings + metadata
    EF->>C: Update status to "Ready"
    C-->>U: Document available for retrieval

Security Architecture

Network Security

Layer Protection
Transport All data in transit is encrypted using TLS 1.3
API Gateway Rate limiting, DDoS protection, and IP allowlisting
VPC Core services run within a private Virtual Private Cloud
Firewall Network-level firewall rules restrict access to authorized services only
DNS DNSSEC-enabled domains prevent DNS spoofing attacks

Data Security

Concern Implementation
Encryption at Rest All databases and storage buckets use AES-256 encryption
Encryption in Transit TLS 1.3 for all internal and external communications
Zero Data Retention LLM providers are contractually bound to not store or train on customer data
Data Isolation Each customer’s data is logically isolated using row-level security (RLS) policies in PostgreSQL
PII Scrubbing Sensitive information can be automatically redacted before sending to LLMs
Audit Logging All data access is logged with full audit trails

Authentication & Authorization

Component Method
Console Access Supabase Auth with MFA support
API Access API keys with scoped permissions
WhatsApp Webhooks Meta’s webhook signature verification
Edge Functions JWT-based authentication for all function invocations
Row-Level Security PostgreSQL RLS policies ensure data isolation between tenants

Compliance

Standard Status
GDPR Compliant — data processing agreements, right to erasure, data portability
EU AI Act Compliant — transparency obligations, risk assessments, human oversight
SOC 2 In progress — security, availability, and confidentiality controls
HIPAA Available via custom configurations for healthcare clients
PCI-DSS Available via custom configurations for payment processing

Scalability & Performance

Horizontal Scaling

The platform is designed to scale horizontally across all components:

  • Stateless API Servers — API gateway instances can be added or removed based on load
  • Edge Functions — Supabase Edge Functions auto-scale based on request volume
  • Database — PostgreSQL read replicas handle increased query load; write operations are optimized with connection pooling via PgBouncer
  • Cache — Redis Cluster provides distributed caching with automatic sharding

Performance Targets

Metric Target Measurement
Text Response Latency < 2 seconds Time from message receipt to response delivery
Voice Response Latency ~1.2 seconds Total round-trip: STT + LLM + TTS
Knowledge Base Retrieval < 200ms Time for semantic search across 10,000+ chunks
API Availability 99.9% Monthly uptime SLA
Concurrent Agents 1,000+ Simultaneous active agent instances
Messages Per Second 500+ Throughput capacity for text interactions

Latency Breakdown (Voice)

Component Typical Latency Technology
Speech-to-Text ~300ms Deepgram streaming transcription
LLM Processing ~500ms GPT-4o-mini via Private AI Gateway
Text-to-Speech ~400ms ElevenLabs low-latency streaming
Network Overhead ~50ms Internal routing and protocol overhead
Total ~1,250ms End-to-end voice response time

Monitoring & Observability

Logging

All platform components emit structured JSON logs:

  • Application Logs — Agent interactions, workflow executions, error traces
  • Infrastructure Logs — Edge function invocations, database queries, cache hits/misses
  • Security Logs — Authentication events, authorization failures, suspicious activity

Metrics

Key performance indicators are tracked in real-time:

  • Agent Performance — Response times, confidence scores, escalation rates
  • Credit Consumption — Per-agent, per-channel, per-customer credit usage
  • System Health — CPU, memory, disk, network utilization across all services
  • LLM Provider Health — Latency, error rates, and availability per provider

Alerting

Automated alerts notify the operations team when:

  • Error rates exceed thresholds
  • Response latencies degrade
  • Credit consumption spikes unexpectedly
  • LLM providers experience outages
  • Security events are detected

What’s Next

  • LLM Models — Detailed breakdown of the language models powering the platform
  • The Human Team — Meet the team behind the human-in-the-loop orchestration
  • Knowledge Base (RAG) — Configure domain-specific knowledge for your agents
  • Voice AI & Campaigns — Deploy voice agents with phone integration

AĂşn con dudas? Pregunta en Discord o explore tutoriales