LLM Models

4Geeks AI Agents uses a multi-model architecture through the Private AI Gateway, which dynamically selects a Large Language Model for each interaction. This document breaks down every available model, how each one is used, and how the platform decides which model to use for a given task.

Multi-Model Orchestration

The 4Geeks Private AI Gateway does not rely on a single LLM provider. Instead, it orchestrates across multiple providers to ensure the best combination of quality, speed, and cost for each interaction.

graph LR
    subgraph Request["Incoming Request"]
        MSG[User Message]
        CTX[Context Type]
        CFG[Agent Config]
    end

    subgraph Router["Gateway Router"]
        AN[Analyzer]
        SEL[Model Selector]
        LB[Load Balancer]
        FB[Fallback Handler]
    end

    subgraph Providers["LLM Providers"]
        OA[OpenAI]
        ANTH[Anthropic]
        GOOG[Google]
        META[Meta / Llama]
        GR[Groq]
    end

    MSG --> AN
    CTX --> AN
    CFG --> AN
    AN --> SEL
    SEL --> LB
    LB --> OA
    LB --> ANTH
    LB --> GOOG
    LB --> META
    LB --> GR
    OA --> FB
    ANTH --> FB
    GOOG --> FB
    META --> FB
    GR --> FB

How Model Selection Works

For every request, the Gateway evaluates several factors:

| Factor | Description |
| --- | --- |
| Task Type | Text conversation, voice response, website analysis, embedding, or tool execution |
| Latency Requirement | Voice interactions require ultra-low latency; text can tolerate slightly longer processing |
| Complexity | Simple FAQs use lightweight models; complex reasoning tasks use frontier models |
| Agent Configuration | Each agent can specify preferred models or allow dynamic selection |
| Cost Optimization | The gateway prefers cost-effective models when quality requirements are met |
| Provider Health | If a provider is experiencing issues, traffic is automatically rerouted |
| Rate Limits | Requests are distributed across API keys to avoid hitting provider rate limits |
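The factors above can be pictured as a small selection function. The following is a minimal sketch, not the Gateway's actual logic: the `Request` fields, the `PROVIDER` map, and `select_model` are hypothetical names, and the candidate lists are taken from the feature table later in this document.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Candidate models per task, in order of preference (from this document's tables).
CANDIDATES = {
    "text": ["GPT-4o-mini", "Claude 3.5 Haiku", "Gemini 2.0 Flash"],
    "voice": ["GPT-4o-mini (Voice)", "Llama 3.3 70B via Groq"],
    "complex": ["GPT-4o", "Claude 3.5 Sonnet"],
}

# Hypothetical model-to-provider map, used only for the health check.
PROVIDER = {
    "GPT-4o": "openai",
    "GPT-4o-mini": "openai",
    "GPT-4o-mini (Voice)": "openai",
    "Claude 3.5 Sonnet": "anthropic",
    "Claude 3.5 Haiku": "anthropic",
    "Gemini 2.0 Flash": "google",
    "Llama 3.3 70B via Groq": "groq",
}


@dataclass
class Request:
    task_type: str                          # "text" or "voice" in this sketch
    needs_deep_reasoning: bool = False      # complexity factor
    preferred_model: Optional[str] = None   # agent configuration override


def select_model(req: Request, unhealthy_providers: FrozenSet[str] = frozenset()) -> str:
    """Pick the first candidate whose provider is currently healthy."""
    if req.task_type == "voice":
        key = "voice"                       # latency requirement dominates
    elif req.needs_deep_reasoning:
        key = "complex"
    else:
        key = "text"

    candidates = list(CANDIDATES[key])
    if req.preferred_model in PROVIDER:     # agent preference is tried first
        candidates.insert(0, req.preferred_model)

    for model in candidates:
        if PROVIDER[model] not in unhealthy_providers:
            return model
    raise RuntimeError("no healthy provider for this request")
```

For example, `select_model(Request("text"), frozenset({"openai"}))` skips the OpenAI candidate and returns the next healthy one. Cost optimization and rate limits are left out to keep the sketch short.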

Automatic Fallbacks

If the primary model or provider is unavailable, the Gateway automatically falls back to an equivalent model:

graph TD
    A[Primary Model Request] --> B{Success?}
    B -->|Yes| C[Return Response]
    B -->|No - Timeout| D[Try Secondary Model]
    B -->|No - Error| E[Try Alternate Provider]
    D --> F{Success?}
    F -->|Yes| C
    F -->|No| G[Try Tertiary Model]
    E --> H{Success?}
    H -->|Yes| C
    H -->|No| G
    G --> I{Success?}
    I -->|Yes| C
    I -->|No| J[Return Graceful Error]

Fallback chain example for a text agent:

  1. Primary: OpenAI GPT-4o-mini (fast, cost-effective)
  2. Secondary: Anthropic Claude 3.5 Haiku (similar capability)
  3. Tertiary: Google Gemini 2.0 Flash (alternative provider)
  4. Final: Return a graceful error message to the user
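The chain above can be sketched as a loop over provider calls. This is an illustration under stated assumptions: `ProviderError`, `GRACEFUL_ERROR`, and `complete_with_fallback` are hypothetical names, the provider calls are stubs, and timeouts and errors are handled identically here (the diagram routes them to different next steps).

```python
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Stand-in for an API error raised by a (hypothetical) provider client."""


GRACEFUL_ERROR = "Sorry, I'm having trouble responding right now. Please try again shortly."


def complete_with_fallback(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[Optional[str], str]:
    """Try each (model_name, call) pair in order; return the first success."""
    for name, call in chain:
        try:
            return name, call(prompt)
        except (TimeoutError, ProviderError):
            continue  # move on to the next model in the chain
    # Every model failed: return a graceful message instead of raising.
    return None, GRACEFUL_ERROR


# Stub provider calls used only to exercise the chain.
def always_times_out(prompt: str) -> str:
    raise TimeoutError("simulated timeout")


def echo(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [
    ("GPT-4o-mini", always_times_out),
    ("Claude 3.5 Haiku", echo),
    ("Gemini 2.0 Flash", echo),
]
model_used, reply = complete_with_fallback("What are your business hours?", chain)
```

With these stubs the primary call times out, so the secondary model answers. If every entry fails, the caller receives the graceful error text instead of an exception.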

Available Models

OpenAI Models

OpenAI provides the primary family of models used across the platform for text and voice interactions.

| Model | Parameters | Context Window | Best For | Latency | Relative Cost |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | ~200B (est.) | 128K tokens | Complex reasoning, nuanced conversations, long-document analysis | Medium | High |
| GPT-4o-mini | ~8B (est.) | 128K tokens | General-purpose text agents, FAQs, lead qualification, most voice interactions | Low | Low |
| GPT-4o-mini (Voice) | ~8B (est.) | 128K tokens | Optimized for real-time voice conversations with tool calling | Very Low | Low |
| text-embedding-3-small | N/A | 8,191 tokens | Knowledge base embeddings, semantic search | Very Low | Very Low |
| text-embedding-3-large | N/A | 8,191 tokens | High-precision embeddings for critical knowledge bases | Very Low | Low |

GPT-4o

The flagship OpenAI model, used for tasks requiring deep reasoning and nuanced understanding:

  • Use Cases: Complex customer inquiries requiring multi-step reasoning, document summarization, contract analysis, nuanced sales conversations
  • Strengths: Excellent at following complex instructions, strong at structured output (JSON), multilingual capability, vision support
  • Limitations: Higher latency and cost compared to mini variants
  • Default Temperature: 0.7

GPT-4o-mini

The workhorse model that handles the majority of agent interactions:

  • Use Cases: Standard text conversations, WhatsApp messaging, lead qualification, FAQ answering, appointment scheduling, most voice interactions
  • Strengths: Very fast response times, excellent cost-efficiency, 128K context window, strong tool-calling capability
  • Limitations: Less capable than GPT-4o on very complex reasoning tasks
  • Default Temperature: 0.7

text-embedding-3-small

The default embedding model for the Knowledge Base (RAG) system:

  • Dimensions: 1,536
  • Use Cases: Converting knowledge base documents into vector embeddings for semantic search
  • Strengths: Excellent balance of quality and speed, cost-effective for large document collections
  • Limitations: Lower precision than the large variant for highly specialized domains
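Semantic search over embeddings usually comes down to comparing vectors by cosine similarity. The sketch below illustrates that comparison with made-up 3-dimensional vectors in place of real 1,536-dimensional embeddings. The document titles, vector values, and function names are all hypothetical, and the platform's actual retrieval pipeline is not described in this document.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec: List[float], documents: Dict[str, List[float]]) -> str:
    """Return the title of the document whose embedding is closest to the query."""
    return max(documents, key=lambda title: cosine_similarity(query_vec, documents[title]))


# Toy vectors standing in for embeddings produced by an embedding model.
documents = {
    "Business hours": [0.9, 0.1, 0.0],
    "Refund policy": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "When are you open?"
best = top_match(query, documents)
```

Here `best` is the "Business hours" entry, because its vector points in nearly the same direction as the query.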

Anthropic Claude Models

Anthropic’s Claude models are used for tasks requiring strong safety guardrails and careful instruction following.

| Model | Context Window | Best For | Latency | Relative Cost |
| --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 200K tokens | Complex analysis, long-document processing, careful reasoning | Medium | High |
| Claude 3.5 Haiku | 200K tokens | Fast general-purpose tasks, fallback for GPT-4o-mini | Low | Medium |

Claude 3.5 Sonnet

Used selectively for high-stakes or complex tasks:

  • Use Cases: Legal document analysis, compliance-related conversations, complex multi-step workflows, tasks requiring very careful instruction following
  • Strengths: Excellent safety characteristics, strong at nuanced reasoning, 200K context window for processing long documents, very good at structured output
  • Limitations: Higher cost, slightly slower than GPT-4o-mini for simple tasks
  • Default Temperature: 0.7

Claude 3.5 Haiku

Serves as both a primary model and a critical fallback:

  • Use Cases: General-purpose conversations, fallback when OpenAI services are degraded, tasks requiring strong safety properties
  • Strengths: Very fast, good quality, strong safety guardrails, large context window
  • Limitations: Less capable than Sonnet on complex reasoning
  • Default Temperature: 0.7

Google Gemini Models

Google’s Gemini models provide additional diversity and are particularly strong in multimodal tasks.

| Model | Context Window | Best For | Latency | Relative Cost |
| --- | --- | --- | --- | --- |
| Gemini 2.0 Flash | 1M tokens | Very long context tasks, multilingual conversations | Low | Medium |
| Gemini 2.0 Flash-Lite | 1M tokens | High-volume simple tasks, cost-sensitive operations | Very Low | Very Low |

Gemini 2.0 Flash

Selected for tasks requiring extremely large context windows:

  • Use Cases: Processing very long documents, multilingual conversations where language detection is important, tasks requiring broad world knowledge
  • Strengths: Massive 1M token context window, strong multilingual capability, fast inference, good at code understanding
  • Limitations: Less precise than GPT-4o on complex structured output
  • Default Temperature: 0.7

Gemini 2.0 Flash-Lite

Used for high-volume, cost-sensitive operations:

  • Use Cases: Bulk message processing, simple FAQ responses, high-traffic campaigns where cost efficiency is critical
  • Strengths: Extremely fast, very low cost, adequate quality for straightforward tasks
  • Limitations: Not suitable for complex reasoning or nuanced conversations
  • Default Temperature: 0.7

Meta Llama Models

Open-source Llama models provide an additional layer of redundancy and are used for specific tasks.

| Model | Parameters | Context Window | Best For | Latency | Relative Cost |
| --- | --- | --- | --- | --- | --- |
| Llama 3.3 70B | 70B | 128K tokens | General-purpose tasks, open-source preference | Medium | Low |
| Llama 3.1 8B | 8B | 128K tokens | Simple tasks, high-volume processing | Very Low | Very Low |

Llama 3.3 70B

Used as both a primary and fallback model:

  • Use Cases: General conversations, tasks where open-source model preference is required, fallback for proprietary models
  • Strengths: Strong open-source model, good multilingual support (especially Spanish and Portuguese), no vendor lock-in
  • Limitations: Slightly less capable than GPT-4o on complex structured tasks
  • Default Temperature: 0.7

Groq LPU Models

Groq’s Language Processing Units (LPUs) deliver some of the fastest inference speeds available, making them well suited to latency-sensitive tasks.

| Model | Best For | Latency | Relative Cost |
| --- | --- | --- | --- |
| Groq Compound Model | Website analysis, multi-step research tasks | Low | Medium |
| Llama 3.3 70B (via Groq) | Ultra-fast text generation, real-time applications | Very Low | Low |

Groq Compound Model

A specialized compound AI system used exclusively for website analysis and prompt generation:

  • Use Cases: Website scraping and analysis, automatic system prompt generation, multi-step research tasks that combine web search, code interpretation, and text generation
  • Components:
    • Web Search — Searches the web for additional business context
    • Code Interpreter — Processes and structures scraped data
    • LLM — Generates the final system prompt
  • Strengths: Compound reasoning across multiple tools, extremely fast execution, purpose-built for the website analysis workflow
  • Limitations: Specialized use case; not available for general agent conversations

Llama 3.3 70B via Groq

The same Llama model running on Groq’s custom hardware for ultra-low latency:

  • Use Cases: Tasks where response speed is the top priority, real-time applications, fallback when other providers are slow
  • Strengths: Fastest available inference (often < 100ms time-to-first-token), cost-effective
  • Limitations: Same capability limitations as the base Llama model
  • Default Temperature: 0.7

Model Usage by Feature

The following table shows which models are used for each platform feature:

| Feature | Primary Model | Fallback Model(s) | Reason |
| --- | --- | --- | --- |
| Text Agents (WhatsApp) | GPT-4o-mini | Claude 3.5 Haiku, Gemini 2.0 Flash | Best balance of quality, speed, and cost for conversational text |
| Voice Agents (Phone) | GPT-4o-mini (Voice) | Llama 3.3 70B via Groq | Ultra-low latency required for natural voice conversations |
| Complex Reasoning | GPT-4o | Claude 3.5 Sonnet | Deep reasoning for complex multi-step tasks |
| Website Analysis | Groq Compound Model | N/A (specialized) | Purpose-built compound AI for web analysis |
| Knowledge Base Embeddings | text-embedding-3-small | text-embedding-3-large | High-quality embeddings at scale |
| Knowledge Base Retrieval | GPT-4o-mini | Claude 3.5 Haiku | Fast synthesis of retrieved context into responses |
| Workflow Automation | GPT-4o-mini | Gemini 2.0 Flash-Lite | Cost-effective processing of structured workflows |
| Campaign Scripts | GPT-4o-mini | Llama 3.3 70B | Good quality for outbound call scripts |

Model Configuration

Default Settings

Each agent type has optimized default model settings:

| Setting | Text Agents | Voice Agents | Custom Agents |
| --- | --- | --- | --- |
| Model | GPT-4o-mini | GPT-4o-mini (Voice) | Configurable |
| Temperature | 0.7 | 0.7 | Configurable |
| Max Tokens | 1,024 | 256 | Configurable |
| Top-P | 1.0 | 1.0 | Configurable |
| Frequency Penalty | 0.0 | 0.0 | Configurable |
| Presence Penalty | 0.0 | 0.0 | Configurable |

Temperature Guide

The temperature parameter controls the randomness of the model’s output:

| Temperature | Behavior | Best For |
| --- | --- | --- |
| 0.0 - 0.2 | Deterministic, consistent responses | Factual Q&A, data lookup, compliance responses |
| 0.3 - 0.5 | Mostly consistent with some variation | Customer support, appointment scheduling, collections |
| 0.5 - 0.7 | Balanced creativity and consistency | Sales conversations, marketing content, general chat |
| 0.7 - 0.9 | Creative and varied responses | Content generation, brainstorming, creative marketing |
| 0.9 - 1.0 | Highly creative, unpredictable | Creative writing, ideation (not recommended for customer-facing) |
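To see why lower temperatures give more consistent output, it helps to look at how temperature is commonly applied: the model's raw scores (logits) are divided by the temperature before the softmax that turns them into probabilities. The sketch below shows that general mechanism with made-up logits. It is not specific to any provider, and `softmax_with_temperature` is a name chosen for this example.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        # T = 0 is typically handled as greedy selection rather than division by zero.
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]                            # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)        # sharply peaked on the top token
warm = softmax_with_temperature(logits, 1.0)        # probability spread more evenly
```

At 0.2 nearly all probability mass sits on the highest-scoring token, which is why low settings suit factual and compliance responses. At 1.0 the alternatives keep a meaningful share, which produces more varied wording.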

Context Window Management

Different models have different context windows. The platform manages context efficiently:

graph LR
    subgraph Context["Context Window (128K tokens)"]
        SP[System Prompt<br/>~500-2,000 tokens]
        KC[Knowledge Base Context<br/>~1,500-3,000 tokens]
        CH[Conversation History<br/>~2,000-20,000 tokens]
        UM[User Message<br/>~50-500 tokens]
        RS[Response Space<br/>~256-1,024 tokens]
    end

Context allocation strategy:

  1. System Prompt — Fixed allocation based on agent type (500-2,000 tokens)
  2. Knowledge Base Context — Up to 3,000 tokens from RAG retrieval (configurable)
  3. Conversation History — Dynamically sized; older messages are truncated as the window fills
  4. User Message — The current incoming message
  5. Response Space — Reserved for the model’s output (256 tokens for voice, up to 1,024 for text)
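As a worked example of the allocation above, the space left for conversation history is whatever remains after the other four parts are reserved. The sketch assumes a 128,000-token window and uses the upper bounds quoted in this section. `history_budget` is a hypothetical helper, not a platform function.

```python
def history_budget(
    context_window: int,
    system_prompt: int,
    kb_context: int,
    user_message: int,
    response_space: int,
) -> int:
    """Tokens left for conversation history after the fixed allocations."""
    reserved = system_prompt + kb_context + user_message + response_space
    remaining = context_window - reserved
    if remaining < 0:
        raise ValueError("fixed allocations exceed the context window")
    return remaining


# Upper bounds from this section, for a text agent on a 128K-token model.
text_agent_history = history_budget(
    context_window=128_000,
    system_prompt=2_000,
    kb_context=3_000,
    user_message=500,
    response_space=1_024,
)
# 128,000 - (2,000 + 3,000 + 500 + 1,024) = 121,476 tokens
```

Even with every fixed part at its maximum, most of the window is still available for history, so in practice the configured history limits are reached before the model's window is.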

Conversation Memory

The platform maintains conversation history within the context window:

| Setting | Default | Description |
| --- | --- | --- |
| Max History Messages | 20 | Maximum number of previous messages included in context |
| Max History Tokens | 10,000 | Maximum total tokens allocated to conversation history |
| Truncation Strategy | Sliding Window | Oldest messages are dropped first when limits are reached |
| Summary Injection | Disabled | When enabled, older messages are summarized instead of dropped |
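A sliding-window truncation with the defaults above might look like the following sketch. It walks from the newest message backwards and stops as soon as either limit would be exceeded. `truncate_history` and `approx_tokens` are hypothetical names, and the word-count tokenizer is a crude stand-in for a real one.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def truncate_history(
    messages: List[str],
    max_messages: int = 20,
    max_tokens: int = 10_000,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep the most recent messages that fit both limits, oldest dropped first."""
    kept: List[str] = []
    used = 0
    for msg in reversed(messages):              # newest first
        cost = count_tokens(msg)
        if len(kept) >= max_messages or used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()                              # restore chronological order
    return kept


history = [f"message {i}" for i in range(30)]
window = truncate_history(history)              # keeps "message 10" through "message 29"
```

With Summary Injection enabled, the dropped prefix would be replaced by a summary rather than discarded. That step is omitted here.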

Cost Comparison

Per-Token Pricing (Approximate)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
| --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | Premium model for complex tasks |
| GPT-4o-mini | $0.15 | $0.60 | Primary workhorse model |
| Claude 3.5 Sonnet | $3.00 | $15.00 | High-quality reasoning |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast and capable fallback |
| Gemini 2.0 Flash | $0.10 | $0.40 | Cost-effective with large context |
| Gemini 2.0 Flash-Lite | $0.05 | $0.20 | Ultra-low cost for simple tasks |
| Llama 3.3 70B | $0.20 | $0.20 | Open-source, competitive pricing |
| text-embedding-3-small | $0.02 | N/A | Embeddings (input only) |
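To turn the per-million prices into a per-request figure, multiply each token count by its rate and divide by one million. The sketch below does this for a few rows of the table. The token counts are made-up inputs, and `request_cost` is a name chosen for this example.

```python
# Approximate USD prices per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Raw token cost in USD for a single request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# A hypothetical text message: 1,500 input tokens and 300 output tokens.
mini_cost = request_cost("GPT-4o-mini", 1_500, 300)   # about $0.000405
full_cost = request_cost("GPT-4o", 1_500, 300)        # about $0.00675
```

For this message shape GPT-4o costs roughly 17 times as much as GPT-4o-mini, which is the gap behind the routing defaults described earlier.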

How This Maps to Credits

4Geeks abstracts raw token pricing into a simplified credit system:

| Interaction Type | Credits Consumed | Approximate Token Equivalent |
| --- | --- | --- |
| Text Message | ~40 credits | ~1,000-2,000 total tokens (input + output) |
| Voice Minute | ~400 credits | Includes STT, LLM, TTS, and telephony costs |
| Knowledge Base Query | Included in message cost | Retrieval is bundled with the text message |
| Website Analysis | ~200 credits | One-time cost per analysis |
| Embedding (per document) | ~10 credits per 100 pages | One-time cost during document processing |

Note

Credit figures are estimates. Actual consumption varies with the model selected, conversation complexity, knowledge base retrieval depth, and the number of tool calls executed during an interaction.
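For rough budgeting, the approximate per-interaction figures above can be combined into a monthly estimate. This is a back-of-the-envelope sketch only: the constants are the "~" values from the table, the traffic numbers are invented, and `estimate_credits` is not a platform API.

```python
# Approximate credit costs from the table above.
CREDITS_PER_TEXT_MESSAGE = 40
CREDITS_PER_VOICE_MINUTE = 400
CREDITS_PER_WEBSITE_ANALYSIS = 200


def estimate_credits(text_messages: int, voice_minutes: int, website_analyses: int = 0) -> int:
    """Rough credit estimate; actual consumption varies per interaction."""
    return (
        text_messages * CREDITS_PER_TEXT_MESSAGE
        + voice_minutes * CREDITS_PER_VOICE_MINUTE
        + website_analyses * CREDITS_PER_WEBSITE_ANALYSIS
    )


# Hypothetical month: 5,000 text messages, 300 voice minutes, 1 website analysis.
monthly = estimate_credits(5_000, 300, 1)
# 200,000 + 120,000 + 200 = 320,200 credits
```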

Model Selection Best Practices

For Agent Creators

When configuring a custom agent, consider these guidelines:

| Scenario | Recommended Model | Why |
| --- | --- | --- |
| High-volume WhatsApp support | GPT-4o-mini (default) | Best quality-to-cost ratio for conversational text |
| Premium customer experience | GPT-4o | Superior reasoning for complex, high-value interactions |
| Voice agent (phone) | GPT-4o-mini (Voice) | Optimized for low-latency voice conversations |
| Multilingual (Spanish/Portuguese) | GPT-4o-mini or Llama 3.3 70B | Both have strong multilingual capabilities |
| Long document analysis | Claude 3.5 Sonnet or Gemini 2.0 Flash | Large context windows (200K-1M tokens) |
| Cost-sensitive campaigns | Gemini 2.0 Flash-Lite | Lowest cost for simple, repetitive tasks |
| Maximum speed | Llama 3.3 70B via Groq | Fastest inference available |

For Developers

When building custom integrations via the API:

{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "What are your business hours?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "check_availability",
        "description": "Check business availability"
      }
    }
  ]
}

The API is fully compatible with the OpenAI API format, making it easy to switch between models without changing your integration code.
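Because the request body follows the OpenAI chat format, switching models should only require changing the `model` field. The sketch below builds the payload shown above as a Python dictionary and demonstrates this. `build_chat_request` is a hypothetical helper, and the second model identifier is an assumption, since this document does not list the exact identifiers the Gateway accepts.

```python
import json
from typing import Any, Dict


def build_chat_request(
    model: str,
    user_message: str,
    system_prompt: str = "You are a helpful assistant...",
    temperature: float = 0.7,
    max_tokens: int = 1024,
) -> Dict[str, Any]:
    """Build an OpenAI-style chat request body; only `model` varies per provider."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "check_availability",
                    "description": "Check business availability",
                },
            }
        ],
    }


question = "What are your business hours?"
primary = build_chat_request("gpt-4o-mini", question)
# "claude-3.5-haiku" is an assumed identifier, used only for illustration.
alternate = build_chat_request("claude-3.5-haiku", question)

# The two payloads differ only in the "model" field.
changed_fields = {key for key in primary if primary[key] != alternate[key]}
body = json.dumps(primary)  # serialized body, ready to send as the request payload
```

The endpoint URL and authentication are deliberately left out, since they are not specified in this section.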

Model Updates & Versioning

The 4Geeks platform continuously evaluates and updates its model lineup:

  • Seamless Upgrades — When a new model version is released (e.g., GPT-4o-mini-2024-07-18 → GPT-4o-mini), the Gateway automatically routes to the latest version
  • Backward Compatibility — The OpenAI-compatible API format ensures that model upgrades do not break existing integrations
  • A/B Testing — New models are tested against current models using A/B testing before full rollout
  • Regression Testing — Every model update is validated against a benchmark suite of agent interactions to ensure quality does not decrease
  • Changelog — Significant model changes are documented in the platform changelog
