LLM Models¶

4Geeks AI Agents uses a multi-model architecture through the Private AI Gateway, dynamically selecting the most suitable Large Language Model for each interaction. This document describes every available model, how each one is used, and how the platform decides which model to use for a given task.

Multi-Model Orchestration¶

The 4Geeks Private AI Gateway does not rely on a single LLM provider. Instead, it orchestrates across multiple providers to ensure the best combination of quality, speed, and cost for each interaction.

graph LR
    subgraph Request["Incoming Request"]
        MSG[User Message]
        CTX[Context Type]
        CFG[Agent Config]
    end

    subgraph Router["Gateway Router"]
        AN[Analyzer]
        SEL[Model Selector]
        LB[Load Balancer]
        FB[Fallback Handler]
    end

    subgraph Providers["LLM Providers"]
        OA[OpenAI]
        ANTH[Anthropic]
        GOOG[Google]
        META[Meta / Llama]
        GR[Groq]
    end

    MSG --> AN
    CTX --> AN
    CFG --> AN
    AN --> SEL
    SEL --> LB
    LB --> OA
    LB --> ANTH
    LB --> GOOG
    LB --> META
    LB --> GR
    OA --> FB
    ANTH --> FB
    GOOG --> FB
    META --> FB
    GR --> FB

How Model Selection Works¶

For every request, the Gateway evaluates several factors:

Factor	Description
Task Type	Text conversation, voice response, website analysis, embedding, or tool execution
Latency Requirement	Voice interactions require ultra-low latency; text can tolerate slightly longer processing
Complexity	Simple FAQs use lightweight models; complex reasoning tasks use frontier models
Agent Configuration	Each agent can specify preferred models or allow dynamic selection
Cost Optimization	The gateway prefers cost-effective models when quality requirements are met
Provider Health	If a provider is experiencing issues, traffic is automatically rerouted
Rate Limits	Requests are distributed across API keys to avoid hitting provider rate limits
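
The factors above can be sketched as a small selection function. This is a minimal illustration, not the Gateway's actual router: the `Request` fields, the `ROUTES` table, and the model labels are all assumptions made for the example, and cost optimization and rate-limit distribution are left out for brevity.

```python
from dataclasses import dataclass
from typing import AbstractSet, Optional

# Illustrative routing table (assumption): ordered candidates per task type,
# loosely following the feature table later in this document. The labels are
# placeholders, not confirmed Gateway model identifiers.
ROUTES = {
    "text": ["gpt-4o-mini", "claude-3.5-haiku", "gemini-2.0-flash"],
    "voice": ["gpt-4o-mini-voice", "llama-3.3-70b-groq"],
    "website_analysis": ["groq-compound"],
    "embedding": ["text-embedding-3-small", "text-embedding-3-large"],
}


@dataclass
class Request:
    task_type: str                         # one of the ROUTES keys
    complex_reasoning: bool = False        # complexity factor
    preferred_model: Optional[str] = None  # agent configuration override


def select_model(req: Request, unhealthy: AbstractSet[str] = frozenset()) -> str:
    """Return the first healthy candidate for this request."""
    candidates = []
    if req.preferred_model:
        # Agent configuration takes priority over dynamic selection.
        candidates.append(req.preferred_model)
    if req.task_type == "text" and req.complex_reasoning:
        # Complex reasoning is routed to frontier models first.
        candidates += ["gpt-4o", "claude-3.5-sonnet"]
    candidates += ROUTES[req.task_type]
    for model in candidates:
        if model not in unhealthy:  # provider health check
            return model
    raise RuntimeError("no healthy model available")


print(select_model(Request("text")))
print(select_model(Request("voice"), unhealthy={"gpt-4o-mini-voice"}))
```

Keeping the candidates as an ordered list means provider health only has to filter the list; the priority order itself stays in one place.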

Automatic Fallbacks¶

If the primary model or provider is unavailable, the Gateway automatically falls back to an equivalent model:

graph TD
    A[Primary Model Request] --> B{Success?}
    B -->|Yes| C[Return Response]
    B -->|No - Timeout| D[Try Secondary Model]
    B -->|No - Error| E[Try Alternate Provider]
    D --> F{Success?}
    F -->|Yes| C
    F -->|No| G[Try Tertiary Model]
    E --> H{Success?}
    H -->|Yes| C
    H -->|No| G
    G --> I{Success?}
    I -->|Yes| C
    I -->|No| J[Return Graceful Error]

Fallback chain example for a text agent:

Primary: OpenAI GPT-4o-mini (fast, cost-effective)
Secondary: Anthropic Claude 3.5 Haiku (similar capability)
Tertiary: Google Gemini 2.0 Flash (alternative provider)
Final: Return a graceful error message to the user
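
A fallback chain like this one can be sketched as a loop over candidate models. The example below is an assumption-laden illustration: `complete_with_fallback`, the stub provider functions, and the error message are hypothetical, and unlike the diagram it treats timeouts and errors the same way.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical user-facing message returned when every model fails.
GRACEFUL_ERROR = "Sorry, I'm having trouble responding right now. Please try again shortly."


def complete_with_fallback(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (model_name, call) pair in order and return (model_name, reply)."""
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception:
            # Timeout or provider error: move on to the next model in the chain.
            continue
    return "none", GRACEFUL_ERROR


# Stub providers standing in for real API calls.
def timed_out(prompt: str) -> str:
    raise TimeoutError("provider timed out")


def echo(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [
    ("gpt-4o-mini", timed_out),   # primary fails
    ("claude-3.5-haiku", echo),   # secondary answers
    ("gemini-2.0-flash", echo),   # tertiary is never reached
]
print(complete_with_fallback("hi", chain))
```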

Available Models¶

OpenAI Models¶

OpenAI provides the primary family of models used across the platform for text and voice interactions.

Model	Parameters	Context Window	Best For	Latency	Relative Cost
GPT-4o	Undisclosed (~200B, unofficial est.)	128K tokens	Complex reasoning, nuanced conversations, long-document analysis	Medium	High
GPT-4o-mini	Undisclosed (~8B, unofficial est.)	128K tokens	General-purpose text agents, FAQs, lead qualification, most voice interactions	Low	Low
GPT-4o-mini (Voice)	Undisclosed (~8B, unofficial est.)	128K tokens	Optimized for real-time voice conversations with tool calling	Very Low	Low
text-embedding-3-small	N/A	8,191 tokens	Knowledge base embeddings, semantic search	Very Low	Very Low
text-embedding-3-large	N/A	8,191 tokens	High-precision embeddings for critical knowledge bases	Very Low	Low

GPT-4o¶

The flagship OpenAI model, used for tasks requiring deep reasoning and nuanced understanding:

Use Cases: Complex customer inquiries requiring multi-step reasoning, document summarization, contract analysis, nuanced sales conversations
Strengths: Excellent at following complex instructions, strong at structured output (JSON), multilingual capability, vision support
Limitations: Higher latency and cost compared to mini variants
Default Temperature: 0.7

GPT-4o-mini¶

The workhorse model that handles the majority of agent interactions:

Use Cases: Standard text conversations, WhatsApp messaging, lead qualification, FAQ answering, appointment scheduling, most voice interactions
Strengths: Very fast response times, excellent cost-efficiency, 128K context window, strong tool-calling capability
Limitations: Less capable than GPT-4o on very complex reasoning tasks
Default Temperature: 0.7

text-embedding-3-small¶

The default embedding model for the Knowledge Base (RAG) system:

Dimensions: 1,536
Use Cases: Converting knowledge base documents into vector embeddings for semantic search
Strengths: Excellent balance of quality and speed, cost-effective for large document collections
Limitations: Lower precision than the large variant for highly specialized domains
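
Semantic search over embeddings typically ranks documents by cosine similarity to the query vector. The sketch below shows that ranking step with toy 3-dimensional vectors standing in for the 1,536-dimensional embeddings; the function names and sample data are illustrative assumptions, and the platform's actual retrieval pipeline is not described in this document.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy vectors: real embeddings would come from the embedding model.
docs = {
    "hours": [1.0, 0.0, 0.0],
    "pricing": [0.0, 1.0, 0.0],
    "location": [0.9, 0.1, 0.0],
}
print(top_k([1.0, 0.0, 0.0], docs))
```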

Anthropic Claude Models¶

Anthropic’s Claude models are used for tasks requiring strong safety guardrails and careful instruction following.

Model	Context Window	Best For	Latency	Relative Cost
Claude 3.5 Sonnet	200K tokens	Complex analysis, long-document processing, careful reasoning	Medium	High
Claude 3.5 Haiku	200K tokens	Fast general-purpose tasks, fallback for GPT-4o-mini	Low	Medium

Claude 3.5 Sonnet¶

Used selectively for high-stakes or complex tasks:

Use Cases: Legal document analysis, compliance-related conversations, complex multi-step workflows, tasks requiring very careful instruction following
Strengths: Excellent safety characteristics, strong at nuanced reasoning, 200K context window for processing long documents, very good at structured output
Limitations: Higher cost, slightly slower than GPT-4o-mini for simple tasks
Default Temperature: 0.7

Claude 3.5 Haiku¶

Serves as both a primary model and a critical fallback:

Use Cases: General-purpose conversations, fallback when OpenAI services are degraded, tasks requiring strong safety properties
Strengths: Very fast, good quality, strong safety guardrails, large context window
Limitations: Less capable than Sonnet on complex reasoning
Default Temperature: 0.7

Google Gemini Models¶

Google’s Gemini models provide additional diversity and are particularly strong in multimodal tasks.

Model	Context Window	Best For	Latency	Relative Cost
Gemini 2.0 Flash	1M tokens	Very long context tasks, multilingual conversations	Low	Medium
Gemini 2.0 Flash-Lite	1M tokens	High-volume simple tasks, cost-sensitive operations	Very Low	Very Low

Gemini 2.0 Flash¶

Selected for tasks requiring extremely large context windows:

Use Cases: Processing very long documents, multilingual conversations where language detection is important, tasks requiring broad world knowledge
Strengths: Massive 1M token context window, strong multilingual capability, fast inference, good at code understanding
Limitations: Less precise than GPT-4o on complex structured output
Default Temperature: 0.7

Gemini 2.0 Flash-Lite¶

Used for high-volume, cost-sensitive operations:

Use Cases: Bulk message processing, simple FAQ responses, high-traffic campaigns where cost efficiency is critical
Strengths: Extremely fast, very low cost, adequate quality for straightforward tasks
Limitations: Not suitable for complex reasoning or nuanced conversations
Default Temperature: 0.7

Meta Llama Models¶

Open-source Llama models provide an additional layer of redundancy and are used for specific tasks.

Model	Parameters	Context Window	Best For	Latency	Relative Cost
Llama 3.3 70B	70B	128K tokens	General-purpose tasks, open-source preference	Medium	Low
Llama 3.1 8B	8B	128K tokens	Simple tasks, high-volume processing	Very Low	Very Low

Llama 3.3 70B¶

Used as both a primary and fallback model:

Use Cases: General conversations, tasks where open-source model preference is required, fallback for proprietary models
Strengths: Strong open-source model, good multilingual support (especially Spanish and Portuguese), no vendor lock-in
Limitations: Slightly less capable than GPT-4o on complex structured tasks
Default Temperature: 0.7

Groq LPU Models¶

Groq’s Language Processing Units (LPUs) are purpose-built for very fast inference, making them well suited to latency-sensitive tasks.

Model	Best For	Latency	Relative Cost
Groq Compound Model	Website analysis, multi-step research tasks	Low	Medium
Llama 3.3 70B (via Groq)	Ultra-fast text generation, real-time applications	Very Low	Low

Groq Compound Model¶

A specialized compound AI system used exclusively for website analysis and prompt generation:

Use Cases: Website scraping and analysis, automatic system prompt generation, multi-step research tasks that combine web search, code interpretation, and text generation
Components:
- Web Search — Searches the web for additional business context
- Code Interpreter — Processes and structures scraped data
- LLM — Generates the final system prompt
Strengths: Compound reasoning across multiple tools, extremely fast execution, purpose-built for the website analysis workflow
Limitations: Specialized use case; not available for general agent conversations

Llama 3.3 70B via Groq¶

The same Llama model running on Groq’s custom hardware for ultra-low latency:

Use Cases: Tasks where response speed is the top priority, real-time applications, fallback when other providers are slow
Strengths: Fastest available inference (often < 100ms time-to-first-token), cost-effective
Limitations: Same capability limitations as the base Llama model
Default Temperature: 0.7

Model Usage by Feature¶

The following table shows which models are used for each platform feature:

Feature	Primary Model	Fallback Model(s)	Reason
Text Agents (WhatsApp)	GPT-4o-mini	Claude 3.5 Haiku, Gemini 2.0 Flash	Best balance of quality, speed, and cost for conversational text
Voice Agents (Phone)	GPT-4o-mini (Voice)	Llama 3.3 70B via Groq	Ultra-low latency required for natural voice conversations
Complex Reasoning	GPT-4o	Claude 3.5 Sonnet	Deep reasoning for complex multi-step tasks
Website Analysis	Groq Compound Model	N/A (specialized)	Purpose-built compound AI for web analysis
Knowledge Base Embeddings	text-embedding-3-small	text-embedding-3-large	High-quality embeddings at scale
Knowledge Base Retrieval	GPT-4o-mini	Claude 3.5 Haiku	Fast synthesis of retrieved context into responses
Workflow Automation	GPT-4o-mini	Gemini 2.0 Flash-Lite	Cost-effective processing of structured workflows
Campaign Scripts	GPT-4o-mini	Llama 3.3 70B	Good quality for outbound call scripts

Model Configuration¶

Default Settings¶

Each agent type has optimized default model settings:

Setting	Text Agents	Voice Agents	Custom Agents
Model	GPT-4o-mini	GPT-4o-mini (Voice)	Configurable
Temperature	0.7	0.7	Configurable
Max Tokens	1,024	256	Configurable
Top-P	1.0	1.0	Configurable
Frequency Penalty	0.0	0.0	Configurable
Presence Penalty	0.0	0.0	Configurable

Temperature Guide¶

The temperature parameter controls the randomness of the model’s output:

Temperature	Behavior	Best For
0.0 - 0.2	Deterministic, consistent responses	Factual Q&A, data lookup, compliance responses
0.3 - 0.5	Mostly consistent with some variation	Customer support, appointment scheduling, collections
0.6 - 0.7	Balanced creativity and consistency	Sales conversations, marketing content, general chat
0.8 - 0.9	Creative and varied responses	Content generation, brainstorming, creative marketing
1.0	Highly creative, unpredictable	Creative writing, ideation (not recommended for customer-facing)
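
The effect of temperature can be illustrated with the standard formulation, in which the model's token scores (logits) are divided by the temperature before the softmax. The sketch below uses made-up logits for three candidate tokens; it is a conceptual illustration, and each provider's sampling internals may differ in detail.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat zero as greedy decoding: all probability on the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which is why they suit factual and compliance responses; higher temperatures spread probability across alternatives.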

Context Window Management¶

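Context windows vary by model, from 128K tokens (GPT-4o family, Llama) to 1M tokens (Gemini). The platform divides each request's window among the components below, shown here for a 128K-token model: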

graph LR
    subgraph Context["Context Window (128K tokens)"]
        SP[System Prompt<br/>~500-2,000 tokens]
        KC[Knowledge Base Context<br/>~1,500-3,000 tokens]
        CH[Conversation History<br/>~2,000-20,000 tokens]
        UM[User Message<br/>~50-500 tokens]
        RS[Response Space<br/>~256-1,024 tokens]
    end

Context allocation strategy:

System Prompt — Fixed allocation based on agent type (500-2,000 tokens)
Knowledge Base Context — Up to 3,000 tokens from RAG retrieval (configurable)
Conversation History — Dynamically sized; older messages are truncated as the window fills
User Message — The current incoming message
Response Space — Reserved for the model’s output (256 tokens for voice, up to 1,024 for text)
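
As a worked example of this allocation, the sketch below computes how many tokens remain for conversation history once the other components are reserved. The function name is hypothetical, the inputs are the upper bounds quoted above, and 128K is taken as 128,000 tokens.

```python
def history_budget(
    context_window: int,
    system_prompt: int,
    kb_context: int,
    user_message: int,
    response_space: int,
) -> int:
    """Tokens left for conversation history after the fixed allocations."""
    reserved = system_prompt + kb_context + user_message + response_space
    remaining = context_window - reserved
    if remaining < 0:
        raise ValueError("fixed allocations exceed the context window")
    return remaining


# Upper-bound figures for a text agent on a 128K-token model.
available = history_budget(
    context_window=128_000,
    system_prompt=2_000,
    kb_context=3_000,
    user_message=500,
    response_space=1_024,
)
print(available)  # 121476

# The default Max History Tokens setting (10,000) caps this further.
print(min(available, 10_000))  # 10000
```

Even in the worst case the fixed components use only a small share of a 128K window, so in practice the history limit settings, not the model's window, decide how much history is sent.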

Conversation Memory¶

The platform maintains conversation history within the context window:

Setting	Default	Description
Max History Messages	20	Maximum number of previous messages included in context
Max History Tokens	10,000	Maximum total tokens allocated to conversation history
Truncation Strategy	Sliding Window	Oldest messages are dropped first when limits are reached
Summary Injection	Disabled	When enabled, older messages are summarized instead of dropped
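
The sliding-window strategy can be sketched as follows. This is an illustration under assumptions: the function name, the message shape (dicts with a `content` key), and the four-characters-per-token estimate are not taken from the platform, which presumably uses a real tokenizer.

```python
from typing import Callable, Dict, List, Optional


def truncate_history(
    messages: List[Dict[str, str]],
    max_messages: int = 20,
    max_tokens: int = 10_000,
    count_tokens: Optional[Callable[[str], int]] = None,
) -> List[Dict[str, str]]:
    """Keep the newest messages that fit both limits, in original order."""
    # Rough heuristic (assumption): about 4 characters per token.
    count = count_tokens or (lambda text: max(1, len(text) // 4))
    kept: List[Dict[str, str]] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count(message["content"])
        if len(kept) >= max_messages or used + cost > max_tokens:
            break  # everything older than this point is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [{"role": "user", "content": f"message {i:02d} " + "x" * 30} for i in range(25)]
window = truncate_history(history)
print(len(window), window[0]["content"][:10])
```

With Summary Injection enabled, the dropped prefix would be replaced by a summary message instead of being discarded; that step is omitted here.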

Cost Comparison¶

Per-Token Pricing (Approximate)¶

Model	Input (per 1M tokens)	Output (per 1M tokens)	Notes
GPT-4o	$2.50	$10.00	Premium model for complex tasks
GPT-4o-mini	$0.15	$0.60	Primary workhorse model
Claude 3.5 Sonnet	$3.00	$15.00	High-quality reasoning
Claude 3.5 Haiku	$0.80	$4.00	Fast and capable fallback
Gemini 2.0 Flash	$0.10	$0.40	Cost-effective with large context
Gemini 2.0 Flash-Lite	$0.05	$0.20	Ultra-low cost for simple tasks
Llama 3.3 70B	$0.20	$0.20	Open-source, competitive pricing
text-embedding-3-small	$0.02	N/A	Embeddings (input only)
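
To make the table concrete, the sketch below estimates the raw token cost of a single request. The prices are copied from the table above; the function name and the example token counts (1,500 input, 500 output) are illustrative assumptions.

```python
# (input, output) price in USD per 1M tokens, from the table above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Gemini 2.0 Flash": (0.10, 0.40),
    "Gemini 2.0 Flash-Lite": (0.05, 0.20),
    "Llama 3.3 70B": (0.20, 0.20),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Approximate cost in USD of one request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for name in ("GPT-4o-mini", "GPT-4o"):
    print(name, f"${request_cost(name, 1_500, 500):.6f}")
```

For this example the GPT-4o-mini request costs about $0.000525 and the same request on GPT-4o about $0.00875, roughly 17 times more, which is why the lighter model is the default for high-volume traffic.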

How This Maps to Credits¶

4Geeks abstracts raw token pricing into a simplified credit system:

Interaction Type	Credits Consumed	Approximate Token Equivalent
Text Message	~40 credits	~1,000-2,000 total tokens (input + output)
Voice Minute	~400 credits	Includes STT, LLM, TTS, and telephony costs
Knowledge Base Query	Included in message cost	Retrieval is bundled with the text message
Website Analysis	~200 credits	One-time cost per analysis
Embedding (per document)	~10 credits per 100 pages	One-time cost during document processing

Note

Credit consumption is an estimate. Actual costs vary based on the model selected, conversation complexity, knowledge base retrieval depth, and the number of tool calls executed during an interaction.
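
A rough monthly credit budget can be derived from the approximate rates in the table. The sketch below is illustrative only: the function name and usage figures are hypothetical, the rates are the "~" values above, and embedding cost is assumed to scale proportionally with page count.

```python
import math

# Approximate rates from the table above.
CREDITS_PER_TEXT_MESSAGE = 40
CREDITS_PER_VOICE_MINUTE = 400
CREDITS_PER_WEBSITE_ANALYSIS = 200
CREDITS_PER_100_PAGES = 10


def estimate_credits(text_messages: int, voice_minutes: int,
                     website_analyses: int, embedded_pages: int) -> int:
    """Approximate credits for a given usage profile."""
    # Assumption: embedding cost is pro-rated by page count and rounded up.
    embedding = math.ceil(embedded_pages * CREDITS_PER_100_PAGES / 100)
    return (
        text_messages * CREDITS_PER_TEXT_MESSAGE
        + voice_minutes * CREDITS_PER_VOICE_MINUTE
        + website_analyses * CREDITS_PER_WEBSITE_ANALYSIS
        + embedding
    )


# Hypothetical month: 5,000 messages, 300 voice minutes, 1 analysis, 250 pages.
print(estimate_credits(5_000, 300, 1, 250))  # 320225
```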

Model Selection Best Practices¶

For Agent Creators¶

When configuring a custom agent, consider these guidelines:

Scenario	Recommended Model	Why
High-volume WhatsApp support	GPT-4o-mini (default)	Best quality-to-cost ratio for conversational text
Premium customer experience	GPT-4o	Superior reasoning for complex, high-value interactions
Voice agent (phone)	GPT-4o-mini (Voice)	Optimized for low-latency voice conversations
Multilingual (Spanish/Portuguese)	GPT-4o-mini or Llama 3.3 70B	Both have strong multilingual capabilities
Long document analysis	Claude 3.5 Sonnet or Gemini 2.0 Flash	Large context windows (200K-1M tokens)
Cost-sensitive campaigns	Gemini 2.0 Flash-Lite	Lowest cost for simple, repetitive tasks
Maximum speed	Llama 3.3 70B via Groq	Fastest inference available

For Developers¶

When building custom integrations via the API:

{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "What are your business hours?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "check_availability",
        "description": "Check business availability"
      }
    }
  ]
}

The API is fully compatible with the OpenAI API format, making it easy to switch between models without changing your integration code.
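
Because the request body follows the OpenAI chat format, switching models should only require changing the `model` field. The sketch below builds the payload shown above for two models; `build_chat_request`, the `date` parameter in the tool schema, and the second model identifier are assumptions, since this document does not list the Gateway's model names or endpoint.

```python
import json
from typing import Any, Dict

# Tool definition in the OpenAI function-calling format. The "parameters"
# schema and its "date" field are hypothetical additions for illustration.
CHECK_AVAILABILITY_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "check_availability",
        "description": "Check business availability",
        "parameters": {
            "type": "object",
            "properties": {"date": {"type": "string", "description": "ISO 8601 date"}},
            "required": ["date"],
        },
    },
}


def build_chat_request(model: str, user_message: str,
                       temperature: float = 0.7, max_tokens: int = 1024) -> Dict[str, Any]:
    """Build a chat request body; only `model` varies between providers."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant..."},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
        "tools": [CHECK_AVAILABILITY_TOOL],
    }


question = "What are your business hours?"
primary = build_chat_request("gpt-4o-mini", question)
alternate = build_chat_request("claude-3-5-haiku", question)  # identifier is an assumption
print(json.dumps(primary, indent=2))
```

The resulting dictionary would be sent as the JSON body of a POST request to the platform's chat endpoint.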

Model Updates & Versioning¶

The 4Geeks platform continuously evaluates and updates its model lineup:

Seamless Upgrades — When a new model version is released (e.g., the gpt-4o-mini alias moving from the gpt-4o-mini-2024-07-18 snapshot to a newer one), the Gateway automatically routes to the latest version
Backward Compatibility — The OpenAI-compatible API format ensures that model upgrades do not break existing integrations
A/B Testing — New models are tested against current models using A/B testing before full rollout
Regression Testing — Every model update is validated against a benchmark suite of agent interactions to ensure quality does not decrease
Changelog — Significant model changes are documented in the platform changelog

What’s Next¶

Cloud & LLM Architecture — Full technical architecture of the platform
The Human Team — How human experts orchestrate and improve AI agents
Knowledge Base (RAG) — Configure domain-specific knowledge for your agents
Pricing & Credits Model — Understand how credit consumption maps to model usage

Still have questions? Ask on Discord or explore the tutorials