LLM Models¶
4Geeks AI Agents uses a multi-model architecture through the Private AI Gateway, which dynamically selects a Large Language Model for each interaction. This document describes each available model, how it is used, and how the platform decides which model handles a given task.
Multi-Model Orchestration¶
The 4Geeks Private AI Gateway does not rely on a single LLM provider. Instead, it orchestrates across multiple providers to ensure the best combination of quality, speed, and cost for each interaction.
graph LR
subgraph Request["Incoming Request"]
MSG[User Message]
CTX[Context Type]
CFG[Agent Config]
end
subgraph Router["Gateway Router"]
AN[Analyzer]
SEL[Model Selector]
LB[Load Balancer]
FB[Fallback Handler]
end
subgraph Providers["LLM Providers"]
OA[OpenAI]
ANTH[Anthropic]
GOOG[Google]
META[Meta / Llama]
GR[Groq]
end
MSG --> AN
CTX --> AN
CFG --> AN
AN --> SEL
SEL --> LB
LB --> OA
LB --> ANTH
LB --> GOOG
LB --> META
LB --> GR
OA --> FB
ANTH --> FB
GOOG --> FB
META --> FB
GR --> FB
How Model Selection Works¶
For every request, the Gateway evaluates several factors:
| Factor | Description |
|---|---|
| Task Type | Text conversation, voice response, website analysis, embedding, or tool execution |
| Latency Requirement | Voice interactions require ultra-low latency; text can tolerate slightly longer processing |
| Complexity | Simple FAQs use lightweight models; complex reasoning tasks use frontier models |
| Agent Configuration | Each agent can specify preferred models or allow dynamic selection |
| Cost Optimization | The gateway prefers cost-effective models when quality requirements are met |
| Provider Health | If a provider is experiencing issues, traffic is automatically rerouted |
| Rate Limits | Requests are distributed across API keys to avoid hitting provider rate limits |
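The factors above can be sketched as a small selection function. This is a minimal illustration only, not the Gateway's actual router: the model identifiers, the `CANDIDATES` table, and the `GatewayRequest` type are hypothetical names chosen for the example, and only three of the factors (agent configuration, complexity, provider health) are modeled.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical identifiers; the Gateway's real model names may differ.
PROVIDER_OF = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "claude-3.5-sonnet": "anthropic",
    "claude-3.5-haiku": "anthropic",
    "gemini-2.0-flash": "google",
}

# Ordered candidates per (task type, complexity), preferred model first.
CANDIDATES = {
    ("text", "simple"): ["gpt-4o-mini", "claude-3.5-haiku", "gemini-2.0-flash"],
    ("text", "complex"): ["gpt-4o", "claude-3.5-sonnet"],
}


@dataclass
class GatewayRequest:
    task_type: str                         # e.g. "text"
    complexity: str                        # "simple" or "complex"
    preferred_model: Optional[str] = None  # set by the agent configuration


def select_model(request: GatewayRequest, unhealthy_providers: Set[str]) -> str:
    """Return the first candidate whose provider is currently healthy."""
    candidates = list(CANDIDATES[(request.task_type, request.complexity)])
    if request.preferred_model is not None:
        # An agent-level preference is tried before the dynamic candidates.
        candidates.insert(0, request.preferred_model)
    for model in candidates:
        provider = PROVIDER_OF.get(model)
        if provider is None:
            continue  # unknown model name: skip it
        if provider not in unhealthy_providers:
            return model
    raise RuntimeError("no healthy provider for this request")
```

For example, `select_model(GatewayRequest("text", "simple"), {"openai"})` skips the OpenAI model and returns the Anthropic candidate, which mirrors the "Provider Health" row of the table.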
Automatic Fallbacks¶
If the primary model or provider is unavailable, the Gateway automatically falls back to an equivalent model:
graph TD
A[Primary Model Request] --> B{Success?}
B -->|Yes| C[Return Response]
B -->|No - Timeout| D[Try Secondary Model]
B -->|No - Error| E[Try Alternate Provider]
D --> F{Success?}
F -->|Yes| C
F -->|No| G[Try Tertiary Model]
E --> H{Success?}
H -->|Yes| C
H -->|No| G
G --> I{Success?}
I -->|Yes| C
I -->|No| J[Return Graceful Error]
Fallback chain example for a text agent:
- Primary: OpenAI GPT-4o-mini (fast, cost-effective)
- Secondary: Anthropic Claude 3.5 Haiku (similar capability)
- Tertiary: Google Gemini 2.0 Flash (alternative provider)
- Final: Return a graceful error message to the user
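The chain above can be sketched as a simple loop. This is an illustrative sketch under stated assumptions: `call_model` stands in for whatever provider client is used, `ProviderError` is a hypothetical exception type, and the model identifiers are placeholders. Unlike the diagram, the sketch treats timeouts and errors the same way and simply moves to the next entry.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised by a provider client on timeout or failure."""


# Placeholder identifiers for the text-agent chain listed above.
TEXT_CHAIN = ["gpt-4o-mini", "claude-3.5-haiku", "gemini-2.0-flash"]
GRACEFUL_ERROR = "Sorry, I can't answer right now. Please try again shortly."


def complete_with_fallback(
    prompt: str,
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
) -> str:
    """Try each model in order; return the first successful response."""
    for model in chain:
        try:
            return call_model(model, prompt)
        except ProviderError:
            continue  # this model failed: fall through to the next one
    # Every model in the chain failed.
    return GRACEFUL_ERROR
```

Keeping the chain as plain data means the order can differ per feature without changing the loop.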
Available Models¶
OpenAI Models¶
OpenAI provides the primary family of models used across the platform for text and voice interactions.
| Model | Parameters | Context Window | Best For | Latency | Relative Cost |
|---|---|---|---|---|---|
| GPT-4o | ~200B (est.) | 128K tokens | Complex reasoning, nuanced conversations, long-document analysis | Medium | High |
| GPT-4o-mini | ~8B (est.) | 128K tokens | General-purpose text agents, FAQs, lead qualification, most voice interactions | Low | Low |
| GPT-4o-mini (Voice) | ~8B (est.) | 128K tokens | Optimized for real-time voice conversations with tool calling | Very Low | Low |
| text-embedding-3-small | N/A | 8,191 tokens | Knowledge base embeddings, semantic search | Very Low | Very Low |
| text-embedding-3-large | N/A | 8,191 tokens | High-precision embeddings for critical knowledge bases | Very Low | Low |
GPT-4o¶
The flagship OpenAI model, used for tasks requiring deep reasoning and nuanced understanding:
- Use Cases: Complex customer inquiries requiring multi-step reasoning, document summarization, contract analysis, nuanced sales conversations
- Strengths: Excellent at following complex instructions, strong at structured output (JSON), multilingual capability, vision support
- Limitations: Higher latency and cost compared to mini variants
- Default Temperature: 0.7
GPT-4o-mini¶
The workhorse model that handles the majority of agent interactions:
- Use Cases: Standard text conversations, WhatsApp messaging, lead qualification, FAQ answering, appointment scheduling, most voice interactions
- Strengths: Very fast response times, excellent cost-efficiency, 128K context window, strong tool-calling capability
- Limitations: Less capable than GPT-4o on very complex reasoning tasks
- Default Temperature: 0.7
text-embedding-3-small¶
The default embedding model for the Knowledge Base (RAG) system:
- Dimensions: 1,536
- Use Cases: Converting knowledge base documents into vector embeddings for semantic search
- Strengths: Excellent balance of quality and speed, cost-effective for large document collections
- Limitations: Lower precision than the large variant for highly specialized domains
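To illustrate how embeddings support semantic search, the sketch below ranks documents by cosine similarity to a query vector. It is a toy example: the vectors have 3 dimensions rather than 1,536, the values are made up, and the platform's actual retrieval pipeline is not described in this document.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], documents: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up embeddings standing in for real model output.
docs = {
    "business_hours": [1.0, 0.0, 0.0],
    "pricing": [0.0, 1.0, 0.0],
}
best = top_k([0.9, 0.1, 0.0], docs)  # the query points mostly along "business_hours"
```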
Anthropic Claude Models¶
Anthropic’s Claude models are used for tasks requiring strong safety guardrails and careful instruction following.
| Model | Context Window | Best For | Latency | Relative Cost |
|---|---|---|---|---|
| Claude 3.5 Sonnet | 200K tokens | Complex analysis, long-document processing, careful reasoning | Medium | High |
| Claude 3.5 Haiku | 200K tokens | Fast general-purpose tasks, fallback for GPT-4o-mini | Low | Medium |
Claude 3.5 Sonnet¶
Used selectively for high-stakes or complex tasks:
- Use Cases: Legal document analysis, compliance-related conversations, complex multi-step workflows, tasks requiring very careful instruction following
- Strengths: Excellent safety characteristics, strong at nuanced reasoning, 200K context window for processing long documents, very good at structured output
- Limitations: Higher cost, slightly slower than GPT-4o-mini for simple tasks
- Default Temperature: 0.7
Claude 3.5 Haiku¶
Serves as both a primary model and a critical fallback:
- Use Cases: General-purpose conversations, fallback when OpenAI services are degraded, tasks requiring strong safety properties
- Strengths: Very fast, good quality, strong safety guardrails, large context window
- Limitations: Less capable than Sonnet on complex reasoning
- Default Temperature: 0.7
Google Gemini Models¶
Google’s Gemini models provide additional diversity and are particularly strong in multimodal tasks.
| Model | Context Window | Best For | Latency | Relative Cost |
|---|---|---|---|---|
| Gemini 2.0 Flash | 1M tokens | Very long context tasks, multilingual conversations | Low | Medium |
| Gemini 2.0 Flash-Lite | 1M tokens | High-volume simple tasks, cost-sensitive operations | Very Low | Very Low |
Gemini 2.0 Flash¶
Selected for tasks requiring extremely large context windows:
- Use Cases: Processing very long documents, multilingual conversations where language detection is important, tasks requiring broad world knowledge
- Strengths: Massive 1M token context window, strong multilingual capability, fast inference, good at code understanding
- Limitations: Less precise than GPT-4o on complex structured output
- Default Temperature: 0.7
Gemini 2.0 Flash-Lite¶
Used for high-volume, cost-sensitive operations:
- Use Cases: Bulk message processing, simple FAQ responses, high-traffic campaigns where cost efficiency is critical
- Strengths: Extremely fast, very low cost, adequate quality for straightforward tasks
- Limitations: Not suitable for complex reasoning or nuanced conversations
- Default Temperature: 0.7
Meta Llama Models¶
Open-source Llama models provide an additional layer of redundancy and are used for specific tasks.
| Model | Parameters | Context Window | Best For | Latency | Relative Cost |
|---|---|---|---|---|---|
| Llama 3.3 70B | 70B | 128K tokens | General-purpose tasks, open-source preference | Medium | Low |
| Llama 3.1 8B | 8B | 128K tokens | Simple tasks, high-volume processing | Very Low | Very Low |
Llama 3.3 70B¶
Used as both a primary and fallback model:
- Use Cases: General conversations, tasks where open-source model preference is required, fallback for proprietary models
- Strengths: Strong open-source model, good multilingual support (especially Spanish and Portuguese), no vendor lock-in
- Limitations: Slightly less capable than GPT-4o on complex structured tasks
- Default Temperature: 0.7
Groq LPU Models¶
Groq’s Language Processing Units (LPUs) deliver the fastest inference speeds available, making them ideal for latency-sensitive tasks.
| Model | Best For | Latency | Relative Cost |
|---|---|---|---|
| Groq Compound Model | Website analysis, multi-step research tasks | Low | Medium |
| Llama 3.3 70B (via Groq) | Ultra-fast text generation, real-time applications | Very Low | Low |
Groq Compound Model¶
A specialized compound AI system used exclusively for website analysis and prompt generation:
- Use Cases: Website scraping and analysis, automatic system prompt generation, multi-step research tasks that combine web search, code interpretation, and text generation
- Components:
  - Web Search: Searches the web for additional business context
  - Code Interpreter: Processes and structures scraped data
  - LLM: Generates the final system prompt
- Strengths: Compound reasoning across multiple tools, extremely fast execution, purpose-built for the website analysis workflow
- Limitations: Specialized use case; not available for general agent conversations
Llama 3.3 70B via Groq¶
The same Llama model running on Groq’s custom hardware for ultra-low latency:
- Use Cases: Tasks where response speed is the top priority, real-time applications, fallback when other providers are slow
- Strengths: Fastest available inference (often < 100ms time-to-first-token), cost-effective
- Limitations: Same capability limitations as the base Llama model
- Default Temperature: 0.7
Model Usage by Feature¶
The following table shows which models are used for each platform feature:
| Feature | Primary Model | Fallback Model(s) | Reason |
|---|---|---|---|
| Text Agents (WhatsApp) | GPT-4o-mini | Claude 3.5 Haiku, Gemini 2.0 Flash | Best balance of quality, speed, and cost for conversational text |
| Voice Agents (Phone) | GPT-4o-mini (Voice) | Llama 3.3 70B via Groq | Ultra-low latency required for natural voice conversations |
| Complex Reasoning | GPT-4o | Claude 3.5 Sonnet | Deep reasoning for complex multi-step tasks |
| Website Analysis | Groq Compound Model | N/A (specialized) | Purpose-built compound AI for web analysis |
| Knowledge Base Embeddings | text-embedding-3-small | text-embedding-3-large | High-quality embeddings at scale |
| Knowledge Base Retrieval | GPT-4o-mini | Claude 3.5 Haiku | Fast synthesis of retrieved context into responses |
| Workflow Automation | GPT-4o-mini | Gemini 2.0 Flash-Lite | Cost-effective processing of structured workflows |
| Campaign Scripts | GPT-4o-mini | Llama 3.3 70B | Good quality for outbound call scripts |
Model Configuration¶
Default Settings¶
Each agent type has optimized default model settings:
| Setting | Text Agents | Voice Agents | Custom Agents |
|---|---|---|---|
| Model | GPT-4o-mini | GPT-4o-mini (Voice) | Configurable |
| Temperature | 0.7 | 0.7 | Configurable |
| Max Tokens | 1,024 | 256 | Configurable |
| Top-P | 1.0 | 1.0 | Configurable |
| Frequency Penalty | 0.0 | 0.0 | Configurable |
| Presence Penalty | 0.0 | 0.0 | Configurable |
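One way to represent these defaults in client code is an immutable settings object that custom agents override field by field. The class and constant names below are hypothetical, and the model identifiers are placeholders; only the numeric values come from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelSettings:
    model: str
    temperature: float = 0.7
    max_tokens: int = 1024
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0


# Defaults from the table; voice agents differ only in model and max tokens.
TEXT_AGENT_DEFAULTS = ModelSettings(model="gpt-4o-mini")
VOICE_AGENT_DEFAULTS = ModelSettings(model="gpt-4o-mini-voice", max_tokens=256)

# A custom agent starts from a default and overrides selected fields.
compliance_agent = replace(TEXT_AGENT_DEFAULTS, model="claude-3.5-sonnet", temperature=0.2)
```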
Temperature Guide¶
The temperature parameter controls the randomness of the model’s output:
| Temperature | Behavior | Best For |
|---|---|---|
| 0.0 - 0.2 | Deterministic, consistent responses | Factual Q&A, data lookup, compliance responses |
| 0.3 - 0.5 | Mostly consistent with some variation | Customer support, appointment scheduling, collections |
| 0.5 - 0.7 | Balanced creativity and consistency | Sales conversations, marketing content, general chat |
| 0.7 - 0.9 | Creative and varied responses | Content generation, brainstorming, creative marketing |
| 0.9 - 1.0 | Highly creative, unpredictable | Creative writing, ideation (not recommended for customer-facing) |
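The effect of temperature can be seen in the standard formulation, where token logits are divided by the temperature before the softmax. The sketch below uses made-up logits for three candidate tokens; how each provider implements sampling internally is not covered by this document.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        # A temperature of 0 is usually treated as greedy decoding instead.
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)   # mass concentrates on the top token
high = softmax_with_temperature(logits, 1.0)  # mass is spread more evenly
```

At low temperature nearly all probability goes to the highest-scoring token, which is why the low end of the table suits factual and compliance responses.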
Context Window Management¶
Context window size varies by model. For each request, the platform divides the available window among the following components:
graph LR
subgraph Context["Context Window (128K tokens)"]
SP[System Prompt<br/>~500-2,000 tokens]
KC[Knowledge Base Context<br/>~1,500-3,000 tokens]
CH[Conversation History<br/>~2,000-20,000 tokens]
UM[User Message<br/>~50-500 tokens]
RS[Response Space<br/>~256-1,024 tokens]
end
Context allocation strategy:
- System Prompt — Fixed allocation based on agent type (500-2,000 tokens)
- Knowledge Base Context — Up to 3,000 tokens from RAG retrieval (configurable)
- Conversation History — Dynamically sized; older messages are truncated as the window fills
- User Message — The current incoming message
- Response Space — Reserved for the model’s output (256 tokens for voice, up to 1,024 for text)
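As a worked example of this allocation, the sketch below computes how many tokens remain for conversation history once the other components are reserved. It assumes "128K" means 128,000 tokens and uses the upper bounds listed above; the function name is hypothetical.

```python
def history_budget(
    context_window: int,
    system_prompt: int,
    kb_context: int,
    user_message: int,
    response_space: int,
) -> int:
    """Tokens left for conversation history after the fixed allocations."""
    reserved = system_prompt + kb_context + user_message + response_space
    remaining = context_window - reserved
    if remaining < 0:
        raise ValueError("fixed allocations exceed the context window")
    return remaining


# Text agent, upper bounds: 2,000 + 3,000 + 500 + 1,024 = 6,524 tokens reserved.
text_budget = history_budget(128_000, 2_000, 3_000, 500, 1_024)  # 121,476 tokens
```

Even at the upper bounds, the fixed components use only about 5% of a 128K window, so the history limits in the next section are the practical constraint for most conversations.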
Conversation Memory¶
The platform maintains conversation history within the context window:
| Setting | Default | Description |
|---|---|---|
| Max History Messages | 20 | Maximum number of previous messages included in context |
| Max History Tokens | 10,000 | Maximum total tokens allocated to conversation history |
| Truncation Strategy | Sliding Window | Oldest messages are dropped first when limits are reached |
| Summary Injection | Disabled | When enabled, older messages are summarized instead of dropped |
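A sliding-window truncation with both limits can be sketched as follows. This is an illustration of the strategy described in the table, not the platform's code: `count_tokens` is a caller-supplied function (a real tokenizer in practice), and the default limits are the table's defaults.

```python
from typing import Callable, List


def truncate_history(
    messages: List[str],
    count_tokens: Callable[[str], int],
    max_messages: int = 20,
    max_tokens: int = 10_000,
) -> List[str]:
    """Keep the newest messages that fit within both limits, in original order."""
    kept: List[str] = []
    total = 0
    # Walk from newest to oldest so the oldest messages are dropped first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if len(kept) >= max_messages or total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


# Crude stand-in for a tokenizer: one token per whitespace-separated word.
history = [f"message {i}" for i in range(30)]
window = truncate_history(history, lambda text: len(text.split()))
```

With 30 short messages, the message-count limit applies first and the window holds the 20 most recent entries.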
Cost Comparison¶
Per-Token Pricing (Approximate)¶
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Premium model for complex tasks |
| GPT-4o-mini | $0.15 | $0.60 | Primary workhorse model |
| Claude 3.5 Sonnet | $3.00 | $15.00 | High-quality reasoning |
| Claude 3.5 Haiku | $0.80 | $4.00 | Fast and capable fallback |
| Gemini 2.0 Flash | $0.10 | $0.40 | Cost-effective with large context |
| Gemini 2.0 Flash-Lite | $0.05 | $0.20 | Ultra-low cost for simple tasks |
| Llama 3.3 70B | $0.20 | $0.20 | Open-source, competitive pricing |
| text-embedding-3-small | $0.02 | N/A | Embeddings (input only) |
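The per-request cost follows directly from the table: multiply input and output token counts by their per-million prices. The sketch below uses the approximate prices listed above with placeholder model identifiers; the token counts in the example are illustrative.

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3.5-haiku": (0.80, 4.00),
    "gemini-2.0-flash": (0.10, 0.40),
    "gemini-2.0-flash-lite": (0.05, 0.20),
    "llama-3.3-70b": (0.20, 0.20),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of one request at the listed prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Same request (1,500 input tokens, 300 output tokens) on two models.
mini_cost = request_cost("gpt-4o-mini", 1_500, 300)
flagship_cost = request_cost("gpt-4o", 1_500, 300)
```

At these prices the same request costs roughly 17 times more on GPT-4o than on GPT-4o-mini, which is why the Gateway prefers the smaller model when quality requirements are met.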
How This Maps to Credits¶
4Geeks abstracts raw token pricing into a simplified credit system:
| Interaction Type | Credits Consumed | Approximate Token Equivalent |
|---|---|---|
| Text Message | ~40 credits | ~1,000-2,000 total tokens (input + output) |
| Voice Minute | ~400 credits | Includes STT, LLM, TTS, and telephony costs |
| Knowledge Base Query | Included in message cost | Retrieval is bundled with the text message |
| Website Analysis | ~200 credits | One-time cost per analysis |
| Embedding (per document) | ~10 credits per 100 pages | One-time cost during document processing |
Note
Credit figures are estimates. Actual consumption varies with the model selected, conversation complexity, knowledge base retrieval depth, and the number of tool calls executed during an interaction.
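For rough planning, the approximate figures in the table can be combined into a monthly estimate. The sketch below is arithmetic only; the function name and the usage numbers are made up for illustration, and document embedding is left out.

```python
# Approximate credits per unit, from the table above.
CREDITS_PER_TEXT_MESSAGE = 40
CREDITS_PER_VOICE_MINUTE = 400
CREDITS_PER_WEBSITE_ANALYSIS = 200


def estimate_credits(text_messages: int = 0, voice_minutes: int = 0, website_analyses: int = 0) -> int:
    """Rough credit estimate; real consumption varies per interaction."""
    return (
        text_messages * CREDITS_PER_TEXT_MESSAGE
        + voice_minutes * CREDITS_PER_VOICE_MINUTE
        + website_analyses * CREDITS_PER_WEBSITE_ANALYSIS
    )


# 5,000 messages + 300 voice minutes + 2 analyses
# = 200,000 + 120,000 + 400 = 320,400 credits
monthly = estimate_credits(text_messages=5_000, voice_minutes=300, website_analyses=2)
```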
Model Selection Best Practices¶
For Agent Creators¶
When configuring a custom agent, consider these guidelines:
| Scenario | Recommended Model | Why |
|---|---|---|
| High-volume WhatsApp support | GPT-4o-mini (default) | Best quality-to-cost ratio for conversational text |
| Premium customer experience | GPT-4o | Superior reasoning for complex, high-value interactions |
| Voice agent (phone) | GPT-4o-mini (Voice) | Optimized for low-latency voice conversations |
| Multilingual (Spanish/Portuguese) | GPT-4o-mini or Llama 3.3 70B | Both have strong multilingual capabilities |
| Long document analysis | Claude 3.5 Sonnet or Gemini 2.0 Flash | Large context windows (200K-1M tokens) |
| Cost-sensitive campaigns | Gemini 2.0 Flash-Lite | Lowest cost for simple, repetitive tasks |
| Maximum speed | Llama 3.3 70B via Groq | Fastest inference available |
For Developers¶
When building custom integrations via the API:
{
  "model": "gpt-4o-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "What are your business hours?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "check_availability",
        "description": "Check business availability"
      }
    }
  ]
}
The API is fully compatible with the OpenAI API format, making it easy to switch between models without changing your integration code.
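Because the request format is OpenAI-compatible, switching models should only require changing the `model` string. The sketch below builds such a request with the Python standard library. The host `gateway.example.com` is a placeholder, and the `/v1/chat/completions` path and Bearer-token header follow the OpenAI convention; this document does not give the Gateway's actual endpoint or authentication scheme, so treat all three as assumptions.

```python
import json
import urllib.request

# Placeholder endpoint: the real Gateway host and path are not given here.
GATEWAY_URL = "https://gateway.example.com/v1/chat/completions"


def build_chat_request(model: str, user_message: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant..."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Only the model string differs between these two requests.
primary = build_chat_request("gpt-4o-mini", "What are your business hours?", "YOUR_API_KEY")
alternate = build_chat_request("claude-3.5-haiku", "What are your business hours?", "YOUR_API_KEY")
```

Passing either object to `urllib.request.urlopen` would send it; the sketch stops short of that because the endpoint is a placeholder.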
Model Updates & Versioning¶
The 4Geeks platform continuously evaluates and updates its model lineup:
- Seamless Upgrades — When a new model version is released (e.g., GPT-4o-mini-2024-07-18 → GPT-4o-mini), the Gateway automatically routes to the latest version
- Backward Compatibility — The OpenAI-compatible API format ensures that model upgrades do not break existing integrations
- A/B Testing — New models are tested against current models using A/B testing before full rollout
- Regression Testing — Every model update is validated against a benchmark suite of agent interactions to ensure quality does not decrease
- Changelog — Significant model changes are documented in the platform changelog
What’s Next¶
- Cloud & LLM Architecture — Full technical architecture of the platform
- The Human Team — How human experts orchestrate and improve AI agents
- Knowledge Base (RAG) — Configure domain-specific knowledge for your agents
- Pricing & Credits Model — Understand how credit consumption maps to model usage