# Multi-Provider Setup
This guide explains how to configure and use multiple LLM providers with the AI Assistant System.
## Overview
The AI Assistant System supports multiple LLM providers, allowing you to:

- Use different models for different tasks
- Implement failover between providers
- Load balance requests across providers
- Optimize for cost and performance
## Supported Providers
- OpenAI
- Anthropic Claude
- Ollama (local models)
- Azure OpenAI
- Custom OpenAI-compatible endpoints
## Configuration

### Environment Variables
Configure multiple providers using environment variables:
```bash
# Primary provider
PRIMARY_PROVIDER=openai
OPENAI_API_KEY=your_openai_key
OPENAI_BASE_URL=https://api.openai.com/v1

# Secondary provider
SECONDARY_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_anthropic_key

# Local provider
LOCAL_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434

# Azure OpenAI
AZURE_OPENAI_API_KEY=your_azure_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_VERSION=2023-12-01-preview

# Custom provider
CUSTOM_PROVIDER_NAME=myprovider
CUSTOM_PROVIDER_API_KEY=your_custom_key
CUSTOM_PROVIDER_BASE_URL=https://api.custom-provider.com/v1
```
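The system reads these variables from the environment at startup. If you want to inspect them yourself, a minimal sketch (the `provider_settings` helper below is illustrative, not part of the system API):

```python
import os

def provider_settings() -> dict:
    # Read the provider variables defined above; unset optional values fall back to defaults.
    return {
        "primary": os.environ.get("PRIMARY_PROVIDER", "openai"),
        "secondary": os.environ.get("SECONDARY_PROVIDER"),
        "openai_api_key": os.environ.get("OPENAI_API_KEY"),
        "openai_base_url": os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "ollama_base_url": os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    }

print(provider_settings()["primary"])
```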
### Provider Configuration File

Create a `providers.yaml` configuration file:
```yaml
providers:
  openai:
    type: openai
    api_key: ${OPENAI_API_KEY}
    base_url: ${OPENAI_BASE_URL}
    models:
      - gpt-4
      - gpt-3.5-turbo
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 90000
    retry:
      max_attempts: 3
      backoff_factor: 2

  anthropic:
    type: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    models:
      - claude-3-opus-20240229
      - claude-3-sonnet-20240229
    rate_limit:
      requests_per_minute: 50
      tokens_per_minute: 100000
    retry:
      max_attempts: 3
      backoff_factor: 2

  ollama:
    type: ollama
    base_url: ${OLLAMA_BASE_URL}
    models:
      - llama2
      - codellama
      - mistral
    rate_limit:
      requests_per_minute: 30
    retry:
      max_attempts: 2
      backoff_factor: 1.5

  azure_openai:
    type: azure_openai
    api_key: ${AZURE_OPENAI_API_KEY}
    endpoint: ${AZURE_OPENAI_ENDPOINT}
    api_version: ${AZURE_OPENAI_API_VERSION}
    models:
      - gpt-4
      - gpt-35-turbo
    rate_limit:
      requests_per_minute: 120
      tokens_per_minute: 120000
    retry:
      max_attempts: 3
      backoff_factor: 2

strategies:
  default:
    provider: openai
    model: gpt-4
    fallback_providers:
      - anthropic
      - ollama

  code_generation:
    provider: anthropic
    model: claude-3-opus-20240229
    fallback_providers:
      - openai

  local_development:
    provider: ollama
    model: llama2
    fallback_providers:
      - openai

  cost_optimized:
    provider: openai
    model: gpt-3.5-turbo
    fallback_providers:
      - ollama
```
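The `${VAR}` placeholders reference the environment variables defined above. If you need to load and expand the file yourself, a minimal sketch (assuming PyYAML is installed; `load_providers_config` is a hypothetical helper, not part of the system API):

```python
import os
import re

import yaml

_ENV_VAR = re.compile(r"\$\{([^}]+)\}")

def load_providers_config(path: str = "providers.yaml") -> dict:
    """Read providers.yaml and substitute ${VAR} placeholders from the environment."""
    with open(path) as f:
        raw = f.read()
    expanded = _ENV_VAR.sub(lambda m: os.environ.get(m.group(1), ""), raw)
    return yaml.safe_load(expanded)

config = load_providers_config()
print(config["strategies"]["default"]["provider"])  # -> "openai"
```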
## Provider Selection Strategies

### Round Robin
Distribute requests evenly across providers:
```python
from app.core.llm_providers import ProviderManager, RoundRobinStrategy

provider_manager = ProviderManager()
strategy = RoundRobinStrategy(providers=["openai", "anthropic", "ollama"])
provider_manager.set_strategy(strategy)
```
### Weighted Round Robin
Distribute requests based on weights:
```python
from app.core.llm_providers import WeightedRoundRobinStrategy

strategy = WeightedRoundRobinStrategy(
    providers={
        "openai": 0.5,
        "anthropic": 0.3,
        "ollama": 0.2,
    }
)
provider_manager.set_strategy(strategy)
```
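The weights are relative: with the values above, roughly half of all requests go to OpenAI. A standalone sketch of the underlying idea (not the system's implementation):

```python
import random

weights = {"openai": 0.5, "anthropic": 0.3, "ollama": 0.2}

def pick_provider() -> str:
    # Choose a provider with probability proportional to its weight.
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

print(pick_provider())
```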
### Least Cost

Select the cheapest available provider:
```python
from app.core.llm_providers import LeastCostStrategy

strategy = LeastCostStrategy()
provider_manager.set_strategy(strategy)
```
### Performance Based

Select the provider with the lowest observed response time:
```python
from app.core.llm_providers import PerformanceBasedStrategy

strategy = PerformanceBasedStrategy()
provider_manager.set_strategy(strategy)
```
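Conceptually, this strategy keeps a running record of response times per provider and routes new requests to the fastest one. A standalone sketch of that idea (not the system's implementation):

```python
latencies: dict[str, list[float]] = {"openai": [], "anthropic": [], "ollama": []}

def record_latency(provider: str, seconds: float) -> None:
    # Record an observed response time for a provider.
    latencies[provider].append(seconds)

def fastest_provider() -> str:
    # Route to the provider with the lowest average observed latency;
    # providers with no samples yet are treated as fastest so they get measured.
    def avg(p: str) -> float:
        samples = latencies[p]
        return sum(samples) / len(samples) if samples else 0.0
    return min(latencies, key=avg)
```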
## Failover Configuration

### Automatic Failover
Configure automatic failover when a provider fails:
```python
from app.core.llm_providers import FailoverConfig

failover_config = FailoverConfig(
    primary_provider="openai",
    fallback_providers=["anthropic", "ollama"],
    health_check_interval=60,  # seconds
    max_failures=3,
    recovery_timeout=300,  # seconds
)
provider_manager.configure_failover(failover_config)
```
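With this in place, a request that fails on the primary provider is retried against each fallback in order. The pattern is roughly the following (a simplified sketch, not the system's internal code; `complete` stands in for whatever request function you use):

```python
async def complete_with_failover(prompt: str) -> str:
    providers = ["openai", "anthropic", "ollama"]  # primary first, then fallbacks
    last_error: Exception | None = None
    for name in providers:
        try:
            return await complete(provider=name, prompt=prompt)
        except Exception as exc:  # in practice, catch provider-specific errors
            last_error = exc
    raise RuntimeError("All providers failed") from last_error
```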
### Health Checks
Implement health checks for providers:
```python
from openai import AsyncOpenAI

from app.core.llm_providers import HealthChecker

openai_client = AsyncOpenAI()
health_checker = HealthChecker()

@health_checker.register("openai")
async def check_openai_health() -> bool:
    try:
        # Minimal one-token request to confirm the provider is reachable
        await openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "test"}],
            max_tokens=1,
        )
        return True
    except Exception:
        return False

# Schedule health checks
health_checker.schedule_checks(interval=60)
```
## Load Balancing

### Request Load Balancing
Balance requests across multiple provider instances:
```python
from app.core.llm_providers import LoadBalancer

load_balancer = LoadBalancer(
    providers=["openai", "anthropic"],
    algorithm="least_connections",
    health_checker=health_checker,
)

# Use the load balancer for requests
response = await load_balancer.complete(
    prompt="Explain quantum computing",
    max_tokens=1000,
)
```
### Token Load Balancing

Balance requests based on each provider's token budget:
```python
from app.core.llm_providers import TokenLoadBalancer

token_balancer = TokenLoadBalancer(
    providers={
        "openai": {"max_tokens": 100000, "current_usage": 0},
        "anthropic": {"max_tokens": 80000, "current_usage": 0},
        "ollama": {"max_tokens": 50000, "current_usage": 0},
    }
)
```
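One way a token-based balancer can work is to route each request to the provider with the most remaining budget and record usage as responses come back. A standalone sketch of that idea (not necessarily the system's implementation):

```python
budgets = {
    "openai": {"max_tokens": 100000, "current_usage": 0},
    "anthropic": {"max_tokens": 80000, "current_usage": 0},
    "ollama": {"max_tokens": 50000, "current_usage": 0},
}

def pick_by_remaining_budget() -> str:
    # Route to the provider with the largest remaining token budget.
    return max(budgets, key=lambda p: budgets[p]["max_tokens"] - budgets[p]["current_usage"])

def record_usage(provider: str, tokens: int) -> None:
    budgets[provider]["current_usage"] += tokens
```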
## Cost Management

### Cost Tracking
Track costs across providers:
```python
from app.core.llm_providers import CostTracker

cost_tracker = CostTracker()

@cost_tracker.track
async def generate_response(prompt, provider, model):
    response = await provider.complete(prompt=prompt, model=model)
    cost_tracker.add_cost(provider, model, response.usage)
    return response

# Get a cost report
cost_report = cost_tracker.get_report()
print(f"Total cost: ${cost_report.total_cost}")
print(f"Cost by provider: {cost_report.by_provider}")
```
### Budget Alerts
Set budget alerts:
```python
from app.core.llm_providers import BudgetAlert

budget_alert = BudgetAlert(
    daily_budget=10.0,     # $10 per day
    monthly_budget=300.0,  # $300 per month
    alert_threshold=0.8,   # Alert at 80% of budget
)

@budget_alert.check_budget
async def check_costs():
    current_cost = cost_tracker.get_daily_cost()
    if current_cost > budget_alert.daily_threshold:
        await send_alert(f"Daily budget threshold reached: ${current_cost}")
```
## Monitoring Multi-Provider Setup

### Provider Metrics
Monitor metrics for each provider:
```python
from prometheus_client import Counter, Histogram

PROVIDER_REQUESTS = Counter(
    'provider_requests_total',
    'Total requests by provider',
    ['provider', 'status'],
)

PROVIDER_RESPONSE_TIME = Histogram(
    'provider_response_time_seconds',
    'Response time by provider',
    ['provider'],
)

PROVIDER_COST = Counter(
    'provider_cost_total',
    'Total cost by provider',
    ['provider'],
)
```
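These metrics can then be updated around each provider call, for example (a sketch; `call_provider` is a placeholder for your actual request function):

```python
import time

async def instrumented_call(provider_name: str, prompt: str):
    start = time.perf_counter()
    status = "success"
    try:
        return await call_provider(provider_name, prompt)  # placeholder request function
    except Exception:
        status = "error"
        raise
    finally:
        PROVIDER_REQUESTS.labels(provider=provider_name, status=status).inc()
        PROVIDER_RESPONSE_TIME.labels(provider=provider_name).observe(time.perf_counter() - start)
```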
### Dashboard

Create a Grafana dashboard to monitor:

- Request distribution across providers
- Response times by provider
- Error rates by provider
- Cost tracking by provider
- Provider health status
## Best Practices
- Test Failover: Regularly test failover mechanisms
- Monitor Costs: Keep track of costs across providers
- Set Limits: Configure appropriate rate limits
- Health Checks: Implement regular health checks
- Document Setup: Document your multi-provider configuration
- Start Small: Begin with a simple setup and expand as needed
- Review Performance: Regularly review provider performance
## Troubleshooting

### Common Issues
- Provider Unavailable: Check health checks and failover configuration
- High Latency: Consider using faster providers or load balancing
- Cost Overruns: Monitor costs and set budget alerts
- Rate Limits: Configure appropriate rate limits and retry strategies (see the backoff sketch below)
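A minimal retry-with-backoff sketch that mirrors the `max_attempts` and `backoff_factor` settings from `providers.yaml` (generic code, not the system's built-in retry logic):

```python
import asyncio

async def call_with_retry(fn, *, max_attempts: int = 3, backoff_factor: float = 2.0):
    # Retry an async provider call, sleeping backoff_factor ** attempt seconds between attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(backoff_factor ** attempt)
```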
### Debug Mode
Enable debug mode for detailed logging:
```python
import logging

logging.getLogger("app.core.llm_providers").setLevel(logging.DEBUG)
```