System Architecture Overview¶
This document describes the high-level architecture of the AI Assistant project, including its components, data flow, and design principles.
Architecture Diagram¶
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Client Apps   │◄──►│   FastAPI API    │◄──►│  Agent System   │
│   (OpenWebUI,   │    │ (OpenAI-compat)  │    │   (LangChain)   │
│  Custom Apps)   │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         │                      │                       │
         ▼                      ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Monitoring    │    │   Tool System    │◄──►│ Provider Layer  │
│  (Prometheus,   │    │   (Extensible)   │    │(Multi-Provider) │
│    Grafana)     │    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         │                      │                       │
         ▼                      ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Caching Layer  │    │  Storage Layer   │    │  External APIs  │
│  (Multi-layer)  │    │ (Redis, Session) │    │  (OpenRouter,   │
│                 │    │                  │    │  OpenAI, etc.)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```
Core Components¶
1. API Layer (FastAPI)¶
- Purpose: Provide OpenAI-compatible interface
- Technology: FastAPI with Pydantic models
- Features:
    - Full OpenAI API compatibility
    - Streaming and non-streaming responses
    - Comprehensive error handling
    - OpenAPI documentation
    - CORS support
    - Request validation and sanitization
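To make the request/response shape concrete, here is a minimal, illustrative sketch of an OpenAI-compatible chat completions endpoint built with FastAPI and Pydantic. The field names follow the OpenAI spec; the handler body is a placeholder, not the project's actual implementation:

```python
# Illustrative sketch only: minimal OpenAI-compatible endpoint shape.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ChatMessage(BaseModel):
    role: str       # "system" | "user" | "assistant" | "tool"
    content: str


class ChatCompletionRequest(BaseModel):
    model: str
    messages: list[ChatMessage]
    stream: bool = False
    temperature: float | None = None


@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest) -> dict:
    # A real handler dispatches to the agent system and formats the result as an
    # OpenAI chat.completion object (or streams chunks when request.stream is True).
    return {
        "object": "chat.completion",
        "model": request.model,
        "choices": [{"index": 0, "message": {"role": "assistant", "content": "..."}}],
    }
```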
2. Agent System (LangChain)¶
- Purpose: Orchestrate LLM interactions and intelligent tool calling
- Technology: LangChain with custom agents
- Features:
    - Multi-provider model support
    - Advanced tool calling capabilities
    - Context-aware tool selection
    - Conversation memory management
    - Response streaming
    - Fallback and error recovery
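As a rough illustration of how such an agent can be wired up, the sketch below uses LangChain's tool-calling agent helpers. Module paths vary by LangChain version, and the tools list is assumed to come from the project's tool registry:

```python
# Rough sketch of agent construction (LangChain APIs vary across versions).
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


def build_agent(tools: list) -> AgentExecutor:
    # Any tool-calling chat model works here; ChatOpenAI is just one example.
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant with access to tools."),
        ("placeholder", "{chat_history}"),      # conversation memory
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),  # intermediate tool calls
    ])
    agent = create_tool_calling_agent(llm, tools, prompt)
    return AgentExecutor(agent=agent, tools=tools, handle_parsing_errors=True)
```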
3. Tool System (Extensible)¶
- Purpose: Extend AI capabilities with specialized tools
- Technology: Modular tool architecture with registry
- Built-in Tools:
    - Calculator (mathematical operations)
    - Time Tool (current time and date functions)
    - SearXNG Search (privacy-focused web search)
    - Echo Tool (testing and debugging)
    - Custom tool development framework
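A hypothetical custom tool might look like the sketch below, here written with LangChain's @tool decorator; the project's registry may wrap tools differently:

```python
# Hypothetical custom tool sketch; names and behavior are illustrative only.
from langchain_core.tools import tool


@tool
def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression, e.g. '2 * (3 + 4)'."""
    # eval() is used only for illustration; a production tool should use a
    # safe expression parser instead.
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        return "Error: unsupported characters in expression"
    return str(eval(expression))
```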
4. Provider Layer¶
- Purpose: Abstract multiple LLM providers behind unified interface
- Technology: Provider abstraction with fallback mechanisms
- Supported Providers:
    - OpenAI (GPT models)
    - OpenRouter (multiple models)
    - Anthropic (Claude models)
    - Together AI (open-source models)
    - Ollama (local models)
    - Any OpenAI-compatible API
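The sketch below illustrates the general fallback idea against any OpenAI-compatible endpoint. The class and function names are assumptions for illustration, not the project's actual provider layer:

```python
# Hypothetical provider abstraction with priority-ordered fallback.
from dataclasses import dataclass

import httpx


@dataclass
class ProviderConfig:
    name: str
    base_url: str        # any OpenAI-compatible endpoint
    api_key: str
    default_model: str


async def call_provider(provider: ProviderConfig, prompt: str) -> str:
    """Send a chat completion request to one OpenAI-compatible provider."""
    async with httpx.AsyncClient(base_url=provider.base_url, timeout=30.0) as client:
        response = await client.post(
            "/chat/completions",
            headers={"Authorization": f"Bearer {provider.api_key}"},
            json={
                "model": provider.default_model,
                "messages": [{"role": "user", "content": prompt}],
            },
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]


async def complete_with_fallback(providers: list[ProviderConfig], prompt: str) -> str:
    """Try providers in priority order, falling back on any failure."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return await call_provider(provider, prompt)
        except (httpx.HTTPError, KeyError) as exc:  # network errors, bad responses
            last_error = exc
    raise RuntimeError(f"All providers failed: {last_error}")
```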
5. Caching Layer¶
- Purpose: Optimize performance and reduce API costs
- Technology: Multi-layer caching with compression
- Features:
    - In-memory caching (Redis)
    - Response compression
    - Intelligent cache invalidation
    - Cache statistics and monitoring
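A simplified sketch of the two-layer idea (process memory in front of Redis, with compressed payloads); class name, key handling, and TTLs are illustrative only:

```python
# Simplified two-layer response cache sketch (memory -> Redis) with zlib compression.
import zlib

import redis.asyncio as redis


class ResponseCache:
    def __init__(self, redis_url: str = "redis://localhost:6379/0", ttl: int = 300):
        self._memory: dict[str, bytes] = {}
        self._redis = redis.from_url(redis_url)
        self._ttl = ttl

    async def get(self, key: str) -> str | None:
        blob = self._memory.get(key)
        if blob is None:
            blob = await self._redis.get(key)  # fall back to the Redis layer
            if blob is not None:
                self._memory[key] = blob       # promote to the memory layer
        return zlib.decompress(blob).decode() if blob else None

    async def set(self, key: str, value: str) -> None:
        blob = zlib.compress(value.encode())
        self._memory[key] = blob
        await self._redis.set(key, blob, ex=self._ttl)
```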
6. Monitoring Layer¶
- Purpose: System observability and performance tracking
- Technology: Prometheus metrics with Grafana dashboards
- Features:
    - Real-time metrics collection
    - Custom dashboards
    - Health checks and alerts
    - Performance analytics
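For illustration, custom metrics can be exposed from FastAPI roughly as follows; the metric names here are examples, not the project's actual metric set:

```python
# Sketch of custom Prometheus metrics exposed from FastAPI.
import time

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

REQUESTS = Counter("assistant_requests_total", "Chat requests received", ["endpoint"])
LATENCY = Histogram("assistant_request_seconds", "Request latency in seconds")

app = FastAPI()


@app.middleware("http")
async def track_metrics(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUESTS.labels(endpoint=request.url.path).inc()
    LATENCY.observe(time.perf_counter() - start)
    return response


@app.get("/metrics")
def metrics() -> Response:
    # Scraped by Prometheus; visualized in Grafana dashboards.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```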
Data Flow¶
Standard Chat Flow¶
1. Request Reception: Client sends a chat request to /v1/chat/completions
2. Message Processing: Convert the OpenAI format to LangChain messages
3. Agent Execution: The LangChain agent processes the request with the available tools
4. Tool Execution: If needed, tools are called to gather information
5. Response Generation: The LLM generates a response based on the context
6. Response Formatting: Convert the LangChain response to OpenAI format
7. Streaming: Send response chunks back to the client
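The message-processing step (OpenAI format to LangChain messages) can be sketched roughly like this; the real converter also needs to handle tool and function messages:

```python
# Sketch of the OpenAI -> LangChain message conversion step (simplified).
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, SystemMessage

ROLE_MAP = {
    "system": SystemMessage,
    "user": HumanMessage,
    "assistant": AIMessage,
}


def to_langchain_messages(openai_messages: list[dict]) -> list[BaseMessage]:
    converted: list[BaseMessage] = []
    for msg in openai_messages:
        message_cls = ROLE_MAP.get(msg["role"], HumanMessage)
        converted.append(message_cls(content=msg.get("content", "")))
    return converted
```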
Tool Calling Flow¶
1. Tool Detection: The agent determines whether tools are needed
2. Tool Selection: Choose the appropriate tool based on the query
3. Tool Execution: Run the tool with its parameters
4. Result Integration: Combine tool results with the conversation context
5. Response Generation: Generate the final response with the tool insights
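The result-integration step amounts to folding the tool output back into the conversation before the final generation pass. A simplified illustration (not the project's exact code):

```python
# Sketch: append a tool's output to the conversation as a ToolMessage so the
# LLM can ground its final answer on it.
from langchain_core.messages import ToolMessage


def integrate_tool_result(messages: list, tool_call_id: str, tool_output: str) -> list:
    # The assistant's tool-call message is assumed to already be in `messages`.
    return messages + [ToolMessage(content=tool_output, tool_call_id=tool_call_id)]
```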
Design Principles¶
1. OpenAI Compatibility¶
- Full compliance with OpenAI API specification
- Support for both streaming and non-streaming responses
- Standard error codes and response formats
2. Extensibility¶
- Modular tool system for adding new capabilities
- Plugin architecture for custom integrations
- Configuration-driven behavior
3. Security First¶
- No hardcoded API keys or secrets
- Environment-based configuration
- Input validation and sanitization
- Regular security scanning
4. Performance¶
- Async/await for non-blocking operations
- Connection pooling for external APIs
- Caching strategies for frequent operations
- Efficient vector search algorithms
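For example, connection pooling with an async HTTP client might look like the following sketch (pool sizes and timeouts are illustrative defaults, not the project's settings):

```python
# Sketch of a shared async HTTP client with connection pooling (httpx).
import httpx

limits = httpx.Limits(max_connections=100, max_keepalive_connections=20)
timeout = httpx.Timeout(30.0, connect=5.0)

# Reusing one AsyncClient across requests keeps connections pooled instead of
# opening a new TCP/TLS session for every upstream API call.
shared_client = httpx.AsyncClient(limits=limits, timeout=timeout)
```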
Technology Stack¶
Backend¶
- Framework: FastAPI (Python 3.12)
- LLM Orchestration: LangChain with multi-provider support
- Caching: Redis with multi-layer caching and compression
- API Client: HTTPX for async requests
- Web Interface: Gradio for configuration and testing
- Search Integration: SearXNG for privacy-focused web search
Development Tools¶
- Package Manager: UV for fast dependency management
- Testing: pytest with comprehensive coverage
- Code Quality: ruff, black, mypy
- Security: bandit, pip-audit
- Documentation: MkDocs + Material theme
Infrastructure¶
- CI/CD: GitHub Actions with security scanning
- Containerization: Docker with docker-compose
- Monitoring: Prometheus with Grafana dashboards
- Reverse Proxy: Traefik for service routing
- Database: Redis for caching, PostgreSQL (optional)
Configuration Management¶
Environment-Based Configuration¶
```python
# app/core/config.py
from typing import Optional

from pydantic import SecretStr
from pydantic_settings import BaseSettings  # pydantic v2-style import


class Settings(BaseSettings):
    openrouter_api_key: Optional[SecretStr] = None
    openrouter_base_url: str = "https://openrouter.ai/api/v1"
    default_model: str = "anthropic/claude-3.5-sonnet"
    # ... other settings
```
Security Considerations¶
- API keys stored as SecretStr
- Environment variables for sensitive data
- Validation of all configuration values
- Secure defaults for production
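As a small illustration of the SecretStr behavior (standard Pydantic, not project-specific code):

```python
# SecretStr masks the value in str()/repr(), keeping keys out of logs.
from pydantic import SecretStr

key = SecretStr("sk-example")
print(key)                     # **********  (masked)
print(key.get_secret_value())  # actual value, only when explicitly requested
```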
Scalability Considerations¶
Horizontal Scaling¶
- Stateless API design
- External session storage (planned)
- Load balancer compatibility
Performance Optimization¶
- Connection pooling for database and API calls
- Caching layer for frequent queries
- Async processing for I/O operations
Monitoring and Observability¶
- Structured logging
- Performance metrics collection
- Health check endpoints
- Error tracking and alerting
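A health check endpoint can be as simple as the sketch below; the path and payload are assumptions for illustration:

```python
# Minimal health check endpoint sketch.
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
async def health() -> dict:
    # Extended versions can also report Redis and provider connectivity.
    return {"status": "ok"}
```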
Deployment Architecture¶
Development Environment¶
- Local execution with hot reload
- Mock external services for testing
- Detailed logging and debugging
Production Environment¶
- Containerized deployment
- Environment-specific configuration
- Health monitoring and auto-recovery
- Scalable infrastructure
Integration Patterns¶
External API Integration¶
- Async HTTP clients with retry logic
- Circuit breaker pattern for resilience
- Rate limiting and backoff strategies
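A retry wrapper with exponential backoff around an upstream call might be sketched like this (retry counts and delays are illustrative):

```python
# Sketch of exponential backoff around an upstream HTTP call (httpx).
import asyncio

import httpx


async def post_with_backoff(client: httpx.AsyncClient, url: str, payload: dict,
                            retries: int = 3, base_delay: float = 0.5) -> httpx.Response:
    for attempt in range(retries + 1):
        try:
            response = await client.post(url, json=payload)
            response.raise_for_status()
            return response
        except (httpx.TransportError, httpx.HTTPStatusError):
            if attempt == retries:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```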
Tool Integration¶
- Standardized tool interface
- Error handling and fallbacks
- Performance monitoring
Future Architecture Evolution¶
Phase 1: Core Stability¶
- [x] Basic OpenAI-compatible API
- [x] LangChain integration
- [ ] Tool system foundation
Phase 2: Advanced Features¶
- [ ] SearXNG web search integration
- [ ] RAG knowledge base
- [ ] Advanced tool capabilities
Phase 3: Production Ready¶
- [ ] Docker containerization
- [ ] Advanced monitoring
- [ ] High availability setup
Decision Log¶
Technology Choices¶
- FastAPI: Chosen for performance, async support, and automatic OpenAPI docs
- LangChain: Industry standard for LLM orchestration with extensive tooling
- PostgreSQL + pgvector: Robust, scalable vector database solution
- UV: Fast, modern Python package manager with excellent dependency resolution
Architecture Decisions¶
- OpenAI Compatibility: Ensures wide compatibility with existing tools
- Modular Tool System: Allows incremental feature development
- Async-First Design: Optimal for I/O-heavy LLM operations
- Security-First Approach: Protects sensitive API keys and user data
This architecture provides a solid foundation for building a powerful, extensible AI assistant while maintaining security, performance, and developer experience.