Inference
Inference is where the AI actually runs — turning prompts into outputs. Where you run inference determines your cost, latency, data exposure, and resilience.
Inference Options
Each approach has different trade-offs across cost, control, and capability.
Cloud API
Call commercial model APIs (OpenAI, Anthropic, Google). No infrastructure to manage, pay per token. You get the latest models immediately but send data to third parties.
Self-Hosted
Run models on your own GPUs using inference servers like vLLM or Ollama. Higher upfront investment, but data never leaves your infrastructure and per-token cost drops at scale.
Edge
Run small, quantized models directly on devices or local servers. Ultra-low latency, works offline, but limited to smaller models with less capability.
Hybrid
Route requests based on sensitivity and complexity. Sensitive data goes to self-hosted models, general queries go to cloud APIs. An inference gateway handles the routing.
Value Pathways
Strategic benefits of understanding and controlling your inference layer.
Cost Optimisation
Not every task needs a frontier model. Routing simple classification to a small model and complex reasoning to GPT-4 can cut costs by 80% with minimal quality loss.
Latency Control
Self-hosted inference eliminates network round-trips. Edge inference is near-instant. When your agents need to respond in milliseconds, inference location matters.
Data Sovereignty
Some data can't leave your jurisdiction — regulatory requirements, client contracts, or internal policy. Self-hosted and air-gapped inference keeps data where it needs to be.
Resilience
Depending on a single cloud API creates a single point of failure. Multi-provider routing and local fallbacks ensure your systems keep running when APIs go down.
Security Postures
How this works across different deployment models and security requirements.
Call commercial model APIs — OpenAI, Anthropic, Google. No GPU infrastructure to manage, pay per token.
Run models on your own GPUs using inference servers. Data never leaves your infrastructure. Higher upfront cost, full control.
Fully isolated inference with no network connectivity. Models run on dedicated hardware behind physical security controls.
Route requests to cloud APIs for general tasks, self-hosted models for sensitive data. An inference gateway handles the routing.
Need help with your inference strategy?
I design inference architectures that balance cost, performance, and security requirements.