Inference

Inference is where the AI actually runs — turning prompts into outputs. Where you run inference determines your cost, latency, data exposure, and resilience.

Inference Options

Each approach has different trade-offs across cost, control, and capability.

Cloud API

Call commercial model APIs (OpenAI, Anthropic, Google). No infrastructure to manage, pay per token. You get the latest models immediately but send data to third parties.

Self-Hosted

Run models on your own GPUs using inference servers like vLLM or Ollama. Higher upfront investment, but data never leaves your infrastructure and per-token cost drops at scale.

Edge

Run small, quantized models directly on devices or local servers. Ultra-low latency, works offline, but limited to smaller models with less capability.

Hybrid

Route requests based on sensitivity and complexity. Sensitive data goes to self-hosted models, general queries go to cloud APIs. An inference gateway handles the routing.

Value Pathways

Strategic benefits of understanding and controlling your inference layer.

Cost Optimisation

Not every task needs a frontier model. Routing simple classification to a small model and complex reasoning to GPT-4 can cut costs by 80% with minimal quality loss.

Latency Control

Self-hosted inference eliminates network round-trips. Edge inference is near-instant. When your agents need to respond in milliseconds, inference location matters.

Data Sovereignty

Some data can't leave your jurisdiction — regulatory requirements, client contracts, or internal policy. Self-hosted and air-gapped inference keeps data where it needs to be.

Resilience

Depending on a single cloud API creates a single point of failure. Multi-provider routing and local fallbacks ensure your systems keep running when APIs go down.

Security Postures

How this works across different deployment models and security requirements.

SaaS / API Standard

Call commercial model APIs — OpenAI, Anthropic, Google. No GPU infrastructure to manage, pay per token.

Typical tools: OpenAI API, Anthropic API, Google AI

Self-Hosted High

Run models on your own GPUs using inference servers. Data never leaves your infrastructure. Higher upfront cost, full control.

Typical tools: vLLM, Ollama, TGI, llama.cpp

Air-Gapped Maximum

Fully isolated inference with no network connectivity. Models run on dedicated hardware behind physical security controls.

Typical tools: Offline Ollama, llama.cpp, custom runtimes

Hybrid Configurable

Route requests to cloud APIs for general tasks, self-hosted models for sensitive data. An inference gateway handles the routing.

Typical tools: LiteLLM, custom gateway, mixed backends

Need help with your inference strategy?

I design inference architectures that balance cost, performance, and security requirements.

Get in Touch See Deployment Options