Inference

Inference is where the AI actually runs — turning prompts into outputs. Where you run inference determines your cost, latency, data exposure, and resilience.

Inference Options

Each approach has different trade-offs across cost, control, and capability.

Cloud API

Call commercial model APIs (OpenAI, Anthropic, Google). No infrastructure to manage, pay per token. You get the latest models immediately but send data to third parties.

Self-Hosted

Run models on your own GPUs using inference servers like vLLM or Ollama. Higher upfront investment, but data never leaves your infrastructure and per-token cost drops at scale.

Edge

Run small, quantized models directly on devices or local servers. Ultra-low latency, works offline, but limited to smaller models with less capability.

Hybrid

Route requests based on sensitivity and complexity. Sensitive data goes to self-hosted models, general queries go to cloud APIs. An inference gateway handles the routing.

Value Pathways

Strategic benefits of understanding and controlling your inference layer.

Cost Optimisation

Not every task needs a frontier model. Routing simple classification to a small model and complex reasoning to GPT-4 can cut costs by 80% with minimal quality loss.

Latency Control

Self-hosted inference eliminates network round-trips. Edge inference is near-instant. When your agents need to respond in milliseconds, inference location matters.

Data Sovereignty

Some data can't leave your jurisdiction — regulatory requirements, client contracts, or internal policy. Self-hosted and air-gapped inference keeps data where it needs to be.

Resilience

Depending on a single cloud API creates a single point of failure. Multi-provider routing and local fallbacks ensure your systems keep running when APIs go down.

Need help with your inference strategy?

I design inference architectures that balance cost, performance, and security requirements.