Inference

A model is just a file until you run it somewhere. Inference is the infrastructure layer that turns weights into answers — and the deployment pattern you choose shapes everything from cost to compliance.

Four Deployment Patterns

Most production systems use more than one of these. The trick is knowing which pattern fits which workload — and when to mix them.

Cloud APIs: simplest path to production

Call a model through an API endpoint — OpenAI, Anthropic, Google, Mistral, and others. No infrastructure to manage, no GPUs to provision. You send a request, you get a response. This is where most projects start, and for many workloads it's where they should stay.

  • Pay per token, scale instantly
  • Always running the latest model versions
  • Data leaves your network on every call
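To make the pay-per-token model concrete, here is a minimal sketch of an OpenAI-style chat-completions request. The model name, endpoint, and environment variable in the comment are illustrative assumptions; check your provider's docs for the exact schema.

```python
import json

def build_chat_request(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Build an OpenAI-style chat-completions payload.

    The model name is illustrative; pick from your provider's catalogue.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_chat_request("Summarise our Q3 incident report.")

# Sending it is a single HTTPS POST. Note that the payload (your data)
# leaves your network on every call:
#
#   requests.post(
#       "https://api.openai.com/v1/chat/completions",
#       headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
#       json=payload,
#   )
print(json.dumps(payload, indent=2))
```

The same request shape also works against most self-hosted serving engines that expose an OpenAI-compatible endpoint, which makes it easy to start on a cloud API and migrate later.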

Self-Hosted: control cost and latency

Run open-weight models on your own cloud infrastructure using serving engines like vLLM, Ollama, or TGI. You pick the model, the hardware, and the configuration. More work upfront, but you control the economics and can optimise for your specific workload.

  • Fixed compute cost, predictable at scale
  • Tune batch sizes, quantisation, and caching
  • You own the ops — updates, scaling, monitoring
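One of the first self-hosting decisions is whether a model fits your GPU at a given quantisation level. A back-of-the-envelope sketch, counting weights only (KV cache, activations, and framework overhead add several GB more depending on batch size and context length):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone.

    Excludes KV cache, activations, and serving-engine overhead,
    which grow with batch size and context length.
    """
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9

# A 7B-parameter model at different precisions:
print(weight_memory_gb(7e9, 16))  # fp16  -> 14.0 GB
print(weight_memory_gb(7e9, 8))   # int8  ->  7.0 GB
print(weight_memory_gb(7e9, 4))   # 4-bit ->  3.5 GB
```

This is why quantisation shows up in every self-hosting conversation: halving the bits roughly halves the memory bill, freeing headroom for larger batches on the same card.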

On-Prem: data sovereignty and air-gapped environments

Models running on hardware you physically control — in your data centre, behind your firewall. This is the answer when regulation or policy says data cannot leave the building. It's the most expensive pattern to operate, but sometimes it's the only compliant option.

  • Full data sovereignty — nothing leaves the network
  • Required for air-gapped and classified environments
  • Highest capital and operational cost

Edge: on-device, low latency

Small, quantised models running directly on phones, laptops, IoT devices, or embedded hardware. No network round-trip, no API cost per call. The tradeoff is capability — edge models are smaller and less powerful, but for the right use case they're unbeatable on latency and privacy.

  • Zero network dependency — works offline
  • Low, predictable inference latency
  • Limited to smaller, quantised models
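Edge latency is easy to reason about because autoregressive decoding is usually memory-bandwidth-bound: generating each token streams every weight through memory once. A rough sketch; the bandwidth figure is an assumed laptop-class number, not a measurement:

```python
def tokens_per_second(weight_bytes: float, bandwidth_bytes_s: float) -> float:
    """Rough decode speed for a memory-bandwidth-bound model:
    each generated token reads all weights from memory once."""
    return bandwidth_bytes_s / weight_bytes

# A 3B model quantised to 4 bits is roughly 1.5 GB of weights; assume
# ~100 GB/s of usable memory bandwidth (illustrative laptop-class figure).
tps = tokens_per_second(1.5e9, 100e9)
print(round(tps, 1))  # -> 66.7 tokens/s, with zero network round-trip
```

The same arithmetic explains the capability tradeoff: a model ten times larger would decode roughly ten times slower on the same device, which is why edge deployments stick to small, heavily quantised models.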

What's the Difference?

Each pattern makes different tradeoffs. Here's how they compare on the dimensions that matter most.

                   Cloud APIs            Self-Hosted            On-Prem                Edge
Cost model         Pay per token         Fixed compute          Capital + ops          Device hardware
Latency            Network-bound         Tuneable               Low (local network)    Lowest
Data sovereignty   Data leaves network   Your cloud tenant      Full control           On-device only
Setup complexity   Minutes               Hours to days          Weeks to months        Varies widely
Model selection    Provider's catalogue  Any open-weight model  Any open-weight model  Small/quantised only
Scaling            Automatic             Manual or auto-scale   Hardware-bound         Per-device
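The cost-model row is where most decisions get made, and the comparison reduces to a break-even calculation. All prices below are illustrative assumptions, not quotes; substitute your provider's actual rates:

```python
def breakeven_tokens_per_month(gpu_monthly_usd: float,
                               api_usd_per_million_tokens: float) -> float:
    """Monthly token volume above which a fixed-cost GPU instance
    becomes cheaper than paying per token through an API."""
    return gpu_monthly_usd / api_usd_per_million_tokens * 1_000_000

# Illustrative numbers: a $1,500/month GPU instance vs an API charging
# $2 per million tokens (blended input + output rate).
tokens = breakeven_tokens_per_month(1500, 2.0)
print(f"{tokens:,.0f} tokens/month")  # -> 750,000,000 tokens/month
```

Below the break-even volume, pay-per-token wins; above it, self-hosting does, provided you can keep the GPU busy enough that the fixed cost is actually amortised.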

In practice, I often recommend a hybrid approach: cloud APIs for development and bursty workloads, self-hosted for steady-state production traffic, and on-prem or edge only where compliance or latency demands it.
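That hybrid recommendation can be sketched as a simple routing policy. The request attributes and thresholds here are assumptions to adapt to your own compliance rules and latency budgets:

```python
from dataclasses import dataclass

@dataclass
class Request:
    sensitive: bool         # data that must not leave the network
    latency_budget_ms: int  # end-to-end deadline for this request
    steady_state: bool      # part of predictable, sustained traffic?

def route(req: Request) -> str:
    """Pick a deployment pattern per request class (illustrative policy)."""
    if req.sensitive:
        return "on-prem"        # compliance overrides everything else
    if req.latency_budget_ms < 50:
        return "edge"           # no network round-trip to spare
    if req.steady_state:
        return "self-hosted"    # predictable load amortises fixed compute
    return "cloud-api"          # bursty, non-sensitive traffic

print(route(Request(sensitive=True, latency_budget_ms=500, steady_state=False)))
# -> on-prem
```

The ordering matters: compliance constraints are hard requirements and sit first, while the cost-driven choice between cloud and self-hosted only applies to traffic that clears the other two checks.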

Need help choosing a deployment pattern?

I help teams figure out which inference strategy fits their workload, budget, and compliance requirements — and build the infrastructure to make it real.