Data Warehouses & Data Lakes

When your AI stack generates enough data, you need purpose-built infrastructure to store and query it. Data warehouses handle structured analytics. Data lakes handle the messy, unstructured reality of AI outputs. Most mature deployments need both.

Data Warehouses

Structured, schema-enforced storage optimised for analytical queries. If you need to answer questions like "what was our average response quality score last quarter" or "which agent handled the most tasks this month," this is where those answers live.

Data warehouses excel at structured data — tables with defined columns, consistent types, and relationships you can join on. In an AI context, that means metrics, evaluation scores, cost data, usage analytics, and any structured output your agents produce. The key advantage is queryability: once data lands in a warehouse, your BI tools, dashboards, and reporting pipelines can work with it directly.

Analytics on Agent Performance

Track response latency, quality scores, error rates, and task completion rates across agents, models, and time periods. Warehouses make this data queryable for dashboards and trend analysis.
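As a rough sketch of what this looks like in practice, the query below aggregates per-agent latency, quality, and completion rate. It uses an in-memory SQLite database as a stand-in for a real warehouse; the table and column names (`agent_metrics`, `latency_ms`, `quality_score`) are illustrative, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agent_metrics (
        agent TEXT,
        latency_ms REAL,
        quality_score REAL,
        success INTEGER
    )
""")
conn.executemany(
    "INSERT INTO agent_metrics VALUES (?, ?, ?, ?)",
    [
        ("researcher", 1200.0, 0.82, 1),
        ("researcher", 950.0, 0.78, 1),
        ("summariser", 400.0, 0.91, 1),
        ("summariser", 380.0, 0.64, 0),
    ],
)

# Average latency, quality, and completion rate per agent -- the kind
# of aggregate a dashboard would refresh on a schedule.
rows = conn.execute("""
    SELECT agent,
           AVG(latency_ms)      AS avg_latency,
           AVG(quality_score)   AS avg_quality,
           AVG(success) * 100.0 AS completion_pct
    FROM agent_metrics
    GROUP BY agent
    ORDER BY agent
""").fetchall()

for agent, latency, quality, pct in rows:
    print(f"{agent}: {latency:.0f} ms, quality {quality:.2f}, {pct:.0f}% complete")
```

The same GROUP BY shape extends naturally to slicing by model or time period.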

Business Intelligence

Connect AI operational data to business outcomes. How does prompt template A compare to template B in terms of user satisfaction? Which workflows generate the most value? Warehouses let you answer these questions with SQL.
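A minimal sketch of the template-A-versus-template-B question, again using SQLite as a stand-in warehouse. The `interactions` table and `satisfied` flag are hypothetical; a real deployment would join on whatever satisfaction signal it actually collects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (template TEXT, satisfied INTEGER)")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?)",
    [("A", 1), ("A", 1), ("A", 0), ("B", 1), ("B", 0), ("B", 0)],
)

# Satisfaction rate per prompt template.
rows = conn.execute("""
    SELECT template,
           COUNT(*)       AS n,
           AVG(satisfied) AS satisfaction_rate
    FROM interactions
    GROUP BY template
    ORDER BY template
""").fetchall()
```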

Compliance & Auditing

Regulated industries need structured, queryable records of every AI decision. Warehouses provide the schema enforcement, retention policies, and audit trails that compliance teams require.

Cost Attribution

Token usage, API costs, and compute spend — broken down by team, project, model, and task type. Without structured cost data in a warehouse, budget conversations are guesswork.
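A sketch of a cost-attribution query. The `usage` schema and the per-1k-token pricing column are assumptions for illustration; real pricing and dimensions will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage (team TEXT, model TEXT, tokens INTEGER, usd_per_1k REAL)"
)
conn.executemany(
    "INSERT INTO usage VALUES (?, ?, ?, ?)",
    [
        ("search", "large-model", 500_000, 0.03),
        ("search", "small-model", 2_000_000, 0.002),
        ("support", "large-model", 250_000, 0.03),
    ],
)

# Spend broken down by team and model, biggest line items first.
rows = conn.execute("""
    SELECT team, model, SUM(tokens * usd_per_1k / 1000.0) AS cost_usd
    FROM usage
    GROUP BY team, model
    ORDER BY cost_usd DESC
""").fetchall()
```

With this in place, "which team is driving the large-model bill" is a one-line query rather than a spreadsheet exercise.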

Data Lakes

Unstructured and semi-structured storage for everything that doesn't fit neatly into tables. Logs, documents, embeddings, raw model outputs, audio transcripts, images — the messy reality of what AI systems actually produce.

Data lakes store data in its native format without requiring you to define a schema upfront. That flexibility is essential for AI workloads, where outputs range from structured JSON to freeform text to high-dimensional embedding vectors. The trade-off is that lakes are harder to query directly — you typically need processing pipelines to extract structured insights from lake data.

Raw Conversation Logs

Full conversation transcripts with metadata, tool calls, and intermediate reasoning. Too unstructured for a warehouse, too valuable to discard. Lakes hold the raw material that gets processed into structured analytics later.
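A minimal sketch of capturing raw conversations as date-partitioned JSONL. It writes to a local temp directory as a stand-in for object storage (in production the prefix would be an `s3://` or `gs://` bucket); the path layout and field names are illustrative.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

lake_root = Path(tempfile.mkdtemp())

def log_conversation(record: dict) -> Path:
    """Append one raw conversation record, partitioned by day."""
    partition = lake_root / "conversations" / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path

path = log_conversation({
    "conversation_id": "c-001",
    "messages": [{"role": "user", "content": "Summarise this report"}],
    "tool_calls": [],
})
```

The `dt=YYYY-MM-DD` partition key is a common convention that keeps later batch processing cheap: pipelines can read only the days they need.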

Embedding Stores

High-dimensional vector representations of documents, chunks, and queries. These are the backbone of RAG systems and similarity search. Lakes (often backed by vector databases) are the natural home for this data.
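To make the similarity-search idea concrete, here is a toy nearest-neighbour lookup over a tiny in-memory store using cosine similarity. Real systems use a vector database or an approximate-nearest-neighbour index; the three-dimensional vectors and document names here are purely illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy embedding store: document id -> vector.
store = {
    "doc-invoices": [0.9, 0.1, 0.0],
    "doc-contracts": [0.1, 0.9, 0.1],
    "doc-receipts": [0.8, 0.2, 0.1],
}

query = [0.9, 0.1, 0.0]
best = max(store, key=lambda k: cosine(store[k], query))
```

Swapping the brute-force `max` for an ANN index is what makes this pattern viable at millions of vectors.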

Generated Artefacts

Reports, documents, images, code files, and other outputs that agents create. These need to be stored, versioned, and made searchable — but they don't conform to any single schema.

Training & Evaluation Datasets

Collected examples for fine-tuning, evaluation benchmarks, and few-shot libraries. Often a mix of input-output pairs, human annotations, and quality labels. Lakes hold the raw data; pipelines transform it into training-ready formats.
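A sketch of that lake-to-training-set transformation: join raw outputs with human quality labels and keep only the accepted ones. The field names and the "score of 4 or above" acceptance threshold are assumptions, not a fixed convention.

```python
# Stand-ins for records read from the lake.
raw_outputs = [
    {"id": "r1", "prompt": "Summarise X", "completion": "X is ..."},
    {"id": "r2", "prompt": "Summarise Y", "completion": "Y is ..."},
]
annotations = {"r1": 5, "r2": 2}  # human quality scores, 1-5

# Keep only highly-rated pairs as fine-tuning candidates.
training_pairs = [
    {"prompt": r["prompt"], "completion": r["completion"]}
    for r in raw_outputs
    if annotations.get(r["id"], 0) >= 4
]
```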

Warehouse vs Lake vs Both

The choice isn't either/or. Most production AI systems need both — and the interesting design work is in how data flows between them.

Start with a lake

If you're early in your AI deployment, start by storing everything in a lake. Raw logs, outputs, conversations — capture it all. You don't know yet which questions you'll want to answer, so don't constrain yourself with schemas too early. You can always extract structured data into a warehouse later.

Add a warehouse when you need dashboards

Once you have recurring questions — "how much did we spend last week," "what's our average eval score" — that's when a warehouse earns its keep. Build ETL pipelines that extract structured metrics from the lake into warehouse tables.
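A minimal sketch of such an ETL step: parse raw JSONL events from the lake and load the structured fields into a warehouse table, skipping junk. The file layout, field names, and schema are all illustrative.

```python
import json
import sqlite3

raw_events = [  # stand-in for lines read from a lake file
    '{"agent": "planner", "latency_ms": 820, "cost_usd": 0.004}',
    '{"agent": "planner", "latency_ms": 910, "cost_usd": 0.005}',
    'not valid json',  # lakes accumulate junk; ETL must tolerate it
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (agent TEXT, latency_ms REAL, cost_usd REAL)")

loaded = 0
for line in raw_events:
    try:
        e = json.loads(line)
        conn.execute(
            "INSERT INTO metrics VALUES (?, ?, ?)",
            (e["agent"], e["latency_ms"], e["cost_usd"]),
        )
        loaded += 1
    except (json.JSONDecodeError, KeyError):
        continue  # skip malformed records rather than failing the batch

total_cost = conn.execute("SELECT SUM(cost_usd) FROM metrics").fetchone()[0]
```

The same extract-validate-load shape holds whether the target is SQLite, BigQuery, or Snowflake; only the connectors change.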

The lakehouse pattern

Modern architectures blur the line. Lakehouse platforms (Databricks, Delta Lake, Apache Iceberg) let you run SQL queries directly on lake data with warehouse-like performance. If you're building from scratch, this is often the pragmatic choice.

Don't over-engineer early

I've seen teams spend months designing warehouse schemas before they have meaningful data. Start simple: object storage for raw data, a lightweight database for structured metrics. Scale the infrastructure when the data volume and query complexity demand it.

How It Connects to the AI Stack

Warehouses and lakes aren't isolated infrastructure — they're load-bearing components of the feedback loop that makes AI systems improve over time.

The connection points are concrete. Evaluation datasets stored in the lake feed back into your eval framework, telling you whether model or prompt changes actually helped. Agent output logs in the warehouse power dashboards that reveal which workflows are working and which need tuning. Conversation archives get mined for patterns that improve prompt templates and system prompts. Cost data in the warehouse drives budget decisions about which models to use where.

Analytics on agent outputs

Structured metrics from agent runs — latency, token counts, quality scores, error rates — land in the warehouse. This is the data that powers operational dashboards and informs decisions about model selection, prompt tuning, and resource allocation.

Training data pipelines

Raw outputs and human annotations in the lake get processed into training-ready datasets. Whether you're fine-tuning a model or building a few-shot example library, the lake is the source of truth for what your system has produced and how humans rated it.

Evaluation datasets

Curated input-output pairs with quality labels, stored in the lake and versioned like code. Every time you change a prompt or swap a model, these datasets tell you whether you made things better or worse. Without them, you're optimising blind.
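One lightweight way to version a dataset like code is to derive a version identifier from its content, so any edited example yields a new version. The JSONL-style canonical serialisation and field names below are illustrative.

```python
import hashlib
import json

def dataset_version(examples: list[dict]) -> str:
    """Content-addressed version id: changes whenever any example changes."""
    canonical = "\n".join(json.dumps(e, sort_keys=True) for e in examples)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

eval_set = [
    {"input": "Summarise X", "expected": "X is ...", "label": "good"},
    {"input": "Summarise Y", "expected": "Y is ...", "label": "bad"},
]

v1 = dataset_version(eval_set)
eval_set[1]["label"] = "good"   # edit one example
v2 = dataset_version(eval_set)  # version id changes with the content
```

Recording the version id alongside every eval run makes results reproducible: you always know exactly which dataset a score was measured against.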

Compliance archives

Regulated workflows need immutable records of every AI decision, the context that informed it, and the output it produced. Warehouses enforce the structure and retention policies. Lakes hold the raw evidence.

Where It Fits in the Broader Stack

Data warehouses and lakes are the deep storage layer. They sit beneath the operational storage (conversation databases, prompt libraries) and feed the analytics and reporting tools that consume their data.

Think of it as three tiers. The operational tier handles real-time reads and writes — your conversation database, your vector store, your prompt template system. The analytical tier is the warehouse and lake — optimised for batch queries, historical analysis, and large-scale data processing. The intelligence tier sits on top — dashboards, eval frameworks, and pipelines that mine the analytical tier to improve the operational tier.

Getting data flowing between these tiers is where the architecture work happens. The good news is that the patterns are well-established. ETL pipelines, change data capture, scheduled batch exports — none of this is novel infrastructure. The AI-specific challenge is handling the volume and variety of data that modern AI systems produce, and designing schemas flexible enough to accommodate new output types as your stack evolves.

Need help with your data infrastructure?

I help teams design the storage architecture that turns AI outputs into compounding assets — from data lake design to warehouse schemas to the pipelines that connect them to the rest of the stack.