
Building Production RAG Systems: Lessons Learned

What breaks in enterprise RAG — and what to design for instead


Iulian Mihai

Principal Cloud Architect & AI Innovation Leader


Most RAG demos work beautifully right up until you put them in front of real users, real data, and real compliance requirements.

I learned this the hard way while building production Retrieval-Augmented Generation systems at MEM.Zone. What worked in notebooks failed under load. What looked “cloud-native” on slides collapsed under governance. And what the documentation implied was possible turned out to be… optimistic.

RAG is not a feature. It’s an architecture. And in enterprise environments, architecture is where most AI initiatives quietly die.

RAG fails when you treat it like an LLM problem

The biggest mistake I see teams make is framing RAG as a model selection exercise.

Which embedding model? Which LLM? Which prompt template?

Those choices matter, but they’re not where production systems break. They break in ingestion pipelines, identity boundaries, vector lifecycle management, and operational visibility.

At MEM.Zone, the model was never the bottleneck. The system around it was.

In production, the hard part is rarely the model — it’s everything around it.

Data ingestion is the real system

In production, ingestion is not a batch job you run once.

Documents change. Metadata changes. Access rules change. Entire datasets get reclassified overnight because legal noticed something late in the process.

We ended up treating ingestion as a first-class pipeline, not a pre-step.

Azure Functions handled document normalization. Azure Data Factory orchestrated bulk reprocessing. Storage accounts were split by sensitivity level, not by convenience. Each ingestion run produced traceable artifacts: document hash, chunk count, embedding version.

This wasn’t over-engineering. It was survival.

Without this, you don’t know what the model is actually answering from.
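To make that concrete, here is a minimal sketch of the kind of traceable artifact each ingestion run can produce. The dataclass and field names are illustrative, not the actual MEM.Zone pipeline code; the point is that document hash, chunk count, and embedding version get recorded together, per run.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class IngestionManifest:
    """Traceable artifact produced by one ingestion run."""
    document_id: str
    document_sha256: str
    chunk_count: int
    embedding_version: str   # pinned explicitly, never "latest"
    sensitivity_label: str   # drives which storage account the chunks land in
    ingested_at: str

def build_manifest(document_id: str, raw_bytes: bytes,
                   chunks: list[str], embedding_version: str,
                   sensitivity_label: str) -> IngestionManifest:
    return IngestionManifest(
        document_id=document_id,
        document_sha256=hashlib.sha256(raw_bytes).hexdigest(),
        chunk_count=len(chunks),
        embedding_version=embedding_version,
        sensitivity_label=sensitivity_label,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )

# Persist the manifest next to the chunks so every answer can be traced
# back to a specific ingestion run.
manifest = build_manifest("contract-0042", b"...", ["chunk a", "chunk b"],
                          embedding_version="text-embedding-3-large@v2",
                          sensitivity_label="confidential")
print(json.dumps(asdict(manifest), indent=2))
```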

Chunking strategy is a governance decision

Chunk size is often discussed as a retrieval optimization.

In practice, it’s a governance decision.

Smaller chunks improve recall but increase the risk of leaking partial context across security boundaries. Larger chunks reduce retrieval accuracy but are easier to reason about from an access-control perspective.

We settled on deterministic chunking with explicit document lineage. Every chunk knew where it came from, which policy applied to it, and who was allowed to retrieve it.
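A minimal sketch of what deterministic, lineage-aware chunking can look like. The fixed window size, the overlap, and the policy and group fields are assumptions for illustration; what matters is that a chunk’s identity is a pure function of the document and its offsets, and that access metadata travels with every chunk.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str           # stable: same document + same offsets -> same id
    document_id: str        # lineage back to the source document
    start: int
    end: int
    text: str
    policy_id: str          # which access policy governs this chunk
    allowed_groups: tuple   # who is allowed to retrieve it

def chunk_document(document_id: str, text: str, policy_id: str,
                   allowed_groups: tuple,
                   size: int = 800, overlap: int = 100) -> list[Chunk]:
    """Deterministic fixed-window chunking: no randomness, no model-driven
    splits, so the same input always yields the same chunks and ids."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        end = min(start + size, len(text))
        chunk_id = hashlib.sha256(f"{document_id}:{start}:{end}".encode()).hexdigest()[:16]
        chunks.append(Chunk(chunk_id, document_id, start, end,
                            text[start:end], policy_id, allowed_groups))
        if end == len(text):
            break
    return chunks
```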

This ruled out several “smart” chunking libraries early on. They optimized for relevance, not accountability.

In regulated environments, accountability wins.

Vector stores are not just databases

Vector databases are often sold as infrastructure components.

In reality, they become compliance-critical systems very quickly.

At MEM.Zone, we evaluated multiple options before settling on a managed vector store deployed strictly in EU regions. Some offerings were technically superior but relied on global control planes. That alone disqualified them.

We also discovered an undocumented behavior: certain managed services replicate metadata outside the primary region for monitoring. The embeddings stayed local. The metadata didn’t.

That matters under GDPR.

We ended up isolating vector storage per environment and per tenant, even though it increased cost. It made audit conversations much shorter.
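One way to express that isolation in code is to make the index name a pure function of environment and tenant, and to fail fast on anything outside the approved regions. The naming convention and region list below are illustrative assumptions, not the exact scheme we used.

```python
APPROVED_REGIONS = {"westeurope", "northeurope"}  # EU-only by policy (assumed list)

def vector_index_name(environment: str, tenant_id: str, region: str) -> str:
    """Deterministic index naming: one index per environment and per tenant,
    deployed only in approved EU regions."""
    if region not in APPROVED_REGIONS:
        raise ValueError(f"Region {region!r} is not approved for vector storage")
    if environment not in {"dev", "test", "prod"}:
        raise ValueError(f"Unknown environment {environment!r}")
    return f"rag-{environment}-{tenant_id}-chunks"

# Prod traffic for tenant 'contoso' never shares an index with anyone else.
print(vector_index_name("prod", "contoso", "westeurope"))  # rag-prod-contoso-chunks
```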

Identity has to reach the retriever

Most RAG diagrams stop at “retrieve relevant chunks.”

That’s not enough in enterprise systems.

Retrieval must be identity-aware. The retriever needs to know who is asking, not just what they are asking.

We propagated Entra ID claims all the way into the retrieval layer. Queries were filtered before similarity scoring, not after. This avoided a whole class of “we retrieved it but didn’t show it” bugs that still count as data exposure in some regulatory interpretations.
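A sketch of what identity-aware retrieval can look like. The vector store client and its filter syntax are hypothetical here; the point is that the caller’s claims become a filter the store evaluates as part of the query, before similarity scoring, never a trim applied to the results afterwards.

```python
from dataclasses import dataclass

@dataclass
class UserContext:
    """Subset of Entra ID claims actually needed at retrieval time."""
    user_id: str
    group_ids: list[str]

def retrieve(query_embedding: list[float], user: UserContext, store, top_k: int = 5):
    # The security filter is part of the vector query itself, so chunks the
    # caller cannot see are excluded before similarity scoring, not removed
    # from the result set afterwards. 'store' is a hypothetical client.
    security_filter = {"allowed_groups": {"any_of": user.group_ids}}
    return store.search(
        vector=query_embedding,
        filter=security_filter,   # evaluated by the store, pre-scoring
        top_k=top_k,
    )
```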

This is one of those details the Azure documentation doesn’t spell out, but auditors care about it.

Prompt engineering is the least interesting part

We spent far less time on prompts than most people expect.

Once retrieval is clean, scoped, and predictable, prompts become boring. That’s a good thing.

The real work was in grounding responses with citations, enforcing deterministic system messages, and logging every model interaction with enough context to replay incidents later.
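A minimal sketch of the kind of interaction record that makes incidents replayable. The field names are illustrative, and the print call stands in for whatever durable sink you already have.

```python
import json
import uuid
from datetime import datetime, timezone

def log_interaction(user_id: str, question: str, retrieved_chunk_ids: list[str],
                    system_message_version: str, model: str, answer: str,
                    citations: list[str]) -> str:
    """Persist enough context to replay an answer later: who asked what,
    which chunks were retrieved, which system message and model produced
    the response, and which citations were returned."""
    record = {
        "interaction_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "question": question,
        "retrieved_chunk_ids": retrieved_chunk_ids,
        "system_message_version": system_message_version,
        "model": model,
        "answer": answer,
        "citations": citations,
    }
    # In a real setup this goes to durable, queryable storage.
    print(json.dumps(record))
    return record["interaction_id"]
```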

When a business user says “the assistant gave me the wrong answer,” you need to know exactly why.

Without traceability, you’re guessing.

Observability is not optional

RAG systems fail silently.

Latency creeps up. Recall degrades. Cost spikes. Users lose trust long before anything breaks.

We instrumented everything. Embedding generation time. Vector query latency. Top-k hit rates. Token consumption per request. Failed retrievals.

Azure Monitor and Application Insights were sufficient, but only because we treated AI components like any other production service. No special treatment. No magic.
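A minimal sketch of that instrumentation, assuming your existing Application Insights integration picks up structured log records; the stage names and dimensions are illustrative.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag.metrics")

@contextmanager
def timed(metric_name: str, **dimensions):
    """Measure a single stage (embedding, vector query, generation) and emit
    it as a structured log record the monitoring pipeline can aggregate."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("metric", extra={"metric": metric_name,
                                     "value_ms": round(elapsed_ms, 2),
                                     **dimensions})

# Usage: wrap each stage of the request path.
with timed("vector_query_latency", index="rag-prod-contoso-chunks", top_k=5):
    pass  # run the vector query here
```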

If you can’t see it, you can’t defend it in front of leadership.

Cost control comes from architecture, not prompts

RAG costs don’t explode because of the LLM.

They explode because teams re-embed entire corpora unnecessarily, retrieve too much context, or allow uncontrolled query patterns.

We versioned embeddings explicitly. No silent upgrades. No background reprocessing.

We capped retrieval depth and enforced hard token budgets. Not because finance asked us to, but because predictability is part of system quality.
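A sketch of how those limits can be enforced in the retrieval path rather than in prompts. The specific numbers are illustrative; the embedding version pin lives in the ingestion manifest shown earlier.

```python
MAX_TOP_K = 8              # hard cap on retrieval depth, regardless of caller input
MAX_CONTEXT_TOKENS = 3000  # hard budget for retrieved context per request

def select_context(chunks: list[dict], requested_top_k: int) -> list[dict]:
    """Apply the retrieval-depth cap and the token budget before anything
    reaches the model. Chunks are assumed to carry a precomputed 'tokens' count."""
    top_k = min(requested_top_k, MAX_TOP_K)
    selected, budget = [], MAX_CONTEXT_TOKENS
    for chunk in chunks[:top_k]:
        if chunk["tokens"] > budget:
            break
        selected.append(chunk)
        budget -= chunk["tokens"]
    return selected
```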

FinOps conversations go much better when the architecture already respects limits.

What I’d do again — and what I wouldn’t

I would absolutely build RAG again in production. The value is real when done correctly.

I would not rush into it without governance, identity, and observability in place. Every shortcut taken early shows up later as a blocker.

The pattern that emerged was clear: successful RAG systems look much more like enterprise platforms than AI experiments.

That’s uncomfortable for teams hoping AI will simplify things. In reality, it raises the bar.

And that’s fine, as long as you design for it from day one.

If you’re building RAG for a regulated or complex environment: treat ingestion, identity, governance, and observability as the product — and let prompts be the boring part.

Want a sanity check on your architecture?

If you want a second opinion on a production RAG design (governance, identity, retrieval, observability, and cost), you can start from:

Tags

#RAG #LLMs #AzureOpenAI #AIGovernance #Observability #Security #Evaluation #VectorSearch
