AI Architecture · 10 min read

Agentic AI in Production: Where Most Architectures Break

Why every agentic AI pilot I reviewed last quarter failed security — and the five patterns that survive


Iulian Mihai

Principal Cloud Architect & AI Innovation Leader


I reviewed three agentic AI pilots last quarter that all looked impressive in demo. Every single one failed the moment security got involved.

They all had the same flaw. The AI could do too much, and nobody could explain how to stop it.

“Can it do something?” is where things break

The first version always works.

A chatbot over a few documents. Maybe a RAG setup on Azure AI Foundry. Some GPT-4o responses that look convincing enough. People get excited.

Then someone asks the only question that matters.

“Can it actually take action?”

That’s where the architecture collapses if you didn’t design for it from day one.

I’ve built systems where the AI doesn’t just answer. It reads enterprise data, triggers workflows, generates outputs, sends emails, and orchestrates multiple API calls in sequence. Not as a script. As a decision-making loop.

That’s the moment your AI stops being harmless.

And becomes an attack surface.

The capability problem nobody wants to admit

The uncomfortable truth is simple.

The more useful your AI becomes, the more dangerous it becomes.

I’ve seen teams obsess over whether Azure OpenAI trains on their data. That’s the wrong conversation. In enterprise environments, especially under GDPR and internal audit pressure, the real questions are operational.

What it can access. What it can trigger. What identity it uses. And how fast you can kill it when something goes wrong.

If you can’t answer those in one sentence each, you’re not ready for production.

In production, the hard part is never the model — it’s everything around it.

The only identity that works is not human

We killed the “use the user token” approach early.

It looks clean on paper. The AI acts on behalf of the user. Permissions are inherited. No extra identity to manage.

It fails instantly under audit.

You can’t explain who did what. You can’t limit blast radius. And if something leaks, you’ve just handed full user impersonation to a system you don’t fully control.

The only model that survived security review was a dedicated robot identity.

A service account with its own permissions. Its own audit trail. Its own lifecycle.

I’ve implemented this in Azure using Managed Identity combined with downstream platform identities. The key is separation. The AI reasons. The robot acts.

The moment you remove that boundary, you lose control.

And revocation becomes theoretical instead of real.
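The separation above can be sketched in a few lines. This is an illustrative stub, not Azure SDK code: in a real deployment the credential would come from a Managed Identity, and the names (`RobotIdentity`, `Executor`, the scopes) are hypothetical. The point is the shape: the model layer never holds credentials, and every action runs through a dedicated identity with its own permissions and audit trail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RobotIdentity:
    """Dedicated service identity: its own permissions, audit trail, lifecycle."""
    client_id: str
    allowed_scopes: frozenset

class Executor:
    """Acts under the robot identity, never the end user's token."""
    def __init__(self, identity: RobotIdentity):
        self._identity = identity
        self.audit_log = []

    def act(self, scope: str, operation: str) -> str:
        # Blast radius is bounded by the identity's grants, not by the prompt.
        if scope not in self._identity.allowed_scopes:
            raise PermissionError(f"{scope} not granted to {self._identity.client_id}")
        self.audit_log.append((self._identity.client_id, scope, operation))
        return f"executed {operation}"

robot = RobotIdentity("bot-prod-01", frozenset({"tickets.read"}))
executor = Executor(robot)
print(executor.act("tickets.read", "list_open_tickets"))
# executor.act("tickets.write", "close_ticket") would raise PermissionError
```

Revoking the robot identity kills every capability at once, which is what makes revocation real rather than theoretical.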

The tool boundary is the real security layer

Most teams try to secure the prompt.

That’s a mistake.

Prompts are not security boundaries. They are suggestions to a probabilistic system.

We enforced security in code.

The model could not access anything directly. No filesystem. No environment variables. No HTTP calls. No dynamic execution. Only a fixed set of server-side tools.

Hardcoded. Versioned. Reviewed.

Each tool mapped to a specific API operation. Each one validated inputs. Each one returned structured data. Nothing more.

The important part is what we didn’t build.

There is no tool to read secrets. No tool to inspect runtime state. No tool to execute arbitrary logic.

So when someone inevitably tries prompt injection, nothing happens.

The model can “want” to do something. But it physically cannot.

That’s the difference between a demo and a system that survives a penetration test.
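A minimal sketch of that tool boundary, with hypothetical names (`get_order_status` and the registry are invented for illustration): the model can only name a tool from a fixed, reviewed registry, every input is validated before any code runs, and an unknown tool name is rejected outright rather than looked up dynamically.

```python
import json

def get_order_status(order_id: str) -> dict:
    # Validate input before touching any downstream API.
    if not (order_id.isalnum() and len(order_id) <= 12):
        raise ValueError("invalid order id")
    # Return structured data only; never raw API responses.
    return {"order_id": order_id, "status": "shipped"}

# Hardcoded, versioned, reviewed. No dynamic registration, no eval.
TOOLS = {"get_order_status": get_order_status}

def dispatch(tool_call_json: str) -> dict:
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        # A prompt-injected request for "read_secrets" dies right here.
        raise PermissionError(f"unknown tool: {call['name']}")
    return tool(**call["arguments"])

print(dispatch('{"name": "get_order_status", "arguments": {"order_id": "A123"}}'))
```

The security property lives in `dispatch`, not in the prompt: the model can emit any tool name it likes, but only registered, validated operations ever execute.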

Secrets are not a configuration problem

We moved all secrets into Azure Key Vault with Managed Identity access.

That part is standard.

What most teams get wrong is exposure.

If your AI ever sees a token, even once, you’ve already lost.

We enforced a strict rule. Secrets are used by the server process only. They never enter the model context. They never appear in logs. They never cross the boundary.

This sounds obvious. It’s not.

I’ve reviewed implementations where debugging logs accidentally included headers. Or where tool outputs returned raw API responses with embedded tokens.

Those systems pass tests. They fail audits.
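The isolation rule can be made concrete with a sketch like the one below. `fetch_secret` is a placeholder for a Key Vault lookup (in Azure this would be a `SecretClient` call under Managed Identity), and the token pattern is invented; the two enforced properties are the ones from the text: the secret is used inside the server process only, and a log filter scrubs it before anything is written.

```python
import logging
import re

def fetch_secret(name: str) -> str:
    # Placeholder for a Key Vault lookup; never hardcode real secrets.
    return "sk-live-abc123"

class RedactSecrets(logging.Filter):
    """Scrub token-shaped strings from every log record."""
    PATTERN = re.compile(r"sk-live-\w+")
    def filter(self, record):
        record.msg = self.PATTERN.sub("[REDACTED]", str(record.msg))
        return True

logger = logging.getLogger("tools")
logger.addFilter(RedactSecrets())

def call_downstream_api(path: str) -> dict:
    token = fetch_secret("api-token")              # used server-side only
    headers = {"Authorization": f"Bearer {token}"}
    logger.info(f"calling {path} headers={headers}")  # redacted by the filter
    # The token never appears in the return value, so it never
    # reaches tool output or model context.
    return {"path": path, "ok": True}
```

The debugging-log failure mode described above is exactly what the filter exists for: even an accidental `headers` dump comes out redacted.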

RAG is not about accuracy. It’s about liability

We didn’t add RAG to improve answers. We added it to control what the AI is allowed to say.

That’s a different goal.

In regulated environments, hallucination is not a UX problem. It’s a compliance problem.

We built a curated knowledge base. Not scraped content. Not auto-ingested docs. Hand-reviewed material.

Vector search on top. Grounded responses only.

The side effect is better answers. The real benefit is traceability.

When the AI says something, we know where it came from.

And when it’s wrong, we know what to fix.
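The traceability contract can be sketched as follows. This is a toy: a real system would use embeddings and vector search (for example Azure AI Search) rather than keyword overlap, and the knowledge-base entries are invented. What matters is the shape of the output: every answer carries the IDs of the curated sources it came from, and no grounding means no answer rather than a guess.

```python
# Hand-reviewed entries, each with a stable source id for traceability.
CURATED_KB = [
    {"id": "kb-001", "text": "Refunds are processed within 14 days."},
    {"id": "kb-002", "text": "Support is available Monday to Friday."},
]

def retrieve(question: str, threshold: int = 2):
    # Toy keyword-overlap scoring; a vector index stands in here in production.
    q = set(question.lower().split())
    scored = [(len(q & set(d["text"].lower().split())), d) for d in CURATED_KB]
    return [d for score, d in scored if score >= threshold]

def answer(question: str) -> dict:
    sources = retrieve(question)
    if not sources:
        # No grounding means no answer, not a hallucination.
        return {"answer": None, "sources": []}
    return {"answer": sources[0]["text"], "sources": [d["id"] for d in sources]}

print(answer("when are refunds processed"))
```

When a response is wrong, the `sources` list points straight at the knowledge-base entry to fix.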

The clean architecture didn’t survive contact with reality

The first version was elegant.

Unified identity model. Consistent API versions. Clean abstractions across services.

It lasted about two weeks.

Then we hit version mismatches across APIs that returned 404 instead of 403. Identity systems that used three different ID formats for the same user. Endpoints that silently changed behavior between regions.

Frankfurt behaved differently from Amsterdam. EMEA endpoints didn’t always match US ones.

The documentation didn’t mention any of this.

We stopped trying to make it elegant.

We made it explicit.

Every API version pinned. Every identity mapping resolved manually. Every edge case documented in code, not in Confluence.

That’s the version that actually worked.
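"Explicit instead of elegant" looks roughly like this in code. The version strings, service names, and the specific Frankfurt quirk shown are illustrative placeholders modeled on the symptoms described above; the pattern is that every pin and every regional edge case lives in reviewed source, not in a wiki.

```python
# Every downstream API version pinned explicitly; no trusting defaults.
API_VERSIONS = {
    "identity": "2023-05-01",   # hypothetical pin
    "tickets": "v2",            # hypothetical pin
}

# Edge cases documented in code, not in Confluence. Example quirk:
# one region returns 404 where others return 403 on a permission failure.
REGION_QUIRKS = {
    "frankfurt": {"treat_404_as_403": True},
    "amsterdam": {"treat_404_as_403": False},
}

def normalize_status(region: str, status: int) -> int:
    """Map region-specific status codes onto one consistent contract."""
    if status == 404 and REGION_QUIRKS.get(region, {}).get("treat_404_as_403"):
        return 403
    return status
```

The point of putting quirks in code is that they are versioned, reviewed, and testable, so the next engineer hits an assertion instead of a mystery.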

Cost was not where we expected it

Finance escalated early.

“AI is going to explode our costs.”

It didn’t.

Most of the system doesn’t cost anything from an AI perspective.

The expensive part is not the model. It’s everything around it if you design it poorly.

We kept the architecture simple. Single App Service. No over-engineered microservices. Azure Key Vault, Storage, AI Foundry. No unnecessary layers.

The result was predictable cost. Low enough that nobody cared.

What finance actually hates is unpredictability. Not high numbers.

We gave them a ceiling instead of an estimate.

That changed the conversation.

Multi-tenant AI was rejected in the first review

We evaluated a shared deployment model.

It was efficient. Lower cost. Easier to manage.

It didn’t pass security.

Under EU constraints, especially with GDPR and internal risk committees, shared models create questions you cannot answer cleanly.

Where the data lives. Who has access. What happens during incident response.

We moved to one deployment per customer.

Separate Azure resources. Separate model instances. Separate identities.

It’s heavier operationally.

It’s the only model that didn’t trigger escalation.

The system prompt became a contract

We underestimated this part.

The initial prompt was simple. “Help the user.”

That produced inconsistent behavior.

Sometimes the AI exposed internal IDs. Sometimes it didn’t. Sometimes it confirmed destructive actions. Sometimes it didn’t.

We rewrote it as a strict behavioral contract.

What to show. What to hide. When to ask for confirmation. How to present data.

This wasn’t about improving quality. It was about reducing unpredictability.

If your AI behaves differently under the same conditions, it becomes untrustworthy very quickly.
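One way to treat the prompt as a contract is to generate it from a reviewed, structured spec rather than editing freeform text. The rules below are hypothetical examples along the four dimensions named above (what to show, what to hide, when to confirm, how to present data); the benefit is that changes to the contract are diffs in version control, not silent prompt edits.

```python
# Hypothetical behavioral contract; rules are examples, not the real policy.
CONTRACT = {
    "show": ["order status", "public ticket fields"],
    "hide": ["internal IDs", "raw API payloads", "system configuration"],
    "confirm_before": ["any destructive or irreversible action"],
    "presentation": "plain language, no technical identifiers",
}

def render_system_prompt(contract: dict) -> str:
    """Render the reviewed contract into the actual system prompt."""
    lines = ["You are a support agent. Follow these rules exactly."]
    lines.append("Show only: " + ", ".join(contract["show"]) + ".")
    lines.append("Never reveal: " + ", ".join(contract["hide"]) + ".")
    lines.append("Ask for explicit confirmation before: "
                 + ", ".join(contract["confirm_before"]) + ".")
    lines.append("Presentation: " + contract["presentation"] + ".")
    return "\n".join(lines)

print(render_system_prompt(CONTRACT))
```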

This pattern is not optional anymore

I’ve now seen enough implementations across industries to say this clearly.

If your agentic AI does not have:

  • a dedicated identity
  • a hard tool boundary
  • strict secret isolation
  • a curated knowledge base
  • and isolated deployment

it will not survive enterprise adoption.

It might pass a demo. It might even pass a pilot.

It won’t pass security, audit, or scale.

The part we are still watching

The system works. It’s in production. It’s stable.

But there is one thing we still monitor closely.

Not performance. Not cost. Not availability.

Behavior drift.

Over time, small changes in prompts, tools, or knowledge can create subtle shifts in how the AI makes decisions.

Nothing breaks immediately.

Until one day it does something slightly off.

And in an agentic system, “slightly off” is enough to create an incident.
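One way to watch for drift, sketched under assumptions: keep a fixed golden set of prompts with the decision each should produce, replay it on every change to prompts, tools, or knowledge, and flag any divergence. `run_agent` below is a stub standing in for the real agent loop, and the cases are invented.

```python
# Fixed golden set: prompt plus the decision the agent is expected to take.
GOLDEN_SET = [
    {"prompt": "cancel order A123", "expected_action": "ask_confirmation"},
    {"prompt": "what is my order status", "expected_action": "get_order_status"},
]

def run_agent(prompt: str) -> str:
    # Stubbed decision policy; the real system would run the model loop.
    return "ask_confirmation" if "cancel" in prompt else "get_order_status"

def detect_drift(cases) -> list:
    """Return the prompts whose decision no longer matches the golden set."""
    return [c["prompt"] for c in cases
            if run_agent(c["prompt"]) != c["expected_action"]]

print(detect_drift(GOLDEN_SET))  # empty list means no drift on the golden set
```

Run on every deploy, this turns "slightly off" from an incident into a failed check.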

Need help designing production-grade agentic AI?

I help enterprise teams design agentic AI architectures that survive security review, audit, and scale — with dedicated identities, tool boundaries, and governance built in from day one.

Book a free consultation →

Tags

#AgenticAI #EnterpriseAI #AISecurity #AzureOpenAI #ManagedIdentity #GDPR #RAG #AIGovernance #ProductionAI
