
AI Agents & RAG: Architectures for Enterprise Web Apps

This field guide details two proven patterns for enterprise AI assistants: a stateless RAG API with a dedicated vector service, and a stateful agentic workflow orchestrator. It covers core tooling, IaC deployments (canaries, WAF, observability), and hard-earned rules for scaling, cost control, and compliance, drawn from work with managed development teams and Arc.dev vetted developers.

January 18, 2026 · 4 min read · 773 words

AI Agents and RAG for Enterprise Web Apps: Reference Architectures, Tooling, and Traps

Enterprises want AI assistants that actually ship, scale, and comply. The fastest path combines robust retrieval-augmented generation (RAG) with pragmatic agent patterns, delivered via Infrastructure as code for web apps. Below is a field guide based on deployments I've led with managed development teams, Arc.dev vetted developers, and partners like slashdev.io who provide remote engineers and software agency expertise for businesses and startups.

Reference architecture 1: Stateless API and dedicated vector service

Ideal for customer support, knowledge search, and SEO content generation. The web app calls a stateless API that orchestrates: embedding service, vector database, model gateway, and a short-lived cache. Indexing runs as an asynchronous job with backpressure. Keep the API stateless to let autoscalers do their job; put session state in Redis or DynamoDB only if you truly need tool memory.

  • Core pieces: API (FastAPI/Express), embedding worker (batch and queue), vector DB (pgvector, Weaviate, or Milvus), reranker (cross-encoder), model gateway (OpenAI/Azure/Anthropic), observability (OpenTelemetry, Prometheus, Grafana), policy engine (OPA or Cedar).
  • Deployment: containerized, fronted by an API gateway and WAF; blue/green releases with canaries; regional replication if RTO/RPO matter.
  • Strengths: deterministic retrieval, easy cost control, clear scaling boundaries.
  • Tradeoffs: fewer tool calls; you must design prompts to cope with limited context.
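The request flow of the stateless pattern can be sketched in a few functions. This is a minimal illustration, not a production implementation: `Chunk`, `retrieve`, and `build_prompt` are hypothetical names, exact cosine top-k stands in for the vector database's ANN recall, and the model-gateway call is left out.

```python
import math
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    vector: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(index: list[Chunk], query_vec: list[float], k: int = 3) -> list[Chunk]:
    """ANN-recall stand-in: exact cosine top-k over an in-memory index."""
    return sorted(index, key=lambda c: cosine(c.vector, query_vec), reverse=True)[:k]

def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Assemble a grounded prompt; the actual model call goes through the gateway."""
    context = "\n---\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Because nothing here holds state between requests, the handler scales horizontally behind the autoscaler exactly as the architecture intends.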

Reference architecture 2: Agentic workflow orchestrator

When tasks span multiple tools (CRM, analytics, CMS), use a stateful graph. Orchestrate with LangGraph, Temporal, or Durable Functions. Each node plays one role: retrieval, planner, executor, verifier, or guardrail. Keep tool definitions declarative and versioned. Snapshot intermediate state to object storage for auditability and replay.

  • Hard rules: cap max tool depth; enforce structured function calling; add a verifier model or regex check before side effects.
  • Resilience: circuit breakers for flaky APIs; idempotent tool design; retries with exponential backoff; human-in-the-loop lanes for high-risk actions.
  • Latency: prefetch embeddings and warm prompt caches; use async fan-out for parallel retrieval.
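Two of the hard rules above, a tool-depth cap and retries with exponential backoff, can be sketched as follows. The `Agent` class and `run_with_retries` are illustrative names under assumed semantics, not an API of LangGraph or Temporal (both provide their own retry policies).

```python
import time

class ToolDepthExceeded(RuntimeError):
    pass

def run_with_retries(tool, *args, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry a flaky tool with exponential backoff; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return tool(*args)
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * (2 ** i))

class Agent:
    MAX_TOOL_DEPTH = 5  # hard cap: refuse to recurse past this many nested tool calls

    def __init__(self):
        self.depth = 0

    def call_tool(self, tool, *args):
        if self.depth >= self.MAX_TOOL_DEPTH:
            raise ToolDepthExceeded(f"tool depth cap {self.MAX_TOOL_DEPTH} reached")
        self.depth += 1
        try:
            return run_with_retries(tool, *args)
        finally:
            self.depth -= 1
```

The depth counter bounds runaway planner loops; the injected `sleep` makes the backoff testable without real delays.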

Infrastructure as code for web apps: what great looks like

Codify everything: networks, service meshes, queues, secrets, model endpoints, and vector stores. Use Terraform modules or Pulumi programs that expose opinionated defaults: private subnets, VPC endpoints, autoscaling policies, and egress firewalls. Helm charts pin runtime versions and liveness probes. Environments are ephemeral: every pull request can spin up a short-lived stack with synthetic data. Document everything as living runbooks.

  • Secrets: sealed secrets or SOPS with KMS; rotate model keys every 30 days; scope keys per environment.
  • Delivery: canary 5/25/100 with automated rollback on quality regression; use feature flags for prompt changes.
  • Drift: scheduled plan/apply with OPA checks; fail builds if public egress appears unexpectedly.
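The 5/25/100 canary with automated rollback reduces to a small control loop. This is a sketch under stated assumptions: `check_quality`, `promote`, and `rollback` are hypothetical hooks you would wire to your CD system and quality dashboards.

```python
STAGES = (5, 25, 100)  # percent of traffic at each canary stage

def run_canary(check_quality, promote, rollback, stages=STAGES):
    """Advance through canary stages; roll back on the first quality regression.

    check_quality(pct) -> bool, promote(pct), and rollback() are deployment
    hooks (illustrative names, not a real CD API).
    """
    for pct in stages:
        promote(pct)
        if not check_quality(pct):
            rollback()
            return False
    return True
```

The same gate works for prompt changes behind feature flags: treat a prompt version as the canary artifact and regressions in groundedness or latency as the rollback trigger.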

Data pipeline for retrieval quality

RAG is only as good as your documents. Chunk by semantic boundaries (headers, list items) with adaptive size; store hierarchical metadata for breadcrumb prompting. Create doc-type specific embedding configs: code, PDFs, tickets, and product specs require different tokenization and cleaning. Use a two-stage retrieval: ANN recall and cross-encoder rerank. Maintain an evaluation harness with golden questions and perturbations (typos, partial dates, acronyms).

  • Metrics: precision@k, coverage, groundedness variance, hallucination rate, answer latency, cost per answer, and freshness SLA.
  • Index hygiene: delta upserts, nightly compaction, orphan detector for deleted sources, and shadow indexes for safe re-embeds.
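Chunking by semantic boundaries with an adaptive size cap can be sketched as below. Assumptions are labeled in the comments: blank lines stand in for real boundary detection (headers, list items), and word counts stand in for tokens.

```python
def chunk_document(text: str, max_words: int = 200) -> list[str]:
    """Chunk by semantic boundaries, merging small units under a size cap.

    Blank-line splitting is a stand-in for real boundary detection
    (headers, list items); word counts approximate tokens.
    """
    units = [u.strip() for u in text.split("\n\n") if u.strip()]
    chunks, current, count = [], [], 0
    for unit in units:
        words = len(unit.split())
        # Flush the running chunk before it would exceed the cap.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(unit)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A production chunker would also attach hierarchical metadata (document, section, heading path) to each chunk for breadcrumb prompting, and tune the cap per document type.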

Tooling stack that scales

Reliable choices: OpenAI or Azure OpenAI for general models; Anthropic for long-context; local LLMs for private data when egress is restricted. For retrieval, start with pgvector if you live on Postgres and graduate to Weaviate, Pinecone, or Milvus when you need hybrid search or multi-tenant isolation. Re-ranking with Cohere, Jina, or open cross-encoders. Frameworks: LlamaIndex for indexing pipelines; LangChain for tool wiring; BentoML or vLLM for hosting custom models. Observability with Arize or WhyLabs for LLM traces and quality dashboards.

Security and governance first

Classify and redact PII at ingest; use policy-based masking in retrieval. Bind all services to private networks; restrict egress with domain allowlists. Encrypt vector stores with managed KMS. Enforce RBAC on collections and audit prompts and completions. For regulated workloads, run model inference in VPC-peered endpoints and maintain data residency.
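Redaction at ingest can be as simple as pattern substitution before documents reach the embedding pipeline. The patterns below are deliberately minimal illustrations; a real deployment would use a dedicated PII detection service with classification, not two regexes.

```python
import re

# Hypothetical patterns for illustration only; production systems need a
# proper PII detection/classification service.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Mask PII before text is embedded or indexed."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running redaction before indexing, rather than at query time, means the vector store never holds raw identifiers, which simplifies both encryption posture and audit scope.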

Pitfalls to avoid

  • Over-chunking: tiny chunks destroy context; prefer 300-800 tokens with overlap tuned per domain.
  • Underspecified schemas: tool functions without JSON schemas lead to brittle calls; validate and coerce.
  • Non-deterministic prompts in production: freeze prompts and templates; experiment behind flags.
  • Index staleness: tie re-embeds to source change events; watch freshness SLA, not calendar time.
  • Cost drift: log token usage per tenant; alert on spikes; add cache keys that include instruction prompts.
  • One-size-fits-all models: route by task (a small model for classification, a larger one for generation); apply distillation later.
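The "underspecified schemas" pitfall above is avoided by validating and coercing every tool-call argument before execution. This sketch uses a minimal type map as a stand-in for full JSON Schema validation (the `validate_args` name and schema shape are assumptions, not a library API).

```python
def validate_args(schema: dict, args: dict) -> dict:
    """Validate and coerce tool-call arguments against a minimal type schema.

    `schema` maps argument names to expected Python types; a stand-in for
    full JSON Schema validation of LLM function calls.
    """
    coerced = {}
    for name, expected in schema.items():
        if name not in args:
            raise ValueError(f"missing argument: {name}")
        value = args[name]
        if not isinstance(value, expected):
            try:
                value = expected(value)  # coerce, e.g. "3" -> 3
            except (TypeError, ValueError):
                raise ValueError(f"{name}: expected {expected.__name__}") from None
        coerced[name] = value
    return coerced
```

Coercion matters because models frequently emit numbers as strings; rejecting outright would turn a recoverable call into a brittle failure.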