Scaling AI-generated apps: performance, testing, and CI/CD
AI can write your first feature, but scaling it is engineering. If you ship an AI-powered learning platform builder, expect spiky traffic, non-deterministic outputs, and strict SLAs. Here's how to design for speed and safety while keeping iteration velocity high.
Architecture: Supabase vs custom backend with AI
Supabase shines for teams that need fast auth, row-level security, realtime, and Postgres + pgvector without plumbing. Choose it while product discovery is active and schemas evolve weekly. Go custom when you need GPU-aware scheduling, multi-region inference, or fine-grained per-tenant rate limiting and cost attribution. A pragmatic split: Supabase for auth, data, and triggers; a custom microservice for inference, prompt routing, and billing.

- Case: An enterprise course hub served 1M lessons/day by keeping users, courses, and progress in Supabase, while a Go service handled embeddings and cache warming.
- Guardrail: keep AI calls idempotent; retries should not duplicate writes.
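One way to keep retried AI calls from duplicating writes is a deterministic idempotency key. A minimal sketch, assuming an in-memory `store` stands in for your database and `IdempotentWriter` is a hypothetical name:

```python
import hashlib
import json


class IdempotentWriter:
    """Dedupe retried AI-result writes via a deterministic idempotency key."""

    def __init__(self):
        self.store = {}  # idempotency_key -> first stored result

    def key_for(self, tenant_id, payload):
        # Same tenant + same payload => same key, so a retry collapses
        # onto the original write instead of creating a duplicate row.
        raw = json.dumps({"tenant": tenant_id, "payload": payload}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def write(self, tenant_id, payload, result):
        key = self.key_for(tenant_id, payload)
        if key in self.store:
            return self.store[key]  # retry: return the first write untouched
        self.store[key] = result
        return result
```

In a real system the key check and insert would be a single `INSERT ... ON CONFLICT DO NOTHING` against a unique index, so concurrent retries race safely.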
Database builder with relationships
Model relationships explicitly to control cost and latency. For an AI course marketplace, define learners, courses, modules, enrollments, sessions, and llm_calls with foreign keys and ON DELETE rules. Precompute aggregates (completion_rate) via triggers, and store LLM outputs and evaluation scores separately for auditability.

- Indexes: (tenant_id, updated_at desc) for dashboards; GIN for JSONB metadata.
- Partition sessions by tenant_id to bound VACUUM and backup windows.
- Use soft deletes; hard deletes break analytics lineage.
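The relationship and aggregate ideas above can be sketched with SQLite standing in for Postgres (table and column names are illustrative): foreign keys with `ON DELETE` rules, a trigger that precomputes `completion_rate`, and a `deleted_at` column for soft deletes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE courses (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    completion_rate REAL NOT NULL DEFAULT 0.0  -- precomputed aggregate
);
CREATE TABLE enrollments (
    id INTEGER PRIMARY KEY,
    course_id INTEGER NOT NULL REFERENCES courses(id) ON DELETE CASCADE,
    completed INTEGER NOT NULL DEFAULT 0,
    deleted_at TEXT  -- soft delete: analytics keep their lineage
);
-- Recompute the aggregate whenever an enrollment changes,
-- counting only rows that are not soft-deleted.
CREATE TRIGGER recompute_completion AFTER UPDATE ON enrollments
BEGIN
    UPDATE courses SET completion_rate = (
        SELECT AVG(completed) FROM enrollments
        WHERE course_id = NEW.course_id AND deleted_at IS NULL
    ) WHERE id = NEW.course_id;
END;
""")
conn.execute("INSERT INTO courses (id, title) VALUES (1, 'Intro')")
conn.executemany(
    "INSERT INTO enrollments (course_id, completed) VALUES (?, ?)",
    [(1, 0), (1, 0)],
)
conn.execute("UPDATE enrollments SET completed = 1 WHERE id = 1")
rate = conn.execute(
    "SELECT completion_rate FROM courses WHERE id = 1"
).fetchone()[0]
# one of two live enrollments completed -> rate 0.5
```

In Postgres you would add the `(tenant_id, updated_at DESC)` and GIN indexes and partitioning on top; SQLite is only a convenient way to exercise the trigger logic locally.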
Performance levers
- Cache hierarchy: CDN for static course assets; edge KV for roster lookups; Redis for feature flags and embeddings; per-request local cache for prompts.
- Batch: group embedding writes or tool calls in chunks of 32; batching typically cuts p95 by 20-40%.
- Time budgets: enforce 300ms for DB, 200ms for cache, 1.5s for model; degrade gracefully to extractive search.
- Prompt profiles: small, medium, large; pick via policy, not ad-hoc string hacking.
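The batching lever is mostly a chunking problem. A minimal sketch (the batch size of 32 matches the lever above; tune it per backend):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

BATCH_SIZE = 32  # illustrative default; tune per backend


def batched(items: Iterable[T], size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield fixed-size batches so embedding writes or tool calls go out
    in groups rather than one round-trip each."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the partial tail
```

Usage would be along the lines of `for chunk in batched(vectors): client.upsert(chunk)`, where `client` is whatever embedding store you use.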
Testing AI behavior
- Golden set: 500 inputs with expected JSON schemas and quality labels; run nightly and on PR.
- Property tests: fuzz user input to assert invariants (no PII leak, valid schema, latency ceilings).
- Record/replay: capture model responses behind a flag to stabilize CI.
- Offline vs online: offline BLEU/ROUGE isn't business value; online metrics are enrollments, completions, and support deflection.
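The record/replay idea above can be sketched as a thin wrapper: behind an environment flag, live responses are recorded into a cassette keyed by prompt hash; in CI the cassette is the only source, so missing entries fail loudly. `ReplayClient` and `RECORD_MODEL_CALLS` are illustrative names, and `call_model` stands in for your real client.

```python
import hashlib
import os

# When the flag is unset (e.g. in CI), we replay only and never hit the model.
RECORD = os.environ.get("RECORD_MODEL_CALLS") == "1"


class ReplayClient:
    def __init__(self, call_model, cassette=None):
        self.call_model = call_model          # real model client (callable)
        self.cassette = cassette if cassette is not None else {}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if not RECORD:
            # Replay mode: a cassette miss raises KeyError, which is what
            # you want in CI -- an untracked prompt should fail the build.
            return self.cassette[key]
        response = self.call_model(prompt)
        self.cassette[key] = response         # record for later replay
        return response
```

Persisting the cassette as a checked-in JSON file keeps CI deterministic while letting a recording run refresh fixtures.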
CI/CD and governance
- Ephemeral envs: spin branch databases via Supabase branches or Docker; seed with synthetic tenants.
- Migrations: gate on lint + shadow-DB diff; fail any migration that drops columns without a backfill.
- Progressive delivery: canary at 5%, watch p95 latency, time-to-first-token (TTFT), and cost/request; auto-rollback on SLO breach.
- Version prompts and tools; ship via feature flags with kill switches.
- Observability: OpenTelemetry spans across DB, cache, model; tag with tenant and model ID.
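The canary auto-rollback decision reduces to a percentile check. A minimal sketch, assuming a nearest-rank p95 and illustrative thresholds (the 1.5s SLO echoes the model time budget above; the 25% regression factor is an assumption):

```python
import math


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty latency sample."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]


def should_rollback(canary_ms, baseline_ms, slo_ms=1500.0, regression=1.25):
    """Roll back if the canary breaches the absolute SLO or regresses
    more than 25% against the baseline's p95. Thresholds are illustrative."""
    c, b = p95(canary_ms), p95(baseline_ms)
    return c > slo_ms or c > b * regression
```

In practice the same gate would also watch TTFT and cost/request, and the rollback itself is just flipping the canary's traffic weight back to zero.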
Security quick hits
Use RLS everywhere, rotate keys, encrypt llm_calls, and isolate tenants in queues. Document data flows; auditors love diagrams and deny-by-default policies.
Measure cost per session and renegotiate model tiers quarterly.
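Cost per session is a rollup over the llm_calls table. A sketch under assumptions: per-1K-token prices are hypothetical placeholders (use your provider's current price sheet), and the input rows are `(session_id, model_tier, total_tokens)` tuples.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices by model tier -- NOT real vendor pricing.
PRICE_PER_1K = {"small": 0.0005, "large": 0.01}


def cost_per_session(llm_calls):
    """Sum model spend per session from (session_id, tier, tokens) rows."""
    totals = defaultdict(float)
    for session_id, tier, tokens in llm_calls:
        totals[session_id] += tokens / 1000 * PRICE_PER_1K[tier]
    return dict(totals)
```

Running this nightly over llm_calls gives the trend line you bring to those quarterly model-tier negotiations.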