
AI Engine Architecture & Cost Defense

The four-layer cache architecture, unified provider routing, and context compression that keep GASCOIN's AI stack economically bounded, fail-safe, and forensically auditable.

QUICK GUIDE

Purpose: Explain the caching, routing, and context compression that keep GASCOIN's AI spend bounded and auditable

Best for: Technical readers and anyone doing due diligence

Read time: 12 min

Next action: Understand the AI and fraud prevention behind GASCOIN

The short version

Most protocols that touch AI bleed money. Every submission fires 2–3 paid model calls, every retry duplicates work, every cold key triggers a thundering herd, and every outage takes the pipeline down with it. That is the default — and it is how small communities strangle themselves before they can grow.

GASCOIN is built the other way. Every AI call is routed, cached, deduplicated, and bounded by a four-layer architecture that turns paid model invocations into a predictable, auditable expense. Duplicate receipts are free. Worker retries are free. Outages fail into safety. Prompts are tunable from a dashboard without a redeploy. Monthly spend has a hard ceiling. And every single call is tagged, logged, and attributable to the feature that made it.

This is not aspirational. It is live in production, verified with real billable tokens, and rendered in the diagrams below.

The protocol moat

Three independent AI providers — Anthropic Claude (final oversight), xAI Grok (fraud reasoning), and Google Gemini (vision + receipt extraction) — run behind a unified infrastructure layer. Each is wrapped in caching and failover so that no single provider outage, rate limit, or price move can take the pipeline down or blow up the budget.

  • No vendor lock-in. Switching between anthropic/claude-sonnet-4.6, xai/grok-4.1-fast-reasoning, and google/gemini-3-flash is a one-string change.
  • No long-lived API keys. Authentication uses Vercel's short-lived OIDC tokens, auto-injected and auto-refreshed at deploy time. No keys to rotate. No keys to leak.
  • No surprise bills. Budget caps are enforced at the gateway, not in application code. When a cap trips, the pipeline degrades gracefully with HTTP 402 — it does not keep spending.
  • No silent failures. Every call is tagged feature:claude-oversight, feature:receipt-ocr, feature:grok-reasoning, etc., so spend and latency are attributable per feature in the Vercel dashboard.
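The one-string model switch and the fallback chains can be sketched as a small routing table. The helper names here (MODELS, FALLBACKS, pickModel) are illustrative, not GASCOIN's actual code; only the model ids and the Claude → Grok/Gemini fallback order come from this page.

```typescript
type Feature = "claude-oversight" | "grok-reasoning" | "receipt-ocr";

// Switching providers is a one-string change: edit the entry here.
const MODELS: Record<Feature, string> = {
  "claude-oversight": "anthropic/claude-sonnet-4.6",
  "grok-reasoning": "xai/grok-4.1-fast-reasoning",
  "receipt-ocr": "google/gemini-3-flash",
};

// Fallback chains: on outage or rate limit, try the next provider in order.
const FALLBACKS: Record<Feature, string[]> = {
  "claude-oversight": [MODELS["claude-oversight"], MODELS["grok-reasoning"], MODELS["receipt-ocr"]],
  "grok-reasoning": [MODELS["grok-reasoning"], MODELS["claude-oversight"]],
  "receipt-ocr": [MODELS["receipt-ocr"]],
};

export function pickModel(feature: Feature, failed: Set<string> = new Set()): string | null {
  for (const model of FALLBACKS[feature]) {
    if (!failed.has(model)) return model;
  }
  return null; // every provider in the chain is down: degrade, don't spend
}
```

In practice the gateway performs this failover itself; the sketch only shows why a provider swap never touches application logic.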

The four-layer cache architecture

This is the centerpiece. Four independent caches, each doing a different job, each invalidated independently, stacked so that the average AI call on GASCOIN is partly or fully cached at two or three layers simultaneously.

           ┌──────────────────────────────────────────────────┐
           │   Next.js Route Handler                           │
           │   (Vercel Fluid Compute — elastic concurrency)    │
           │   One warm instance serves dozens of concurrent   │
           │   requests, so single-flight coalescing actually  │
           │   deduplicates at runtime — not in theory.        │
           └────────────────┬───────────────────────────────────┘
                            │
           ┌────────────────┴───────────────────┐
           │                                    │
      ┌────▼──────┐                  ┌──────────▼─────────┐
      │  Upstash  │                  │  Vercel AI Gateway  │◄── LAYER 1
      │  Redis    │                  │  caching: 'auto'    │    Provider-native
      │           │                  └──┬──┬──┬───┬────────┘    prompt caching:
      │ cacheGet  │                     │  │  │   │             · Claude (90% off)
      │ OrFetch   │                     ▼  ▼  ▼   ▼             · Gemini (75% off)
      │ (SHA-256  │               Claude Grok Gemini  …          · Grok   (50-75%)
      │  match    │                  ┌──────────────────────┐
      │  + single-│                  │  Vercel Data Cache   │◄── LAYER 3
      │  flight   │                  │  (KB rules, LB,      │    Next.js
      │  coalesce)│                  │   mem0 profiles)     │    unstable_cache
      │           │                  └──────────────────────┘    + revalidateTag
      │ LAYER 2   │                  ┌──────────────────────┐
      └───────────┘                  │  Vercel Edge Config  │◄── LAYER 4
                                     │  (prompt prefixes)   │    Sub-ms reads
                                     │  live-editable from  │    from the edge,
                                     │  the dashboard, no   │    no redeploy,
                                     │  redeploy needed.    │    ~1s propagation
                                     └──────────────────────┘

Layer 1 — Provider-native prompt caching (via Vercel AI Gateway)

  • What it does: Uses the cache feature built into each AI provider to avoid paying for the same tokens twice. The rulebook, tier table, fraud taxonomy, and decision matrix are sent as a stable system prompt and marked as cache breakpoints.
  • Cached input cost: as low as 10% of list price on Anthropic Claude (90% off), 25% on Google Gemini (75% off), and 25–50% on xAI Grok (50–75% off).
  • Why GASCOIN can actually use it: Every provider requires the cached prefix to be >1024 tokens. GASCOIN's system prompts are deliberately bulked past that threshold so the discount engages on call one, not call ten. Variable data (the specific claim under review) stays tiny in the user message, keeping cache hit rate near 100%.
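As a sketch of the mechanics, here is how a cache breakpoint is typically marked in an Anthropic-style request body. The cache_control shape is Anthropic's documented prompt-caching marker; the RULEBOOK placeholder and buildReviewRequest helper are illustrative, not the production code.

```typescript
// Stands in for the bulked, stable system prompt (>1024 tokens in production).
const RULEBOOK = "stable fraud rulebook, tier table, taxonomy, decision matrix ...";

function buildReviewRequest(claimSummary: string) {
  return {
    model: "anthropic/claude-sonnet-4.6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: RULEBOOK,
        // Everything up to and including this block is cached provider-side;
        // only the tiny user message below varies per claim.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: claimSummary }],
  };
}
```

Because the prefix is byte-identical on every call, the discount engages on the second request and stays engaged.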

Layer 2 — Upstash Redis exact-match dedup with single-flight coalescing

  • What it does: Deduplicates entire AI calls by content hash. Two users uploading the same receipt image never invoke Gemini twice. A worker that re-runs a claim already reviewed by Claude gets the cached verdict instantly.
  • The adaptive piece — single-flight coalescing: when N concurrent requests hit a cold cache key for the same input, exactly one origin call fires and the other N−1 callers await that single promise. Classic thundering-herd prevention, but built on an in-memory promise map that only works because of Fluid Compute (see below).
  • Where it applies:
    • receipt:extract:{sha256} — 7-day TTL. Identical receipt images never re-invoke Gemini Vision.
    • claude:review:{claimId} — 1-hour TTL. Worker retries on the same claim return the cached verdict.
    • tweetquality:{tweet_id}:{impression_bucket} — 1-hour TTL. Tweet scoring only re-runs when engagement crosses a 1,000-impression threshold.
    • mem0:profile:{wallet} — 15-minute TTL. Entity profile reads coalesce across pipelines.
  • Atomic guarantees: the rate-limit sibling (@upstash/ratelimit) uses a Lua-backed atomic INCR+EXPIRE. No race windows, no double-spend on rate-limit budgets.
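The single-flight pattern above can be shown in a few lines. This is a toy version: the names cacheGetOrFetch and the pending-promise map mirror this page, but the Map-backed "Redis" is a stand-in for Upstash and TTLs are omitted.

```typescript
const redis = new Map<string, string>();                    // stand-in for Upstash Redis
const pendingFetches = new Map<string, Promise<string>>();  // in-memory, per warm instance

async function cacheGetOrFetch(key: string, fetcher: () => Promise<string>): Promise<string> {
  const hit = redis.get(key);
  if (hit !== undefined) return hit;       // warm cache: the call is free

  const inFlight = pendingFetches.get(key);
  if (inFlight) return inFlight;           // coalesce: await the one origin call

  const p = fetcher()
    .then((value) => { redis.set(key, value); return value; })
    .finally(() => pendingFetches.delete(key));
  pendingFetches.set(key, p);
  return p;
}
```

With N concurrent callers on a cold key, exactly one fetcher fires and the other N−1 await its promise, which is only meaningful because Fluid Compute routes those callers through the same instance (see below).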

Layer 3 — Vercel Data Cache (framework outputs)

  • What it does: Caches computed non-AI outputs at the framework level, separate from Upstash. Things like the synthesized knowledge-base context handed to Claude, the leaderboard, and distilled mem0 entity profiles.
  • How it works: Next.js unstable_cache persists function results in a Vercel-managed, edge-replicated cache with tag-based invalidation (revalidateTag('kb:context')). When an operator edits the knowledge base, every edge serving that tag drops the cached version and re-computes on the next read — no deploy, no manual cache bust.
  • Why two cache layers: Upstash is for application state, rate limits, and AI dedup. Vercel Data Cache is for framework outputs and tag-based workflows. They complement each other — one is authoritative, one is edge-local and framework-integrated.
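The tag-based invalidation described above can be illustrated with a toy in-memory version. Production uses Next.js unstable_cache and revalidateTag; the Map-backed cache and the revalidateTagToy name here only show the idea.

```typescript
type Entry = { value: string; tags: string[] };
const dataCache = new Map<string, Entry>();

function cachedCompute(key: string, tags: string[], compute: () => string): string {
  const hit = dataCache.get(key);
  if (hit) return hit.value;               // cached framework output
  const value = compute();
  dataCache.set(key, { value, tags });
  return value;
}

// Dropping every entry carrying the tag forces a re-compute on the next read,
// the same effect revalidateTag('kb:context') has across Vercel's edges.
function revalidateTagToy(tag: string): void {
  for (const [key, entry] of dataCache) {
    if (entry.tags.includes(tag)) dataCache.delete(key);
  }
}
```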

Layer 4 — Vercel Edge Config (live-editable prompts)

  • What it does: Stores the stable prompt prefixes (oversight rulebook, fraud reasoning rules, receipt extraction schema) in an edge-replicated key-value store with sub-millisecond reads from every region.
  • Why it is a big deal: operators can tune prompts from the Vercel dashboard without a code push. Edits propagate across all edges in roughly one second. Bundled defaults ship with every deploy, so nothing breaks if Edge Config is unavailable — the fallback is baked into the binary.
  • Maximizes Layer 1: because every call reads the exact same byte-identical prefix, the provider-side prompt cache hit rate stays pinned near 100%. The two layers compound.
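The deploy-safe fallback can be sketched as a small read path: try Edge Config, fall back to the bundled default. The reader function is injected here so the sketch is self-contained; production would read through the @vercel/edge-config client, and the readPrompt helper and default text are illustrative.

```typescript
// Bundled defaults ship inside every deploy, so a missing or unreachable
// Edge Config never breaks the pipeline.
const BUNDLED_DEFAULTS: Record<string, string> = {
  prompts_claude_oversight_system: "bundled oversight rulebook (deploy-safe default)",
};

async function readPrompt(
  key: string,
  edgeConfigGet: (k: string) => Promise<string | undefined>,
): Promise<string> {
  try {
    const live = await edgeConfigGet(key);
    if (live) return live;   // live-edited value, ~1s propagation
  } catch {
    // Edge Config unreachable: fail into safety with the bundled default
  }
  return BUNDLED_DEFAULTS[key];
}
```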

Fluid Compute — why the caches are real, not theoretical

Classic serverless gives every request its own isolate. That is fine for stateless work, but it is fatal for in-memory coalescing: the promise map that deduplicates concurrent cold-key misses can only work if requests share an instance. On classic serverless they do not.

GASCOIN runs on Vercel's Fluid Compute runtime, which warms a single instance and routes many concurrent requests through it — dozens simultaneously, with peaks above 250 per instance. That single fact turns three of our adaptive implementations from theoretical optimizations into measurable wins:

  • Single-flight coalescing works at runtime. When 50 concurrent requests hit a cold receipt hash, exactly one Gemini call fires. On classic serverless, 50 would fire.
  • Prompt cache warmth compounds. Repeated Gateway calls from the same warm instance reuse the provider-side prompt cache, maximizing hit rate.
  • p99 latency drops. No cold start penalty between related submissions, workers, or cron ticks.

Elastic concurrency is confirmed enabled on the platform project (defaultResourceConfig.fluid=true, elasticConcurrencyEnabled=true).

mem0 — context compression, not just memory

mem0 stores per-wallet trust trajectory and cross-pipeline signals as distilled summaries. This is the part of the stack that most teams underuse.

After every claim decision, a compressed fingerprint is written back to mem0 — verdict, risk bucket, tier, claim count, flags, one-line narrative. On the next claim for the same wallet, Claude receives that ~50-token summary instead of replaying the raw history from scratch. The stable system prompt stays cached while only the tiny summary varies, so the cache hit rate stays pinned near 100% and the variable input token count stays near zero.

This compounds with Layer 1. The Anthropic prompt cache does not care that the fraud rulebook is 5,000 tokens — it bills a fraction for it as long as the prefix is byte-identical. When the variable tail is 50 tokens instead of 500, the total bill for a claim review drops by roughly an order of magnitude.
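The distilled fingerprint can be sketched as follows. The field names (verdict, risk bucket, tier, claim count, flags, one-line narrative) come from this page; the DistilledProfile shape and renderProfile helper are illustrative and the production schema may differ.

```typescript
interface DistilledProfile {
  verdict: "approve" | "reject" | "flag";
  riskBucket: "low" | "medium" | "high";
  tier: number;
  claimCount: number;
  flags: string[];
  narrative: string; // one line
}

// Renders the ~50-token summary Claude receives instead of raw history.
// Only this tiny string varies per wallet; the cached system prompt does not.
function renderProfile(p: DistilledProfile): string {
  return `verdict=${p.verdict} risk=${p.riskBucket} tier=${p.tier} ` +
         `claims=${p.claimCount} flags=[${p.flags.join(",")}] note="${p.narrative}"`;
}
```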

The full adaptive implementation stack

Every item below is live in production, verified end-to-end with real tokens billed.

  ┌─────────────────────────────────────────────────────────────────────┐
  │  ROUTING                                                            │
  │  · Vercel AI Gateway — unified endpoint, auto-failover, budgets     │
  │  · OIDC auth — zero long-lived keys, auto-refreshed on deploy       │
  │  · Per-call tags — cost attribution per feature in dashboard        │
  │  · Fallback chains — e.g. Claude → Grok or Gemini on outage         │
  ├─────────────────────────────────────────────────────────────────────┤
  │  PROVIDER-NATIVE PROMPT CACHING (Layer 1)                           │
  │  · Anthropic cache_control on bulked system prompt   (90% off)      │
  │  · Google Gemini implicit + explicit cache            (75% off)     │
  │  · xAI Grok automatic cache on prefixes >1024 tok    (50-75% off)   │
  ├─────────────────────────────────────────────────────────────────────┤
  │  UPSTASH REDIS + SINGLE-FLIGHT COALESCING (Layer 2)                 │
  │  · receipt:extract:{sha256}            7d   — identical uploads     │
  │  · claude:review:{claimId}             1h   — worker retries        │
  │  · tweetquality:{tweetId}:{bucket}     1h   — metric stability      │
  │  · mem0:profile:{wallet}              15m   — entity profile reads  │
  │  · @upstash/ratelimit Lua-atomic           — no race windows        │
  ├─────────────────────────────────────────────────────────────────────┤
  │  VERCEL DATA CACHE (Layer 3)                                        │
  │  · unstable_cache(buildClaudeKBContext)  tag: kb:context            │
  │  · unstable_cache(leaderboard)           tag: leaderboard           │
  │  · revalidateTag() on admin edits — instant edge-wide invalidation  │
  ├─────────────────────────────────────────────────────────────────────┤
  │  VERCEL EDGE CONFIG (Layer 4)                                       │
  │  · prompts_claude_oversight_system       5699 chars, live-editable  │
  │  · prompts_grok_fraud_system             2965 chars, live-editable  │
  │  · prompts_gemini_receipt_system         2694 chars, live-editable  │
  │  · Bundled defaults fallback — deploy-safe                          │
  ├─────────────────────────────────────────────────────────────────────┤
  │  MEM0 CONTEXT COMPRESSION                                           │
  │  · writeDistilledProfile() after every claim decision               │
  │  · ~50-token summary replaces raw history on next review            │
  │  · Compounds with Layer 1 cache — stable prefix stays warm          │
  ├─────────────────────────────────────────────────────────────────────┤
  │  FLUID COMPUTE RUNTIME                                              │
  │  · Elastic concurrency — one warm instance, many requests           │
  │  · In-memory pendingFetches promise map — thundering herd gone      │
  │  · p99 latency stable across cron + user bursts                     │
  ├─────────────────────────────────────────────────────────────────────┤
  │  GOVERNANCE                                                         │
  │  · Hard monthly budget cap (gateway-enforced, HTTP 402 on breach)   │
  │  · Per-user and per-feature rate limits                             │
  │  · Audit logging — every call with provider, tokens, latency, tags  │
  │  · Verified-commit gate on merges to main                           │
  └─────────────────────────────────────────────────────────────────────┘

What this means in plain terms

  • No single point of failure. Three providers. Automatic failover. Graceful degradation at every layer.
  • Duplicate receipts are free. SHA-256 dedup means identical images never reach a paid API.
  • Worker retries are free. Claim review is cached by ID for an hour — a worker re-run costs zero tokens.
  • Live prompt tuning. Rulebooks are edited from a dashboard, changes live in seconds, deploy-safe fallbacks baked in.
  • Hard cost ceiling. Budget caps enforced at the gateway, not in app code. The worst-case monthly spend is bounded by a number the operator sets, not by user volume.
  • Full audit trail. Every AI call is logged with tags, token counts, provider, latency, user, and failover chain.
  • Zero long-lived API keys. OIDC authentication rotates automatically. Nothing to leak, nothing to rotate, nothing to revoke.

Safety posture — the system fails into safety

Every layer has an escape hatch. Edge Config unreachable? The bundled default prompt ships with every deploy. Gateway down? Graceful degradation to conservative defaults. mem0 unavailable? Claude reviews without compressed history. Upstash degraded? The origin call fires directly. At each boundary the pipeline prefers caution over permissiveness — a disabled AI layer never auto-approves a claim it could not evaluate.

No AI outage can block a legitimate payout. No cache outage can compromise fraud detection. No provider price move can surprise the treasury. The system is designed to fail into safety.
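The fail-into-safety boundary amounts to a wrapper like the one below. The names (Verdict, withSafetyFallback) and the escalation value are illustrative; the principle, never approve what could not be evaluated, is the one stated above.

```typescript
type Verdict = "approve" | "reject" | "needs_human_review";

async function withSafetyFallback(aiReview: () => Promise<Verdict>): Promise<Verdict> {
  try {
    return await aiReview();
  } catch {
    // Provider outage, budget cap (HTTP 402), or timeout: a disabled AI
    // layer never auto-approves, it escalates conservatively instead.
    return "needs_human_review";
  }
}
```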

End-to-end flow

The flow is split across a synchronous submit path that runs inside the user's request, and an asynchronous worker path that runs on a cron boundary. Claude lives entirely on the async side — it is the final reviewer, not a parallel column.

  REQUEST                                                PAYOUT
     │                                                      ▲
     ▼                                                      │
  [ SYNCHRONOUS SUBMIT PATH ]                [ ASYNC WORKER PATH ]
     │                                                      │
     ├─► X API v2 ───────────┐                              │
     ├─► Solana holdings ────┤                              │
     ├─► Gemini Vision ──────┤                              │
     │        │              ▼                              │
     │        ▼       ┌──────────────┐                      │
     │    fraud.ts ──►│  13  GATES   │──► DB + audit ───────┤
     │        │       └──────────────┘                      │
     │        ▼                                             │
     │    Grok Reasoning (conditional)                      │
     │    (cross-signal, high-risk only)                    │
     │                                                      │
     │        ┌────────────────────────────┐                │
     │        │   CLAUDE OVERSIGHT         │◄── KB context  │
     │        │   (FINAL reviewer of       │◄── mem0 profile│
     │        │    Gemini + Grok + gates   │                │
     │        │    + mem0 + KB rulebook)   │───► mem0 write │
     │        └──────────┬─────────────────┘  (distilled)   │
     │                   │                                  │
     │                   ▼                                  │
     │        ┌────────────────────────────┐                │
     │        │   PAYOUT WORKER            │                │
     │        │   · re-verify X API        │                │
     │        │   · check mem0 ring flags  │                │
     │        │   · sendSolPayout (Helius) │                │
     │        └──────────┬─────────────────┘                │
     │                   │                                  │
     └───────────────────┴──────────────────────────────────┘

  Cached (transparently) at every stage:
    · Gemini receipt extraction  by SHA-256 hash   (7 days)
    · Claude oversight verdict   by claim ID       (1 hour)
    · mem0 entity profile        by wallet         (15 minutes)
    · KB context                 by signal shape   (tag-revalidated)

  Graceful degradation at every stage:
    · Strong fraud signal → short-circuit REJECT
    · Graceful fallback   → preserve deterministic gate verdict
    · Budget cap hit      → HTTP 402 → conservative defaults
    · Provider outage     → AI Gateway failover to next provider

Every submission fans out in parallel to the cheap signals (X API v2, Solana holdings) and the expensive signal (Gemini Vision receipt extraction). The fraud module conditionally escalates to Grok for cross-signal reasoning when the initial signals look suspicious. The 13-gate policy engine consumes everything, persists the decision, and returns the user their response. Auto-approved claims then cross an async boundary into the cron worker, where Claude acts as the final reviewer — receiving the Gemini output, the Grok verdict (if it ran), the gate results, the mem0 entity profile, and the knowledge base rulebook — before any SOL is dispatched. The payout worker does a final X API re-verify and a mem0 ring-flag check, then executes the on-chain transfer. mem0 is the horizontal bus that lets the referral and engagement workers influence the payout decision without being coupled to the submission path.

Verifiable. Auditable. Economically bounded. This is what it looks like when a protocol treats AI as infrastructure instead of a line item.

What is live right now

  • Vercel AI Gateway routing — verified end-to-end with real billable calls to Claude, Grok, and Gemini.
  • Fluid Compute with elastic concurrency — confirmed enabled on the project via the Vercel API.
  • Upstash Redis with @upstash/redis SDK, Lua-atomic rate limits, and single-flight coalescing.
  • Vercel Data Cache — wrapping the knowledge-base context builder and mem0 profile reads, with tag-based invalidation wired to admin edits.
  • Vercel Edge Config — three prompt prefixes live and readable, editable from the dashboard, with deploy-safe bundled fallbacks.
  • mem0 context compression — distilled wallet summaries written after every claim decision.
  • OIDC authentication — zero long-lived API keys in the codebase; auto-refreshed by the platform.
  • Verified-commit gate on every merge to main.
  • 258 unit tests passing continuously.

This is the kind of infrastructure most projects talk about building someday. GASCOIN is shipping it now, in production, with a hard cost ceiling and a full audit trail.