Executive Summary
LLM04: Model Denial of Service (DoS), as numbered in the 2023 OWASP Top 10 for LLM Applications, targets the costly, resource-intensive nature of large language models. Attackers craft prompts (or automated floods of prompts) that force the model into heavy computation (e.g., extensive code-execution reasoning, huge tool calls, long multi-turn chains, or arbitrarily large outputs) until the service slows, the bill spikes, and legitimate users are locked out.
This guide explains attacker tactics, detection signals, and concrete defenses so you can harden your AI apps before attackers use your GPU cluster to burn a hole in your cloud budget.
Threat Model — What “LLM04” Looks Like in the Wild
Attacker goals
-
Exhaust GPU/CPU/RAM, saturate concurrency pools, or drain budgeted quotas
-
Trigger autoscaling to rack up cloud costs
-
Degrade UX (timeouts/latency) → force SLA violations
-
Create cover for other attacks (e.g., data exfiltration while ops teams firefight outages)
Common DoS patterns
-
Prompt bombs: “Write a 100,000-word novel with citations and ASCII art,” “enumerate all primes up to 10^10 with proofs,” “simulate N-body physics,” etc.
-
Recursive chains: Prompts that induce the agent to call tools repeatedly, or to self-reflect for dozens of iterations.
-
Attachment abuse: Oversized files or gigantic JSON/CSV tables meant to spike tokenization and parsing costs.
-
Function/tool abuse: Forcing external calls to slow APIs, headless browsers, or vector DB scans with huge top-k.
-
Concurrency floods: Thousands of small sessions from rotating IPs/agents (bots), each forcing costly reasoning.
-
Output inflation: “Print the entire Linux kernel in Markdown,” “expand each bullet into 500 sub-bullets.”
Indicators & Telemetry to Watch
-
Token metrics: Sudden jumps in input/output token counts; long-tail outliers
-
Latency spikes: P95/P99 response times climbing in step with token usage
-
Tool invocation storms: Repeated external calls per request; recursive agent loops
-
Abnormal retry/timeout rates: 408/429/5xx bursts from model or gateways
-
Cost anomalies: Hourly spend > baseline + threshold; runaway autoscaling events
-
User patterns: New accounts/IPs hammering long prompts; disposable emails; Tor/VPN clusters
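The cost-anomaly signal above ("hourly spend > baseline + threshold") can be sketched as a check against recent history. This is a minimal illustration, not a production detector: the function name `is_spend_anomaly`, the factor `k`, and the `min_margin` floor are assumptions made for the example.

```python
from statistics import mean, pstdev

def is_spend_anomaly(history, current, k=3.0, min_margin=1.0):
    """Flag `current` hourly spend when it exceeds baseline + threshold.

    baseline  = mean of the recent hourly spend values in `history`
    threshold = k standard deviations, but never less than `min_margin`
                (hypothetical floor so a very flat history is not over-sensitive)
    """
    if not history:
        return False  # no baseline yet, so nothing can be judged
    baseline = mean(history)
    threshold = max(k * pstdev(history), min_margin)
    return current > baseline + threshold
```

The same shape works for token counts or tool calls per request; only the units of `history` change.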
CyberDudeBivash Blue-Team Playbook
1) Upfront Guardrails (Pre-prompt Gate)
-
Max input size (tokens & file bytes). Reject or chunk.
-
Max output size with server-side hard cap + graceful truncation.
-
Content & intent filters to detect compute-inflating instructions (e.g., “generate 1M lines”).
-
Complexity scoring: Estimate cost before sending to the LLM (tokens × tools × recursion risk). Deny or route to low-cost path.
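A pre-prompt gate along these lines might look like the sketch below. Everything in it is illustrative: the regex patterns, the limits, the ~4-characters-per-token heuristic, and the names `gate`, `Verdict`, and `estimate_tokens` are assumptions for the example, not a recommended rule set. The score follows the tokens × tools × recursion idea from the list above.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for compute-inflating instructions; tune to real traffic.
INFLATION_PATTERNS = [
    re.compile(r"\b\d{1,3}(,\d{3})+[- ]?(words?|lines?|tokens?)\b", re.I),  # "100,000-word"
    re.compile(r"\b\d+\s*[mk]\s+(words?|lines?|tokens?)\b", re.I),          # "1M lines"
    re.compile(r"\brepeat until\b|\bevery possible\b", re.I),
]

MAX_INPUT_TOKENS = 8_000   # assumed cap
MAX_COST_SCORE = 50_000    # assumed cap

def estimate_tokens(text):
    # Rough heuristic (~4 characters per token), not a real tokenizer.
    return max(1, len(text) // 4)

@dataclass
class Verdict:
    allowed: bool
    reason: str
    score: int

def gate(prompt, tools_requested=0, recursion_depth=0):
    """Cheap check that runs before the prompt reaches the expensive model."""
    tokens = estimate_tokens(prompt)
    if tokens > MAX_INPUT_TOKENS:
        return Verdict(False, "input too large", tokens)
    if any(p.search(prompt) for p in INFLATION_PATTERNS):
        return Verdict(False, "compute-inflating instruction", tokens)
    # tokens x tools x recursion risk; the +1 keeps a zero factor from zeroing the score
    score = tokens * (1 + tools_requested) * (1 + recursion_depth)
    if score > MAX_COST_SCORE:
        return Verdict(False, "estimated cost too high", score)
    return Verdict(True, "ok", score)
```

A rejected request does not have to be dropped; the `reason` field can drive routing to a low-cost path instead.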
2) Rate, Budget & Concurrency Limits
-
Per-user / per-org QPS, burst and token budgets (daily/rolling).
-
Concurrency pools segmented by plan tier (free, pro, internal).
-
Progressive throttling: Soft limit → CAPTCHA / email verify → hard block.
-
Egress/tool limits: Cap vector search top-k, web fetch count, headless browser time.
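As a rough sketch of the per-user QPS/burst limit and the rolling token budget, the two classes below use an in-memory token bucket and a sliding window. The class names and the injectable `clock` parameter are assumptions for illustration; a real deployment would likely keep this state in a shared store rather than in process memory.

```python
import time

class TokenBucket:
    """Per-user request limiter: `rate` requests/second with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.level = float(burst)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.level = min(self.burst, self.level + (now - self.updated) * self.rate)
        self.updated = now
        if self.level >= cost:
            self.level -= cost
            return True
        return False

class TokenBudget:
    """Rolling LLM-token budget (e.g. 50k per 24 h) for one user or org."""

    def __init__(self, limit, window_s=86_400, clock=time.monotonic):
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.events = []  # (timestamp, tokens) pairs still inside the window

    def try_spend(self, tokens):
        cutoff = self.clock() - self.window_s
        self.events = [(t, n) for t, n in self.events if t > cutoff]
        if sum(n for _, n in self.events) + tokens > self.limit:
            return False
        self.events.append((self.clock(), tokens))
        return True
```

Progressive throttling can sit on top of this: a first `False` triggers a soft challenge, repeated ones a hard block.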
3) Policy-Aware Routing
-
Heavy prompts → slower/cheaper models (distilled or smaller context).
-
Guard stage → cache: Serve common heavy queries from retrieval or a cache instead of the model.
-
Async workflows: For valid but heavy jobs, queue and notify on completion.
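The routing rules above can be sketched as a single decision function. The route names, the thresholds `heavy` and `very_heavy`, and the assumption that a cost score is already available from a pre-prompt gate are all hypothetical choices for this example.

```python
from enum import Enum

class Route(Enum):
    CACHE = "cache"
    SMALL_MODEL = "small_model"
    ASYNC_QUEUE = "async_queue"
    PRIMARY_MODEL = "primary_model"

def choose_route(prompt, cost_score, cache, heavy=20_000, very_heavy=200_000):
    """Pick the cheapest path that can still serve the request."""
    key = " ".join(prompt.lower().split())  # normalise case and whitespace
    if key in cache:
        return Route.CACHE          # answer already known: no model call at all
    if cost_score >= very_heavy:
        return Route.ASYNC_QUEUE    # valid but heavy: queue and notify on completion
    if cost_score >= heavy:
        return Route.SMALL_MODEL    # heavy: cheaper or distilled model
    return Route.PRIMARY_MODEL
```

Checking the cache first matters: a cached answer costs nothing regardless of how heavy the prompt would otherwise be.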
4) Loop & Tool-Use Control
-
Max tool calls per request/session (e.g., ≤3)
-
Cycle breaker: Detect repeated reasoning loops; force summary + stop.
-
Timeout ceilings: Hard cutoffs for each external tool and the overall chain.
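One way to enforce these limits is a small guard object that the agent loop consults before every tool call. The sketch below covers the call cap, a simple cycle breaker (identical tool plus arguments seen too often), and the overall deadline; per-tool timeouts are assumed to be set on the tool clients themselves. `ToolGuard` and the exception names are invented for this example.

```python
import time

class ToolBudgetExceeded(Exception):
    pass

class LoopDetected(Exception):
    pass

class ChainTimeout(Exception):
    pass

class ToolGuard:
    def __init__(self, max_calls=3, max_repeats=1, overall_timeout_s=30.0,
                 clock=time.monotonic):
        self.max_calls = max_calls
        self.max_repeats = max_repeats
        self.clock = clock
        self.deadline = clock() + overall_timeout_s
        self.calls = 0
        self.seen = {}  # (tool, canonical args) -> times requested

    def check(self, tool_name, args):
        """Call before each tool invocation; raises if the call must be refused."""
        if self.clock() > self.deadline:
            raise ChainTimeout("overall request deadline exceeded")
        if self.calls >= self.max_calls:
            raise ToolBudgetExceeded(f"more than {self.max_calls} tool calls")
        key = (tool_name, repr(sorted(args.items())))
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            raise LoopDetected(f"repeated call to {tool_name} with identical arguments")
        self.calls += 1
```

When any of the three exceptions fires, the agent loop would stop calling tools and return a summary of what it has so far.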
5) Abuse & Fraud Controls
-
Account reputation (age, payment verification, usage history)
-
Device/IP intelligence (Tor exit nodes, data-center IPs, velocity checks)
-
Honeypot prompts: Canary tasks that only abusers trigger → instant flag.
6) Observability & SRE
-
Dashboards: tokens, cost/min, tool calls, queue depth, autoscale events
-
Budget guards: Real-time spend alerts + automatic traffic shedding
-
Game days: Simulate prompt-bombs and bot floods; verify graceful degradation
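The budget guard with automatic traffic shedding could be approximated as follows: no shedding below a soft spend limit, full shedding of low-priority traffic at a hard limit, and a linear ramp in between. The function names, the linear ramp, and the hash-based selection are assumptions for this sketch, not a prescribed policy.

```python
import zlib

def shed_fraction(spend_per_min, soft_limit, hard_limit):
    """Fraction of low-priority traffic to reject at the current spend rate."""
    if spend_per_min <= soft_limit:
        return 0.0
    if spend_per_min >= hard_limit:
        return 1.0
    return (spend_per_min - soft_limit) / (hard_limit - soft_limit)

def should_shed(request_id, fraction):
    """Deterministic per-request decision, so retries of one request agree."""
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10_000
    return bucket < fraction * 10_000
```

Hashing the request ID rather than drawing a random number keeps the decision stable across retries, which avoids rewarding clients that simply hammer the endpoint until one attempt gets through.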
Reference Guardrail Settings (Pragmatic Defaults)
-
Max input tokens: 4–8k (tiered); Max output tokens: 512–1,024 (tiered)
-
Max files: 3 per request; Max file size: 5–10 MB each
-
Agent recursion: depth ≤ 2; tool calls ≤ 3 per turn
-
Top-k retrieval: 8–16; max web fetches: 2–3 per request
-
Per-user daily token budget: plan-based (e.g., 50k/200k/1M)
-
Concurrency: 1–3 per user; queue the rest with ETA
-
Timeouts: 8–15 s per tool; 25–45 s overall request cap
(Tune to your latency/cost envelope.)
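These defaults are easier to enforce when they live in one typed config object. The sketch below encodes the ranges above; the mapping of specific values to the tiers "free", "pro", and "internal" is an assumption for illustration, as are the names `Guardrails`, `TIERS`, and `clamp_request`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrails:
    max_input_tokens: int
    max_output_tokens: int
    daily_token_budget: int
    max_concurrency: int
    max_files: int = 3
    max_file_bytes: int = 10 * 1024 * 1024  # 10 MB
    max_recursion_depth: int = 2
    max_tool_calls_per_turn: int = 3
    max_top_k: int = 16
    max_web_fetches: int = 3
    tool_timeout_s: float = 15.0
    request_timeout_s: float = 45.0

# Hypothetical tier assignment drawn from the ranges listed above.
TIERS = {
    "free": Guardrails(4_000, 512, 50_000, 1),
    "pro": Guardrails(8_000, 1_024, 200_000, 2),
    "internal": Guardrails(8_000, 1_024, 1_000_000, 3),
}

def clamp_request(tier, requested_output_tokens, requested_top_k):
    """Clamp caller-supplied parameters to the tier's server-side caps."""
    g = TIERS[tier]
    return (min(requested_output_tokens, g.max_output_tokens),
            min(requested_top_k, g.max_top_k))
```

Clamping on the server side means a client asking for a huge output or top-k simply gets the capped value, whatever it put in the request.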
Architectural Patterns That Resist LLM04
-
Two-stage architecture: A gatekeeper (cheap classifier/regex/AST/token estimator) sits in front of the generator (the expensive LLM). The gatekeeper rejects or rewrites malicious heavy prompts.
-
Deterministic “cost-safe” fallbacks: Canned KB answers, retrieval + template, or distilled model summaries.
-
Credit-based APIs: Charge by tokens and tool calls; pre-authorise spend; halt when credits are exhausted.
-
Workload isolation: Separate pools for public vs. enterprise tenants to confine the blast radius.
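The credit-based pattern (charge by tokens and tool calls, pre-authorise, halt on exhaustion) might be sketched like this. The pricing numbers and the names `CreditAccount`, `preauthorize`, `settle`, and `price` are hypothetical; the point is the hold-then-settle flow.

```python
class InsufficientCredits(Exception):
    pass

def price(tokens, tool_calls, per_1k_tokens=1, per_tool_call=2):
    """Cost in credits; token usage is rounded up to the next 1k."""
    return -(-tokens // 1000) * per_1k_tokens + tool_calls * per_tool_call

class CreditAccount:
    def __init__(self, credits):
        self.available = credits
        self.held = 0

    def preauthorize(self, max_cost):
        """Reserve the worst-case cost before the model runs."""
        if max_cost > self.available:
            raise InsufficientCredits("not enough credits for this request")
        self.available -= max_cost
        self.held += max_cost
        return max_cost

    def settle(self, hold, actual_cost):
        """Charge the actual cost (never more than the hold) and release the rest."""
        charged = min(actual_cost, hold)
        self.held -= hold
        self.available += hold - charged
        return charged
```

Because the hold is taken up front, a request can never spend more than its pre-authorised ceiling, and an account with no credits left stops before any GPU time is used.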
Red-Team Examples (to test your defenses)
-
“Produce a 200,000-token comparative legal brief with full case law quotes.”
-
“Run a step-by-step SAT solver on this 10k-line CNF file and show each inference.”
-
“Open 20 web pages, scrape each, then compute pairwise cosine similarities for every paragraph.”
-
“Iterate self-reflection until you find an error; repeat until perfect.”
-
“List all IPv4 addresses and the ASN and geolocation for each.”
Your gatekeeper should block or rewrite all of the above.
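The examples above can be kept as a regression check. The sketch assumes a hypothetical contract for your gatekeeper: a callable that returns `None` when it blocks a prompt, a different string when it rewrites one, and the unchanged prompt when it lets it through.

```python
RED_TEAM_PROMPTS = [
    "Produce a 200,000-token comparative legal brief with full case law quotes.",
    "Run a step-by-step SAT solver on this 10k-line CNF file and show each inference.",
    "Open 20 web pages, scrape each, then compute pairwise cosine similarities for every paragraph.",
    "Iterate self-reflection until you find an error; repeat until perfect.",
    "List all IPv4 addresses and the ASN and geolocation for each.",
]

def leaked(gate, prompts=RED_TEAM_PROMPTS):
    """Return the prompts that `gate` passed through unchanged."""
    return [p for p in prompts if gate(p) == p]
```

Running `leaked(your_gate)` in CI and asserting that the result is empty turns the list into a standing test rather than a one-off exercise.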
Business Impact & KPIs
-
Availability: Error rate, P95/P99 latency, queue wait times
-
Cost: $/1k tokens, $/request, autoscale events, budget burn rate
-
Abuse: % blocked by gatekeeper, flagged accounts, confirmed attacks
-
User experience: Time-to-first-token, completion success, NPS/CSAT
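A few of these KPIs reduce to short calculations. The helpers below are a sketch with assumed names; `statistics.quantiles` needs at least two samples, and the "inclusive" method is one reasonable choice among several percentile definitions.

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """P95/P99 from raw latency samples (requires at least two samples)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p95": cuts[94], "p99": cuts[98]}

def block_rate(blocked, total):
    """Share of requests the gatekeeper blocked."""
    return blocked / total if total else 0.0

def budget_burn_rate(spent, budget, elapsed_fraction):
    """Above 1.0 means spending faster than a straight-line budget allows."""
    if budget <= 0 or elapsed_fraction <= 0:
        return float("inf") if spent > 0 else 0.0
    return (spent / budget) / elapsed_fraction
```

For example, having spent half the monthly budget a quarter of the way through the month gives a burn rate of 2.0, which is the kind of number worth alerting on.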
Executive Checklist (CyberDudeBivash)
-
Hard caps on tokens, files, tool calls, recursion
-
Pre-prompt complexity filter & policy routing
-
Tiered rate limits, budgets, concurrency
-
Abuse signals (IP reputation, velocity, Tor) with automated actions
-
Cost guardrails + on-call alerts + graceful degradation
-
Red-team playbook & quarterly game days
Final Take
LLM04 isn’t hypothetical. If your AI app accepts arbitrary prompts, it’s already on an attacker’s to-do list. Treat compute as a protected asset, apply layered gatekeeping, and measure relentlessly. That’s how you keep your GPUs serving users—not attackers.
Stay protected with CyberDudeBivash—your ruthless, engineering-grade security ally.
Ecosystem:
cyberdudebivash.com | cyberbivash.blogspot.com | cryptobivash.code.blog
Contact: iambivash@cyberdudebivash.com
