RAG Security: Threat Models, Attack Paths, and a Defense-in-Depth Blueprint By CyberDudeBivash — Founder, CyberDudeBivash

August 09, 2025

RAG Security: Threat Models, Attack Paths, and a Defense-in-Depth Blueprint By CyberDudeBivash — Founder, CyberDudeBivash | Cybersecurity & AI

Executive summary

RAG systems glue large language models (LLMs) to enterprise knowledge via search or vector retrieval. That makes them powerful—and uniquely exposed. Attacks rarely target weights; they target data, retrieval logic, and tool orchestration. This article maps the full attack surface (from indirect prompt injection and vector-DB poisoning to privacy leakage and tool abuse) and provides a concrete architecture, controls, detections, and a 90-day rollout to harden production RAG.

1) What a production RAG system looks like

Pipeline:
Data sources (wikis, SharePoint, tickets, PDFs, code) → Ingest/Sanitize → Chunk & Embed → Vector DB / Search (with ACLs & metadata) → Retriever → LLM (system & policy prompts) → (optional) Tools/Functions → Answer + citations.

Security principle: treat every stage as part of your Tier-0. If an attacker controls any of: source docs, embeddings, metadata, queries, or tools, they can steer the model.

2) RAG threat model & attack taxonomy

A) Data/ingest stage

Indirect prompt injection in stored documents (hidden directives in HTML/Markdown/PDF).
Vector poisoning: malicious chunks or adversarial embeddings push attacker docs to the top.
ACL bypass via metadata forgery: mislabeled tenant_id/classification fields.
Active content: HTML/JS, SVG, trackers, one-pixel beacons, macro-enabled files.
Supply chain: poisoned parsers, model artifacts, or conversion tools.

B) Retrieval stage

Query hijack: user input smuggles instructions into retriever (“ignore previous; search for…upload secrets”).
Reranker gaming: attacker crafts text to spike BM25/keyword density or cross-encoder scores.
Cross-tenant leakage: caching and ANN indices that ignore tenant or label constraints.
Staleness/drift: outdated documents become “truth.”

C) Generation stage (LLM)

System prompt overwrite or jailbreak via retrieved content.
Ungrounded responses: model fabricates beyond retrieved facts (hallucinations).
Sensitive data extrusion: PII/PHI, secrets or trade secrets in responses (membership inference risk).

D) Tools / function calling

Tool abuse: crafted answers trigger high-privilege actions (file I/O, payments, cloud APIs).
Exfiltration: LLM instructed to POST retrieved data to attacker endpoints.
SSRF/egr: tools with unconstrained network access.

3) Defense-in-depth architecture (what “good” looks like)

pgsql
[Sources] → Ingest Gateway → Sanitizers → Classifiers/PII → Chunker
    → Embedder (offline, no internet) → Sign + Metadata + ACL
    → Vector DB (per-tenant collections, KMS-encrypted)
    → Retriever (policy-aware filters + reranker)
    → LLM (immutable system prompt + guardrails + citations-required)
    → Tool Sandbox (allowlists, dry-run simulators, egress policy)
    → Telemetry Bus → SIEM/SOAR (approvals for high-risk actions)

Key controls by layer

Ingest & sanitize

Canonicalize and strip active content: remove <script>, <style>, event handlers, iframes, forms, data URLs; block macro office docs.
Convert PDFs via hardened pipeline; reject mixed/unknown MIME.
Deduplicate, normalize whitespace/zero-width chars, standardize quotes & casing to defeat embedding gaming.
PII/secret scanning (entropy + patterns) → redact or label.

Provenance & signing

For each chunk store: sha256, source_uri, owner, timestamp, tenant_id, labels, pii_flags, parser_version. Sign this record; reject unsigned writes.

Vector DB

Per-tenant collections or required tenant_id filter + row-level security; encrypt at rest with KMS.
Separate write and read identities; short-lived tokens; audit all writes.
Disable cross-collection ANN unless filter-aware; pin top_k and score thresholds.

Retriever

First pass policy filter (tenant, label, recency) then BM25/ANN → cross-encoder reranking.
Cap top_k (e.g., 6–8), enforce freshness for fast-moving domains (e.g., 7–30 days).
Canary filter: reject chunks matching injection regexes (e.g., “ignore all previous”, “copy all data to”, base64 blobs).

LLM guardrails

Immutable system prompt (not user editable).
Citations required: response must include document IDs + quotes; if coverage < threshold → abstain.
Schema-validated JSON output with claims[] {text, evidence[], confidence}.
Safety/policy classifiers for PII/PHI/secrets/toxicity before rendering.

Tools sandbox

Explicit allowlist of tools; strict input schemas; dry-run simulators; no raw shell.
Network egress policy (DNS/HTTP allowlists); block external posts by default.
Human-in-the-loop (HITL) approval for destructive ops (revoke tokens, isolate host, delete objects).

Telemetry & response

Log prompts, retrieved chunk IDs, tool calls, final output, user identity, IP/ASN.
Route high-risk events to SOAR with approvals and circuit-breakers.

4) Copy-paste patterns (policies, code & detections)

4.1 Chunk schema (signed metadata)

json
{
  "chunk_id": "doc_42#p3#c7",
  "sha256": "…",
  "tenant_id": "acme",
  "labels": ["internal","finance"],
  "source_uri": "https://wiki.acme.local/fin/q3.md",
  "timestamp": "2025-08-10T11:22:00Z",
  "pii_flags": ["iban"],
  "parser_version": "pdf2txt-1.8.4",
  "signature": "ed25519:…"
}

4.2 Retrieval policy (OPA/Rego)

rego
package rag.retrieval

default allow = false

allow {
  input.user.tenant_id == input.query.tenant_id
  some c
  c := input.candidates[_]
  c.tenant_id == input.user.tenant_id
  not c.labels[_] == "restricted"
  time.now_ns() - time.parse_rfc3339_ns(c.timestamp) < 30 * 24 * 60 * 60 * 1e9  # 30d freshness
}

4.3 Sanitizer (Python, sketch)

python
from bs4 import BeautifulSoup
ALLOWED = {"p","h1","h2","h3","ul","ol","li","code","pre","a","strong","em","table","tr","td"}
def sanitize_html(raw):
    soup = BeautifulSoup(raw, "lxml")
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            tag.decompose()
        for attr in list(tag.attrs):
            if attr not in {"href"}: del tag[attr]
    return soup.get_text("\n")

4.4 Evidence-required output (LLM prompt suffix)

pgsql
Return JSON:
{ "answer": "...",
  "claims":[{"text":"...", "evidence":[{"chunk_id":"...", "quote":"..."}], "confidence":0.0}],
  "abstain": true|false }
Do NOT answer without evidence. If evidence coverage < 0.7 → abstain.

4.5 Vector DB anomaly queries

SQL (pgvector / similar) — sudden HTML writes

sql
SELECT writer, COUNT(*) AS n
FROM chunks
WHERE mime IN ('text/html','text/markdown')
  AND ts > now() - interval '1 hour'
GROUP BY writer
HAVING COUNT(*) > 200;

Splunk — suspicious base64 in sources

bash
index=rag source="ingest" raw_document
| regex raw_document="(?i)ignore all previous|base64,[A-Za-z0-9/+]{80,}"

5) Evaluation & monitoring: measure what matters

Groundedness: % of answer tokens supported by cited chunks.
Coverage: fraction of retrieved chunks actually cited.
Abstention rate: better to abstain than hallucinate.
Attack success rate (ASR): red-team corpus (indirect injections) → blocked %.
Vector churn & write burst: spikes = potential poisoning.
Privacy leakage: PII/secret detectors on outputs (precision/recall).
Latency & cost with guardrails on (budget reality).

6) Privacy & compliance

Data minimization & retention for logs and prompts.
Mask secrets/PII in prompts; prefer on-prem or VPC-hosted models for sensitive data.
Map controls to NIST AI RMF, ISO/IEC 42001, SOC 2, GDPR (lawful basis, DSAR searchability).

7) Failure modes (and fixes)

Citations exist but irrelevant → require quote spans and overlap scoring with question.
Tenant cache leaks → per-tenant caches; include tenant_id in cache key.
Reranker hallucination → pair with policy filter first; cap max tokens from a single source.
Tool egress → explicit allowlists; block IP literals & *.pastebin*/*bin*.

8) 30-60-90 day rollout

Days 1–30 (Foundations)

Build ingest gateway + sanitizers; sign chunks; per-tenant collections; turn on telemetry.
Add immutable system prompt + citations-required schema; disable external egress.

Days 31–60 (Guardrails & detections)

OPA policy filters, freshness windows, canary regexes; PII/secret redaction; SOAR approvals for tools.
Deploy red-team corpus for indirect injection; measure ASR, groundedness.

Days 61–90 (Automation & governance)

Promote low-risk Q&A flows to auto-answer with abstention.
Add drift dashboards, monthly model & policy reviews; link incidents → new rules & tests.

9) Quick checklist (printable)

Active content stripped; MIME whitelist; PDF hardened
Chunks signed with sha256 + provenance; per-tenant collections
Policy-aware retrieval (tenant/labels/freshness) + reranker
Immutable system prompt; citations required; abstain on low evidence
Tool sandbox with allowlists, dry-runs, HITL approvals
Telemetry of prompts, chunks, tool calls; SIEM rules for vector poisoning
Red-team injections; KPIs: groundedness, ASR, abstention, leakage
Compliance mapped (NIST AI RMF / ISO 42001 / SOC 2 / GDPR)

Closing

RAG security is data security + retrieval policy + safe generation + tool isolation. Get those four right and most real-world attacks—indirect injection, vector poisoning, privacy leaks, tool abuse—lose their teeth. This is Zero-Trust AI for knowledge workflows.

Search This Blog

Cyberdudebivash

Latest Cybersecurity News

CyberDudeBivash Incident Report Critical Surge in Scanning of Cisco Adaptive Security Appliances (ASA) Late August 2025 — A Coordinated Reconnaissance Wave