AI & NLP for Threat Intelligence (2025): Automate IOC/TTP Extraction, Summaries & ATT&CK Mapping

By CyberDudeBivash • September 21, 2025 (IST)

 


TL;DR 

  • What you’ll build: an end-to-end CTI pipeline that ingests reports/feeds → extracts IOCs & TTPs → normalizes/dedupes → maps to MITRE ATT&CK → publishes STIX 2.1 to your TIP (MISP/OpenCTI) and pushes detections to SIEM/SOAR. ATT&CK is your lingua franca for adversary behavior. MITRE ATT&CK+1

  • Why now: the building blocks are mature. spaCy and Hugging Face handle NER, STIX/TAXII 2.1 handle exchange, MISP/OpenCTI provide knowledge graphs, and ATT&CK Navigator gives coverage views. MITRE ATT&CK, spacy.io, Hugging Face

  • Business win: shrink report-to-detection from days to minutes; measure precision/recall on extractions and coverage deltas per ATT&CK technique. (Use CISA’s mapping practices to keep analysts honest.) CISA


1) What problems AI actually solves in CTI

  • Speed: OCR/PDF → clean text → IOC/TTP extraction and entity linking at stream speed.

  • Normalization: inconsistent formats → STIX 2.1 objects (Indicator, Malware, Intrusion Set, Relationship). OASIS Open+1

  • Prioritization: summarize long reports; rank IOCs by observed-in and confidence; map to your detection gaps using ATT&CK. MITRE ATT&CK

  • Distribution: auto-publish via TAXII 2.1 to TIPs and subscribers. docs.oasis-open.org+1


2) Reference pipeline 

Ingest → Parse → NER/IOC extract → Validate → Normalize & De-dup → TTP extraction → ATT&CK mapping → STIX 2.1 pack → TAXII publish → SIEM/SOAR actions

2.1 Ingest & parsing

  • Accept PDF, HTML, blog, and Twitter/X feeds. Strip boilerplate; preserve line breaks for pattern-based cues (e.g., command blocks).
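
The boilerplate-stripping step can be sketched with the standard library alone. This is a minimal illustration, not a production parser: the tag sets `SKIP_TAGS` and `BLOCK_TAGS` and the helper `html_to_text` are hypothetical names chosen for this example, and real reports will likely need a larger tag list or a dedicated extraction library.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers: everything inside these tags is dropped.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Block-level tags that should end a line in the output text.
BLOCK_TAGS = {"p", "div", "br", "li", "pre", "tr", "h1", "h2", "h3", "h4"}


class ReportTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return visible text, one output line per source line, empty lines removed."""
    parser = ReportTextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.rstrip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line.strip())
```

Because newlines inside `<pre>` blocks are passed through untouched, multi-line command listings stay one command per line, which is what the pattern rules in §2.3 rely on.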

2.2 IOC extraction (NER + rules)

  • Use spaCy (fast, customizable) + Hugging Face token-classification models for domain/IP/hash/URL/CVE tags; backstop with regex/heuristics for high-precision patterns. spacy.io+1

  • Validate with shape checks (IPv4/6 syntax, known-TLD list), flag sinkholed and typo-squatted domains, and run active DNS lookups only from a quarantined environment.
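
The shape checks above might look like the following sketch. It covers only the offline part (IP syntax and a TLD allow-list); the active DNS step is left out because it needs network access. `KNOWN_TLDS` is a deliberately tiny, hypothetical list, and rejecting non-global addresses is an assumption about what you want in an indicator feed.

```python
import ipaddress
import re

# Hypothetical, tiny allow-list; a real pipeline would load the full IANA TLD list.
KNOWN_TLDS = {"com", "net", "org", "io", "ru", "cn", "info"}
# One DNS label: 1-63 chars of [a-z0-9-], not starting or ending with a hyphen.
LABEL_RE = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_valid_ip(value: str) -> bool:
    """Accept syntactically valid IPv4/IPv6 addresses that are globally routable."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return False
    return addr.is_global  # drops private, loopback, and reserved ranges


def is_valid_domain(value: str) -> bool:
    """Accept names with at least two labels and a TLD from the allow-list."""
    labels = value.lower().rstrip(".").split(".")
    if len(labels) < 2 or labels[-1] not in KNOWN_TLDS:
        return False
    return all(LABEL_RE.match(label) for label in labels)
```

The TLD check is what keeps file names such as `report.docx` from being published as domains.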

2.3 TTP extraction (behavior → techniques)

  • Pattern library for common textual cues → ATT&CK techniques; e.g., “mimikatz/lsass dump” → OS Credential Dumping (T1003); “regsvr32 /s /u /i:http” → System Binary Proxy Execution: Regsvr32 (T1218.010). Use ATT&CK technique pages as your source of truth. MITRE ATT&CK

  • Apply weak/medium/strong mapping rules and keep analyst review in the loop (see §5).
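
One way to express weak/medium/strong mapping rules is to attach a tier to each pattern and keep the strongest evidence per technique. The sketch below assumes that design; the `TieredRule` type, the tier assigned to each pattern, and the `needs_review` flag are illustrative choices, not a prescribed scheme.

```python
import re
from dataclasses import dataclass

TIER_RANK = {"weak": 1, "medium": 2, "strong": 3}


@dataclass(frozen=True)
class TieredRule:
    pattern: str    # regex applied case-insensitively
    technique: str  # ATT&CK technique ID
    tier: str       # "weak", "medium", or "strong"


# Illustrative tiers: an exact command is strong, a tool name medium, a bare keyword weak.
RULES = [
    TieredRule(r"sekurlsa::logonpasswords", "T1003.001", "strong"),
    TieredRule(r"\bmimikatz\b", "T1003", "medium"),
    TieredRule(r"\blsass\b", "T1003", "weak"),
]


def map_with_tiers(text: str) -> list:
    """Return one mapping per technique, keeping the highest-tier evidence string."""
    best = {}
    for rule in RULES:
        match = re.search(rule.pattern, text, flags=re.I)
        if not match:
            continue
        current = best.get(rule.technique)
        if current is None or TIER_RANK[rule.tier] > TIER_RANK[current["tier"]]:
            best[rule.technique] = {
                "technique": rule.technique,
                "tier": rule.tier,
                "evidence": match.group(0),            # text that triggered the rule
                "needs_review": rule.tier != "strong",  # route to an analyst
            }
    return sorted(best.values(), key=lambda item: item["technique"])
```

Keeping the matched text as `evidence` gives reviewers the link between the report wording and the technique ID.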

2.4 Normalize & de-dup

  • Canonicalize domains (evil[.]com → evil.com), hashes, and CVEs; merge by observable keys; attach source and confidence.
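
Canonicalizing and merging can be sketched as below. The function names, the observation dictionary layout, and the choice to keep the highest confidence when merging are assumptions made for this example.

```python
import re


def refang(value: str) -> str:
    """Undo common defanging: hxxp -> http, [.] or (.) -> ., [:] -> :"""
    v = value.strip()
    v = re.sub(r"^hxxp", "http", v, flags=re.I)
    return v.replace("[.]", ".").replace("(.)", ".").replace("[:]", ":")


def canonical_key(ioc_type: str, value: str) -> tuple:
    """Build the key used to decide whether two observations are the same IOC."""
    v = refang(value)
    if ioc_type in {"domain", "hash"}:
        v = v.lower().rstrip(".")
    elif ioc_type == "cve":
        v = v.upper()
    return (ioc_type, v)


def merge_observations(observations: list) -> dict:
    """Merge observations by canonical key, collecting sources and the max confidence."""
    merged = {}
    for obs in observations:
        key = canonical_key(obs["type"], obs["value"])
        entry = merged.setdefault(
            key, {"type": key[0], "value": key[1], "sources": set(), "confidence": 0}
        )
        entry["sources"].add(obs["source"])
        entry["confidence"] = max(entry["confidence"], obs["confidence"])
    return merged
```

Taking the maximum confidence is the simplest merge policy; a weighted combination across sources is an equally reasonable alternative.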

2.5 Package & publish

  • Pack the normalized objects into a STIX 2.1 bundle and publish it over TAXII 2.1 to your TIP (see §3.3 and §3.4).

2.6 Close the loop

  • Use ATT&CK Navigator layers to visualize what techniques the intel covers vs your detections. Feed gaps to your SIEM/SOAR backlog. MITRE ATT&CK
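
A coverage layer can be generated from two sets of technique IDs: those the intel mentions and those you already detect. The sketch below writes the fields I understand the Navigator layer format to use (`name`, `versions`, `domain`, `techniques` with `techniqueID`, `score`, `comment`); the layer version string and the 0/1 scoring are assumptions, so check them against the layer format your Navigator release expects.

```python
import json


def build_layer(name: str, intel_techniques: set, detected_techniques: set) -> dict:
    """Score each intel technique 1 if a detection exists, else 0 (a gap)."""
    techniques = []
    for tid in sorted(intel_techniques):
        covered = tid in detected_techniques
        techniques.append({
            "techniqueID": tid,
            "score": 1 if covered else 0,
            "comment": "detection exists" if covered else "gap: no detection",
        })
    return {
        "name": name,
        "versions": {"layer": "4.5"},  # assumed layer-format version
        "domain": "enterprise-attack",
        "techniques": techniques,
    }


def coverage_gaps(layer: dict) -> list:
    """Technique IDs present in the intel but missing from detections."""
    return [t["techniqueID"] for t in layer["techniques"] if t["score"] == 0]


if __name__ == "__main__":
    layer = build_layer("Report X coverage", {"T1003", "T1105"}, {"T1105"})
    print(json.dumps(layer, indent=2))
    print("Backlog candidates:", coverage_gaps(layer))
```

The list returned by `coverage_gaps` is what would feed the SIEM/SOAR backlog.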


3) Minimal working example (Python)

3.1 Extract IOCs with spaCy + Transformers

# pip install transformers tldextract
import re

import tldextract
from transformers import pipeline

# General-purpose NER model (PER/ORG/LOC/MISC) from the Hugging Face example.
# It is not IOC-aware, so extraction below is rule-based; swap in a CTI-tuned model for IOC tags.
ner = pipeline("token-classification", model="dslim/bert-base-NER")

IOC_PATTERNS = {
    # Four octets separated by dots (the original pattern required a dot or end-of-string after the last octet)
    "ip": re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\b"),
    "sha256": re.compile(r"\b[A-Fa-f0-9]{64}\b"),
    "md5": re.compile(r"\b[A-Fa-f0-9]{32}\b"),
    "url": re.compile(r"\bhttps?://[^\s)]+"),
}


def extract_iocs(text: str) -> dict:
    out = {"ip": set(), "hash": set(), "url": set(), "domain": set(), "cve": set()}
    # Rule-based extraction
    for kind, pat in IOC_PATTERNS.items():
        for m in pat.findall(text):
            if kind == "ip":
                out["ip"].add(m)
            elif kind in ("sha256", "md5"):
                out["hash"].add(m.lower())
            elif kind == "url":
                out["url"].add(m)
    # Registered domains taken from the extracted URLs
    for u in out["url"]:
        ext = tldextract.extract(u)
        if ext.domain and ext.suffix:
            out["domain"].add(f"{ext.domain}.{ext.suffix}".lower())
    # Lightweight CVE match, normalized to upper case
    out["cve"].update(c.upper() for c in re.findall(r"CVE-\d{4}-\d{4,7}", text, flags=re.I))
    return {k: sorted(v) for k, v in out.items()}

(Hugging Face “token-classification”/NER pipeline & docs shown for reference.) Hugging Face+1

3.2 Map text snippets to ATT&CK techniques (heuristics)

import re

# (pattern, ATT&CK technique ID); validate IDs against the current ATT&CK catalog
ATTACK_RULES = [
    (r"mimikatz|sekurlsa|lsass", "T1003"),       # OS Credential Dumping
    (r"regsvr32.*https?", "T1218.010"),          # System Binary Proxy Execution: Regsvr32
    (r"powershell.*-enc", "T1059.001"),          # PowerShell
    (r"rundll32.*url|dllhost.*url", "T1218"),    # System Binary Proxy Execution
    (r"certutil.*-urlcache", "T1105"),           # Ingress Tool Transfer
    (r"certutil.*-decode", "T1140"),             # Deobfuscate/Decode Files or Information
]


def map_ttps(text: str):
    hits = {}
    for pat, tech in ATTACK_RULES:
        if re.search(pat, text, flags=re.I):
            hits[tech] = hits.get(tech, 0) + 1
    # "rule:<n>" records how many rules fired for the technique
    return [{"technique": t, "evidence": f"rule:{n}"} for t, n in hits.items()]

(Use MITRE ATT&CK technique catalog to validate mappings & keep rules refreshed.) MITRE ATT&CK

3.3 Emit a STIX 2.1 Indicator bundle (simplified)

import datetime as dt
import json
import uuid

# Object path per hash length; SHA-256 is quoted in STIX patterns because of the hyphen
HASH_PATHS = {32: "file:hashes.MD5", 64: "file:hashes.'SHA-256'"}


def stix_indicator(ioc: str, object_path: str, label="malicious-activity", conf=70):
    # RFC 3339 UTC timestamp with millisecond precision and a trailing "Z"
    now = dt.datetime.now(dt.timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "valid_from": now,  # required on STIX 2.1 Indicators
        "name": f"{object_path.split(':')[0]}:{ioc}",
        "pattern_type": "stix",
        "pattern": f"[{object_path} = '{ioc}']",
        "confidence": conf,
        "indicator_types": [label],
    }


def stix_bundle(iocs):
    objs = []
    for ip in iocs["ip"]:
        objs.append(stix_indicator(ip, "ipv4-addr:value"))
    for d in iocs["domain"]:
        objs.append(stix_indicator(d, "domain-name:value"))
    for h in iocs["hash"]:
        objs.append(stix_indicator(h, HASH_PATHS[len(h)]))  # MD5 or SHA-256, chosen by length
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objs}

# Example
# text = open("report.txt").read()
# bundle = stix_bundle(extract_iocs(text))
# print(json.dumps(bundle, indent=2))

(STIX 2.1 is the current exchange standard for CTI; see the OASIS spec & examples.) OASIS Open+1

3.4 TAXII 2.1 publish (conceptual)

  • POST the bundle's objects, wrapped in a TAXII envelope, to the TAXII 2.1 collections/{id}/objects/ endpoint with an API token. (See OASIS TAXII 2.1 for REST details.) docs.oasis-open.org

  • On the receiving end, MISP or OpenCTI ingests and enriches (sightings, relationships, graph). misp-project.org+1
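
The publish step could be prepared as in the sketch below, which only builds the HTTP request so it can be inspected before anything is sent. As I read the TAXII 2.1 specification, objects are posted inside an envelope (`{"objects": [...]}`) with the `application/taxii+json;version=2.1` media type. The API root URL, the collection ID, and the Bearer-token scheme are assumptions; servers differ in how they authenticate.

```python
import json
import urllib.request

TAXII_MEDIA_TYPE = "application/taxii+json;version=2.1"


def build_taxii_post(api_root: str, collection_id: str, stix_objects: list, token: str):
    """Build (but do not send) a POST to a collection's objects endpoint."""
    url = f"{api_root.rstrip('/')}/collections/{collection_id}/objects/"
    body = json.dumps({"objects": stix_objects}).encode("utf-8")  # TAXII envelope
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": TAXII_MEDIA_TYPE,
            "Accept": TAXII_MEDIA_TYPE,
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


if __name__ == "__main__":
    # Hypothetical API root and collection ID, for illustration only.
    req = build_taxii_post("https://taxii.example.test/api1", "my-collection", [], "TOKEN")
    print(req.get_method(), req.full_url)
    # To send: urllib.request.urlopen(req) and read the returned status resource.
```

Separating request construction from sending keeps the publish step easy to test offline.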


4) Integrations that matter (and why)

Layer | Tooling | Why it helps
TIP | OpenCTI, MISP | Knowledge graphing, STIX in/out, connectors, collaboration. docs.opencti.io+1
Exchange | STIX 2.1 / TAXII 2.1 | Vendor-neutral, standards-based sharing/publishing. OASIS Open+1
Mapping/coverage | MITRE ATT&CK + Navigator | Normalized TTPs and visualization of detection gaps. MITRE ATT&CK+1
Extraction | spaCy, Transformers (HF) | Production-grade NER + customizable models. spacy.io+1

5) Human-in-the-loop (HITL) keeps you honest

  • Analyst review gates: promote items to “published” only after a short check of precision (especially TTP mappings).

  • CISA’s ATT&CK mapping guidance: avoid “wishful mapping” and biases; require evidence strings linking text to technique IDs. CISA

  • Feedback loops: false positives go back to training (regex tweaks, prompt updates, model fine-tuning).


6) Quality & ROI: measure these, or it didn’t happen

  • Extraction P/R/F1 for IOCs & TTPs (label 200–500 sentences; update quarterly).

  • Latency: ingest→publish p50/p95.

  • Coverage delta: techniques with active detections before vs after intel import (Navigator layer diff). MITRE ATT&CK

  • SOC impact: time saved per case, auto-enrichment hit rate, ratio of auto-closed low-risk alerts.

  • Cost to value: GPU/CPU time vs analyst hours saved.
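
The extraction and latency metrics can be computed with a few lines of standard-library Python. This sketch assumes the golden set and the pipeline output are available as sets of normalized IOC strings and that latencies are recorded in seconds; the function names are illustrative.

```python
import statistics


def prf1(predicted: set, gold: set) -> dict:
    """Precision, recall, and F1 of extracted items against a labeled golden set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def latency_percentiles(seconds: list) -> dict:
    """p50/p95 of ingest-to-publish latency; needs at least two samples."""
    cuts = statistics.quantiles(seconds, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


if __name__ == "__main__":
    predicted = {"1.2.3.4", "evil.com", "bad.net"}
    gold = {"1.2.3.4", "evil.com", "x.org", "y.org"}
    print(prf1(predicted, gold))
    print(latency_percentiles([10, 20, 30, 40, 50]))
```

Computing these per IOC type and per technique, not only in aggregate, makes regressions in a single extractor easier to spot.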


7) Production safeguards

  • Confidence scoring & source weighting (vendor reputation, age, sightings).

  • De-dup & decay: older IOCs auto-downgrade unless re-sighted.

  • Toxic data filters: block “copy-pasted” attack chains from Reddit/unknown gists without corroboration.

  • Tenant-aware exports: separate workforce vs customer intel where licensing requires it.
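
Source weighting and decay can be combined into one effective-confidence score. The sketch below assumes exponential decay with a 30-day half-life and a hypothetical `SOURCE_WEIGHT` table; both are illustrative parameters to tune against your own data, not recommended values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical source weights; unknown sources get the lowest weight.
SOURCE_WEIGHT = {"trusted-vendor": 1.0, "community-feed": 0.7, "unknown": 0.4}
HALF_LIFE_DAYS = 30  # assumed: confidence halves every 30 days without a sighting


def effective_confidence(base: int, source: str, last_sighted: datetime, now: datetime) -> int:
    """Scale base confidence (0-100) by source weight and time since last sighting."""
    weight = SOURCE_WEIGHT.get(source, SOURCE_WEIGHT["unknown"])
    age_days = max((now - last_sighted).total_seconds() / 86400, 0.0)
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return round(base * weight * decay)


if __name__ == "__main__":
    now = datetime(2025, 9, 21, tzinfo=timezone.utc)
    stale = now - timedelta(days=30)
    print(effective_confidence(80, "trusted-vendor", now, now))
    print(effective_confidence(80, "trusted-vendor", stale, now))
```

A re-sighting simply resets `last_sighted`, which restores the score without any separate "un-decay" logic.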


8) 30/60/90-day rollout

Days 1–30 (Pilot)

  • Stand up OpenCTI or MISP; wire TAXII input, attach a small set of trusted sources. docs.opencti.io+1

  • Ship IOC extraction + basic ATT&CK heuristics; publish STIX 2.1 to a sandbox collection. OASIS Open

  • Start a 200-sentence golden set for evaluation.

Days 31–60 (Harden)

  • Add HITL UI, confidence tiers, and auto-dedup; enrich with WHOIS/passive DNS; auto-create Navigator layers for coverage reviews. MITRE ATT&CK

  • Begin SIEM/SOAR wiring: blocklists for high-confidence IOCs; analytics for common techniques.

Days 61–90 (Operate)

  • Expand TTP rules; add model fine-tuning for domain-specific jargon; schedule weekly metrics; open TAXII to internal consumers. docs.oasis-open.org


9) Playbooks 

IOC → Action (high-confidence)

  1. Publish STIX Indicator (+ Sighting if seen).

  2. Create SOAR task to block (URL/IP/hash) and hunt last 30 days.

  3. Expire after N days without sightings.

TTP → Action

  1. Add ATT&CK technique to Navigator; check detection gap. MITRE ATT&CK

  2. If a gap exists: create a task for a SIEM rule, Sigma rule, or JEA script.

  3. Backfill search & case.


10) Build vs buy (fast guidance)

  • Buy platform; build extractors. Most teams win with a commercial/open TIP + custom NLP on top.

  • Red flags: no STIX/TAXII, no ATT&CK alignment, black-box ML without feedback loops, no export to SIEM/SOAR.


FAQs

Is LLM summarization safe for CTI?
Yes—with prompt constraints, source citations, and a human approval step for high-impact summaries.

Why not rely only on regex?
Rules give precision; ML adds recall and generalizes to unseen formats. Use both.

Can we auto-map techniques?
Use weak/strong evidence tiers + analyst review. CISA’s paper highlights common mapping errors—treat it as policy. CISA

