Executive summary
Detection and hunting are how defenders turn telemetry into decisions. Detection engineering codifies known-bad behavior (TTPs) into reliable, low-noise analytics. Threat hunting is the hypothesis-driven search for the unknown using context, anomaly signals, and analyst intuition. This playbook gives you a production-ready approach: what data to collect, how to shape it, patterns to detect, practical queries for Windows/Linux/Cloud, hunting workflows, quality gates, and KPIs.
1) Foundations: detection vs hunting
- Detection engineering: repeatable analytics (“detections-as-code”) with tests, owners, deployment pipelines, and SLAs. Output = alerts.
- Threat hunting: iterative investigations without waiting for alerts. Output = new detections, intel, hardening tasks.
Both share the same raw materials: telemetry → normalization → enrichment → analytics → action.
2) Telemetry strategy (Minimum Viable Telemetry)
Endpoint
- Process: parent/child, full command line, integrity level, hashes, signer, image path.
- File: create/rename/delete, entropy, extension/type mismatch.
- Registry (Win): run keys, services, LSA providers.
- Network: per-process flows (dst, port, bytes, JA3/JA4, SNI/host).
- Memory: module loads, injection indicators (RWX, VirtualAllocEx, CreateRemoteThread).
Identity & SaaS
- Auth logs: success/fail, MFA, geo, device posture, risk flags.
- OAuth: consent grants, new app registrations, token lifetimes/scopes.
- Mail/Drive/Share: sharing changes, mass downloads/deletes.
Cloud
- Control plane: IAM changes, policy updates, key usage.
- Data plane: object access, egress byte deltas.
- Compute: metadata service access, container exec/priv-esc, unusual images.
Network (sensor or cloud PCAP/flow)
- DNS (query name, NXDomain rate, TTL), HTTP (host, path, UA), TLS (SNI, JA3/JA4), NetFlow.
Normalize with a schema (ECS/OSSEM) and time-sync everything (NTP). Enrich with asset/owner tags, GeoIP/ASN, threat intel, process reputation.
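To make the normalize-then-enrich step concrete, here is a minimal Python sketch. It assumes a Sysmon-style raw process event (Image, CommandLine, ParentImage, Computer) plus a hypothetical epoch field, and maps it onto ECS-style field names. The ASSETS table is a made-up stand-in for whatever inventory or CMDB you actually use.

```python
from datetime import datetime, timezone

# Hypothetical asset inventory keyed by lowercase hostname (stand-in for a CMDB lookup).
ASSETS = {"ws-042": {"owner": "alice", "tier": "workstation"}}


def normalize(raw: dict) -> dict:
    """Map a raw process event onto ECS-style names, with a UTC timestamp and asset tags."""
    # Convert the (assumed) epoch-seconds field into an explicit UTC timestamp.
    ts = datetime.fromtimestamp(raw["epoch"], tz=timezone.utc)
    host = raw["Computer"].lower()
    event = {
        "@timestamp": ts.isoformat(),
        "host.name": host,
        "process.executable": raw["Image"],
        "process.command_line": raw["CommandLine"],
        "process.parent.executable": raw.get("ParentImage"),
    }
    # Enrichment: attach owner/tier tags, defaulting to "unknown" for unlisted hosts.
    asset = ASSETS.get(host, {})
    event["labels.owner"] = asset.get("owner", "unknown")
    event["labels.tier"] = asset.get("tier", "unknown")
    return event
```

Keeping normalization and enrichment in one small, testable function makes it easier to feed the same synthetic events to every downstream analytic.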
3) Detection engineering lifecycle
- Hypothesis/TTP (map to ATT&CK sub-technique).
- Data contract (fields required, sources).
- Rule/analytic (KQL/SPL/Sigma/EQL).
- Tests: unit (synthetic logs), replay (pcaps/evt), adversary emulation (Atomic Red Team).
- Quality gates: data freshness, field completeness, cardinality limits, false-positive review.
- Deploy with detections-as-code (Git + CI/CD). Track owner, SLA, MTTD/PPV (precision).
Tip: write detections around behaviors, not hashes. Hashes rot; TTPs persist.
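A data contract and a field-completeness gate can be as simple as the sketch below. The required field list and the 99% threshold are illustrative assumptions, not values from any standard; tune both per rule.

```python
# Fields this (hypothetical) detection needs in order to fire reliably.
REQUIRED_FIELDS = ("@timestamp", "host.name", "process.executable", "process.command_line")


def completeness(events, required=REQUIRED_FIELDS):
    """Return, per required field, the fraction of events where it is present and non-empty."""
    total = len(events)
    if total == 0:
        return {field: 0.0 for field in required}
    return {
        field: sum(1 for e in events if e.get(field) not in (None, "")) / total
        for field in required
    }


def contract_violations(events, min_ratio=0.99):
    """List the required fields whose completeness falls below min_ratio."""
    return sorted(f for f, ratio in completeness(events).items() if ratio < min_ratio)
```

Run a check like this in CI against a recent sample of production events, and fail the build if a rule's contract is violated.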
4) Core behavioral patterns (with ready-to-use analytics)
A) Initial access & execution (Windows)
Suspicious PowerShell (download/execution, AMSI bypass attempts)
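As a rough illustration of this pattern, the Python sketch below labels PowerShell command lines by the behaviors named above. The indicator strings are a small, assumed sample rather than an exhaustive list, and a production rule would live in your SIEM's query language.

```python
import re

# Illustrative indicators only. PowerShell accepts abbreviated parameters,
# so the "-enc" pattern will miss some encoded-command variants.
PATTERNS = {
    "download": re.compile(
        r"downloadstring|downloadfile|invoke-webrequest|\biwr\b|net\.webclient|start-bitstransfer",
        re.I,
    ),
    "execution": re.compile(r"invoke-expression|\biex\b|-enc(odedcommand)?\b|frombase64string", re.I),
    "amsi": re.compile(r"amsiutils|amsiinitfailed|amsiscanbuffer", re.I),
}


def powershell_hits(image: str, command_line: str) -> list[str]:
    """Return the behavior labels whose indicators appear in a PowerShell command line."""
    # Compare on the file name only, tolerating both slash styles.
    name = image.replace("\\", "/").rsplit("/", 1)[-1].lower()
    if name not in ("powershell.exe", "pwsh.exe"):
        return []
    return [label for label, rx in PATTERNS.items() if rx.search(command_line)]
```

Alerting only when two or more labels co-occur (for example download plus execution) is one way to keep noise down.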
LOLBin abuse
- rundll32, mshta, wmic, certutil, bitsadmin launching network connections or scripts.
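One way to express the LOLBin check is sketched below in Python. It assumes ECS-style dotted keys and a hypothetical network.destination_ips field populated by per-process flow telemetry; the URL and script-extension regex is illustrative.

```python
import re

LOLBINS = {"rundll32.exe", "mshta.exe", "wmic.exe", "certutil.exe", "bitsadmin.exe"}

# URLs, UNC paths, script protocol handlers, or common script extensions on the command line.
REMOTE_OR_SCRIPT = re.compile(
    r"https?://|\\\\[\w.-]+\\|javascript:|vbscript:|\.(hta|sct|js|vbs|ps1)\b",
    re.I,
)


def is_lolbin_abuse(event: dict) -> bool:
    """Flag a LOLBin that either made network connections or references a URL/script."""
    exe = event["process.executable"].replace("\\", "/").rsplit("/", 1)[-1].lower()
    if exe not in LOLBINS:
        return False
    # "network.destination_ips" is a hypothetical enrichment field, not a standard one.
    has_network = bool(event.get("network.destination_ips"))
    has_remote_arg = bool(REMOTE_OR_SCRIPT.search(event.get("process.command_line", "")))
    return has_network or has_remote_arg
```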
B) Persistence & privilege escalation
New auto-start extensibility points
Service installs from user-writable paths
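A simple version of the service-path check, sketched in Python. The list of user-writable path markers is an assumption for illustration; real environments differ, so baseline your own before alerting.

```python
# Assumed markers for locations that standard users can usually write to.
USER_WRITABLE = ("\\users\\", "\\programdata\\", "\\temp\\", "\\tmp\\")


def service_from_user_writable(image_path: str) -> bool:
    """Return True if a service's binary path points into a user-writable location."""
    # Service ImagePath values are often quoted and may carry arguments.
    path = image_path.strip().strip('"').lower().replace("/", "\\")
    return any(marker in path for marker in USER_WRITABLE)
```

Feed it the binary path from service-install events (for example Windows event ID 7045) and review hits alongside signer and parent process.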
C) Credential access & discovery
Suspicious LSASS access
Kerberoasting prep
D) Lateral movement
WMI/PSRemoting from workstations
E) Command & control / beaconing
Flow periodicity & low-and-slow
(Flag regular intervals with small jitter + small volume → beacon suspects.)
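The periodicity heuristic can be sketched with the coefficient of variation (standard deviation divided by mean) of the gaps between connections. The thresholds below are illustrative assumptions, and the function expects timestamps in seconds for a single source/destination pair.

```python
from statistics import mean, pstdev


def is_beacon_suspect(timestamps, byte_counts, max_cv=0.1, max_avg_bytes=2048, min_events=6):
    """Flag flows with near-constant inter-arrival gaps (small jitter) and small payloads."""
    if len(timestamps) < min_events or not byte_counts:
        return False
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    avg_gap = mean(gaps)
    if avg_gap <= 0:
        return False
    # Low coefficient of variation means the intervals are regular.
    cv = pstdev(gaps) / avg_gap
    return cv <= max_cv and mean(byte_counts) <= max_avg_bytes
```

Expect benign hits from update checkers and monitoring agents; pair the result with destination rarity or signer reputation before alerting.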
F) Exfiltration
DNS tunneling heuristic
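A minimal Python sketch of such a heuristic combines label length with Shannon entropy. The thresholds are assumptions, and treating the last two labels as the registered domain is a simplification that ignores multi-part public suffixes such as co.uk.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def dns_tunnel_suspect(qname, max_label_len=40, min_entropy=3.5, registered_labels=2):
    """Flag query names whose longest subdomain label is both long and high-entropy."""
    labels = qname.rstrip(".").lower().split(".")
    # Naive split: everything left of the last `registered_labels` labels is subdomain.
    subdomain = labels[:-registered_labels]
    if not subdomain:
        return False
    longest = max(subdomain, key=len)
    return len(longest) > max_label_len and shannon_entropy(longest) >= min_entropy
```

Aggregating hits per source host and per registered domain over a time window is usually more reliable than alerting on single queries.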
Sudden egress spike to new ASN
G) Ransomware staging
- Rapid file rename/write with high entropy, shadow copy deletion, suspicious backup/defender tampering.
5) Linux & macOS essentials
Linux: new listener by an unusual binary
Linux: privilege escalation surfaces
- sudoers edits, setuid bit changes, unprivileged eBPF use, ld.so.preload, cron entries in user-writable dirs.
macOS: persistence
- LaunchAgents/LaunchDaemons from ~/Library/LaunchAgents/ with network reach-outs; unsigned binaries allowed via user click → hunt Gatekeeper bypass traces.
6) Cloud detections that matter
Azure AD risky impossible travel
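The underlying impossible-travel logic is platform-independent and can be sketched in Python: compute the great-circle distance between two consecutive sign-ins and compare the implied speed with a ceiling. The record shape (ts, lat, lon) and the 900 km/h limit are assumptions for illustration; this is not how any specific identity provider computes its risk signal.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def impossible_travel(prev, curr, max_kmh=900.0):
    """prev/curr are dicts with 'ts' (epoch seconds), 'lat', and 'lon'."""
    # Floor the elapsed time at one second so simultaneous logins do not divide by zero.
    hours = max((curr["ts"] - prev["ts"]) / 3600, 1 / 3600)
    distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    return distance / hours > max_kmh
```

GeoIP error, VPN egress points, and mobile carriers all produce false positives, so treat this as a prioritization signal and combine it with device posture and MFA results.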
AWS key misuse
- AccessKey used from new ASN + S3 List/Get flood + CloudTrail DeleteTrail attempts → high-risk triage.
GCP service account drift
- New key creation followed by BigQuery export → egress monitor around the key’s first use.
7) Threat hunting workflow (4-hour cycle)
- Choose a seed: a TTP (e.g., DLL sideloading), an anomaly (new JA3), or new intel (domain set).
- State a hypothesis: “We will find unsigned binaries sideloaded by Office spawning rundll32 with network egress.”
- Scoping queries: broad → narrow. Save notebooks (Jupyter + MSTICPy/SQL/SPL).
- Pivot: by parent process, user, host, ASN, signer, hash cluster, JA3 cluster.
- Document leads: promote to detection if repeatable; file hardening/IR tickets if real risk.
- Retro hunt (30–90 days) for newly found IOCs/TTPs.
Add a hunt register: hypothesis, coverage, queries, outcomes, follow-ups.
8) Machine learning that actually helps
- Outlier/anomaly: z-scores/Isolation Forest on per-host command counts, child-process trees, DNS lengths.
- Beacon detection: spectral analysis (FFT) on inter-arrival times.
- Clustering: group command lines (TF-IDF + HDBSCAN) to surface “weird” exec strings.
- Graph features: user–host–process graphs; detect unusual edges.
Guardrails: strict explainability, feedback loops to analysts, and feature drift monitors. Use ML to prioritize and suggest pivots, not to auto-close cases.
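The simplest of these techniques, a z-score over per-host counts, fits in a few lines of standard-library Python. This is a sketch under assumed inputs (a dict of host to count) and an assumed threshold of 3; a single extreme value inflates the standard deviation, so a median-based variant may suit small fleets better.

```python
from statistics import mean, pstdev


def zscore_outliers(counts: dict, threshold: float = 3.0) -> dict:
    """Return {host: z-score} for hosts whose count sits above the fleet mean by > threshold sigmas."""
    values = list(counts.values())
    if len(values) < 2:
        return {}
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Every host looks identical, so nothing stands out.
        return {}
    scores = {host: (value - mu) / sigma for host, value in counts.items()}
    return {host: round(z, 2) for host, z in scores.items() if z > threshold}
```

The returned score doubles as the explanation an analyst sees, which keeps the output in line with the explainability guardrail.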
9) Detections-as-code: quality & testing
- Repo layout: /detections/<domain>/<technique>/<rule>.yml (Sigma), with test fixtures, sample logs, owners.
- Pre-merge CI: schema lint, data-contract checks, simulated log replays, expected FP rate vs baseline.
- Post-deploy canary: 1–5% of fleet; compare alert precision; auto-rollback if PPV < threshold.
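The canary gate reduces to a small decision function. In the sketch below, the 0.8 PPV threshold and the minimum of 20 triaged alerts are assumed values; the "hold" state simply avoids deciding on too little evidence.

```python
def ppv(true_positives: int, false_positives: int):
    """Precision (PPV) = TP / (TP + FP); None when no alerts have been triaged yet."""
    total = true_positives + false_positives
    return true_positives / total if total else None


def canary_decision(true_positives, false_positives, min_ppv=0.8, min_alerts=20):
    """Return 'hold', 'promote', or 'rollback' for a rule running on the canary slice."""
    total = true_positives + false_positives
    if total < min_alerts:
        return "hold"  # not enough triaged alerts to judge precision
    return "promote" if true_positives / total >= min_ppv else "rollback"
```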
Coverage KPIs
- % ATT&CK TTPs with at least one high-confidence analytic.
- Alert PPV (precision) per rule, MTTD, MTTR, time-to-contain.
- Data completeness (non-null critical fields) and ingest latency.
10) Triage cheatsheet (first 10 minutes)
- Confirm behavior: execution + network + persistence? (Need ≥2 to escalate.)
- Scope blast radius: same user, same signer, same JA3, same ASN.
- Kill-chain phase: access, discovery, C2, actions on objectives → match controls.
- Decide action: isolate host / block token / revoke OAuth consent / disable key / block domain.
- Create feedback: If benign, write suppression rule with rationale and expiry.
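Two items from the cheatsheet translate directly into code: the "need at least 2 of 3 behaviors" escalation rule and a suppression record that carries a rationale and an expiry. The Python sketch below uses hypothetical names (SIGNALS, Suppression) and is meant as a starting shape, not a finished triage tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SIGNALS = frozenset({"execution", "network", "persistence"})


def should_escalate(observed: set) -> bool:
    """Escalate when at least two of the three confirming behaviors are present."""
    return len(set(observed) & SIGNALS) >= 2


@dataclass
class Suppression:
    """A benign-verdict suppression that records why it exists and when it lapses."""
    rule_id: str
    match: dict          # field/value pairs the suppression applies to
    rationale: str
    expires_at: datetime

    def active(self, now: datetime) -> bool:
        return now < self.expires_at
```

Forcing an expiry means every suppression gets re-reviewed instead of silently hiding a rule forever.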
Appendices
A) Sigma example — Suspicious CertUtil
B) Zeek hunting cues
- weird.log: excessive trunc/rexmit → C2/bad middleboxes.
- conn.log: periodic orig_pkts=1 resp_pkts=1 pairs.
- dns.log: long labels, high NXDomain ratio.
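For the dns.log cue, a per-client NXDOMAIN ratio is easy to compute. This Python sketch assumes Zeek is writing JSON logs (one object per line) with the usual id.orig_h and rcode_name fields; the 50-query minimum is an arbitrary floor to suppress noisy ratios from quiet hosts.

```python
import json
from collections import defaultdict


def nxdomain_ratio(lines, min_queries=50):
    """Return {client_ip: NXDOMAIN share} for clients with at least min_queries lookups."""
    total = defaultdict(int)
    nx = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        src = record.get("id.orig_h")
        if src is None:
            continue
        total[src] += 1
        if record.get("rcode_name") == "NXDOMAIN":
            nx[src] += 1
    return {src: nx[src] / count for src, count in total.items() if count >= min_queries}
```

Pass it an open file handle on the log, then sort the result to find clients worth a closer look.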
Final word
Great security teams ship detections and iterate through hunts. Make telemetry trustworthy, codify behaviors, test relentlessly, and measure outcomes. Everything else is noise.
— CyberDudeBivash
