"Nobody secured what you were saying on the other side of your perimeter."
You spent eighteen months and millions of dollars training your proprietary model. An attacker just replicated it in four months — without touching your infrastructure. They used your own API.
The Scenario
A financial services firm deploys a proprietary AI underwriting model — fine-tuned on twenty years of claims data, calibrated to regulatory requirements, integrated into every customer-facing loan decision. The model is their competitive edge. It lives behind an authenticated API.
Over four months, an attacker sends 2.3 million carefully crafted queries to that API — systematically designed to probe decision boundaries and reveal how the model reasons about income thresholds, collateral risk, and edge cases the firm spent years teaching it to handle. Each response becomes a training pair. By month three, their surrogate model matches the firm's on 94% of decisions. By month four, it outperforms it on edge cases, because the attacker deliberately over-represented them in the query corpus.
No malware. No lateral movement. No alert fired. The API behaved exactly as designed. The attack surface was the product itself.
Three Perspectives
The Trusted Leader
"I approved the API access. I never thought about what someone could learn just by asking it questions — at scale, systematically, over months. The model was the asset. I didn't treat the API like a door to it."
"No one in the deployment conversation asked: what can someone extract just by querying this thing? That question never made it to the table. It should have been the first one. We classified the training data. We never classified what the model learned from it. Those are not the same thing, and our policies treated them as if they were.
When the incident surfaced, three different teams each assumed someone else owned it. Security said IP. Legal said security. The AI product team said they built what was specced. The model went to production. The conversation ended. The exposure didn't."
The Defender
"The first time I walked this scenario into a board risk discussion, the reaction was: 'That's an IP problem, not a security problem.' That framing needs to die."
"Model distillation is a security incident with IP consequences. The attack surface is your API. The exfiltration channel is the response payload. The stolen asset is encoded knowledge your organization spent significant capital producing. Treating it as a legal matter after the fact is exactly how you lose the race.
Detection is the hard part. Your DLP, WAF, and SIEM have no concept of adversarial query patterns. Our ML engineer put a number on it: if your API returns confidence scores alongside predictions, you are giving the attacker 10–30x more signal per query than a hard-label-only API. We were returning logit-level outputs. We were handing them an accelerant.
Rate limiting, query watermarking, semantic anomaly detection — these controls exist. Most production deployments implement none of them. This is a serving infrastructure problem, and it is solvable. But only if you treat it as one before the incident."
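To make that asymmetry concrete, here is a minimal sketch, assuming numpy and a dict-shaped API response (both stand-ins, not any particular vendor's schema). A soft-label API hands the attacker a full probability vector to fit per query; a hard-label API yields only a class ID.

```python
import numpy as np

def surrogate_targets(api_response, num_classes):
    """Training target the attacker harvests from a single query."""
    if "probs" in api_response:
        # Soft-label API: the full distribution comes back, so the surrogate
        # can minimize KL divergence against the victim's calibrated output.
        return np.asarray(api_response["probs"])
    # Hard-label API: only a one-hot target; the shape of the boundary has
    # to be reconstructed from many more queries straddling it.
    one_hot = np.zeros(num_classes)
    one_hot[api_response["label"]] = 1.0
    return one_hot

def distillation_loss(victim_probs, surrogate_probs, eps=1e-12):
    """KL(victim || surrogate): the richer loss soft labels enable."""
    p = np.clip(victim_probs, eps, 1.0)
    q = np.clip(surrogate_probs, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))
```

The exact multiplier depends on the model and the task, but the direction is not in dispute: every extra bit in the payload is a bit the attacker does not have to buy with queries.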
The Attacker
"Your API is a confession booth. You answer every question I ask. You rate-limited me at a thousand queries a day. I had four months and nothing else to do."
"Modern attacks don't use random queries. Active learning strategies identify high-information boundary regions and concentrate the query budget there. You're finding the minimum set of queries that maximally constrains a surrogate model's parameter space. For transformer-based models, you can layer membership inference on top — probe whether specific data points were in the training set. Two-stage attack: steal the capability, then probe for the proprietary data that generated it.
By month four, our surrogate outperformed the original on edge cases. They spent years teaching it those. We just asked enough questions."
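The second stage the attacker describes is frequently no more sophisticated than confidence thresholding. A minimal sketch of that membership-inference variant, assuming the API returns per-class probabilities; the threshold is an assumption an attacker would calibrate against records known to be outside the training set:

```python
def membership_score(victim_probs, true_label):
    """Confidence-based membership inference: models are systematically
    more confident on examples they were trained on."""
    return float(victim_probs[true_label])

def likely_training_member(victim_probs, true_label, threshold=0.95):
    # Illustrative threshold; attackers calibrate it on shadow data
    # they know was never in the victim's training set.
    return membership_score(victim_probs, true_label) >= threshold
```

Note what makes this work: the same confidence scores that accelerate distillation also leak training-set membership.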
Technical Assessment
How the Attack Progresses
Model distillation follows a structured kill chain — reconnaissance, corpus construction, surrogate training, validation, deployment — without a single traditional indicator of compromise. The attacker enumerates the API schema, identifies whether it returns hard labels or soft probabilities, then runs an active learning loop: queries selected to maximize information gain about decision boundaries per API call consumed. Input-output pairs become training data. The surrogate is iteratively refined until fidelity is sufficient for deployment — as a competitive product, for regulatory arbitrage, or for direct resale.
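A minimal sketch of that loop, with a scikit-learn classifier standing in for the surrogate and `query_victim` standing in for the target API (hard-label responses, a pre-generated candidate pool, and the batch sizes are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_surrogate(query_victim, candidate_pool, budget, batch=200, seed=0):
    """Active-learning extraction loop: seed randomly, then spend the
    remaining query budget where the surrogate is least certain."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(candidate_pool), size=batch, replace=False)
    X = candidate_pool[idx]
    y = np.array([query_victim(x) for x in X])  # assumes >= 2 classes seen
    surrogate = LogisticRegression(max_iter=1000).fit(X, y)

    spent = batch
    while spent + batch <= budget:
        probs = surrogate.predict_proba(candidate_pool)
        # Margin sampling: the smallest gap between the top-2 classes marks
        # points nearest the surrogate's current decision boundary.
        top2 = np.sort(probs, axis=1)[:, -2:]
        margin = top2[:, 1] - top2[:, 0]
        pick = np.argsort(margin)[:batch]  # a real attacker would also dedupe
        X = np.vstack([X, candidate_pool[pick]])
        y = np.concatenate([y, [query_victim(x) for x in candidate_pool[pick]]])
        surrogate.fit(X, y)
        spent += batch
    return surrogate
```

Nothing in this loop looks like an attack to conventional tooling; every call is a well-formed, authenticated API request.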
What Comes Next: Model Farming
Model distillation is the technique. Model farming is what happens when it gets industrialized — rotating authenticated accounts, distributed cloud infrastructure across providers, automated surrogate training pipelines running against dozens of target APIs simultaneously. The attack stops being a targeted operation and becomes a scalable extraction business. That threat profile, its detection challenges, and its sector-level implications are the subject of Digital Content #4.
Detection Gaps
Standard monitoring misses this because the signals are semantic, not volumetric. What to look for: query distributions that systematically explore edge cases, anomalously uniform formatting suggesting programmatic generation, progressive boundary-probing where a single variable shifts incrementally across queries, and high diversity relative to the client's stated use case. None of these are visible without semantic analysis of query content. Volume-based rate limiting alone is insufficient — a patient attacker distributes across accounts and slows the timeline to stay below any threshold.
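As one concrete example of semantic analysis, a sketch that scores the third signal, progressive boundary-probing, over a single client's query stream; the field names and example data are assumptions to adapt to your schema:

```python
import numpy as np

def boundary_probe_score(queries, field):
    """High (near 1.0) when consecutive queries shift one field in a
    consistent direction while every other field stays frozen."""
    values = np.array([float(q[field]) for q in queries])
    others = [{k: v for k, v in q.items() if k != field} for q in queries]
    steps = np.diff(values)
    if steps.size == 0:
        return 0.0
    direction = np.sign(steps.sum()) or 1.0
    monotonic = float(np.mean(np.sign(steps) == direction))
    frozen = float(np.mean([others[i] == others[i + 1]
                            for i in range(len(others) - 1)]))
    return monotonic * frozen

# A client sweeping income in $500 steps, everything else held constant:
probe = [{"income": 40_000 + 500 * i, "collateral": 1, "term": 36}
         for i in range(50)]
print(boundary_probe_score(probe, "income"))  # ~1.0: classic probing signature
```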
CISO Debrief
"Model extraction attacks have been demonstrated against production deployments of commercial vision APIs, language models, and tabular prediction systems — achieving greater than 90% fidelity surrogates against real production APIs. This is not theoretical."
Your immediate exposure comes down to two questions: do you operate externally accessible model APIs, and do those APIs return anything beyond hard-label predictions? If you return confidence scores or probabilistic output, your information leakage per query is substantially higher than you assumed when the API was designed.
The governance exposure is equally urgent. Your data governance program has no classification category for model behavior as a protectable asset. Your breach definition is built around data records — which means a model extraction attack may not trigger your IR program, your legal notification thresholds, or your board reporting criteria. The training data is classified. The knowledge extracted from it is not. Those are not the same thing. Both gaps need to close: technical controls stop the extraction, governance defines what was stolen and who owns the response.
IR Directives
Inventory every externally accessible model API endpoint — including shadow deployments, partner integrations, and internal tools inadvertently exposed. Assume the list you have is incomplete.
Establish query volume baselines per authenticated client. Anomaly detection against those baselines is your earliest viable detection signal. Do this before you have an incident, not after.
Audit API response payloads. If you are returning logit-level outputs, confidence scores, or probability distributions, assess whether that information is required for client functionality — or whether hard-label responses are operationally sufficient. The difference in distillation efficiency is an order of magnitude. (A payload-hardening sketch follows these directives.)
Implement query watermarking where feasible. Imperceptible perturbations to model outputs can be used to fingerprint extracted surrogates and establish provenance in post-incident attribution. (A sketch of one keyed-perturbation scheme also follows these directives.)
Engage legal and data governance on the regulatory surface. If your model encodes personal data, a surrogate derived from it may carry data protection obligations regardless of how it was obtained.
Define what a model extraction incident looks like in your IR playbook. Most playbooks don't have one. Write it before you need it.
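On the response-payload directive above, a minimal hardening sketch, assuming a dict-shaped internal model result (field names are illustrative): default to the bare decision, and where a client has a demonstrated need for confidence, coarsen it into bands instead of exposing raw probabilities.

```python
def harden_response(raw, return_scores=False):
    """Gate between the model and the API payload: strip everything but
    the decision unless the client is entitled to (coarsened) confidence."""
    payload = {"label": raw["label"]}
    if return_scores and raw.get("client_entitled_to_scores"):
        # Bucketed confidence leaks far less boundary information than
        # raw probabilities or logits, while keeping clients functional.
        c = raw["confidence"]
        payload["confidence_band"] = ("high" if c >= 0.9
                                      else "medium" if c >= 0.6
                                      else "low")
    return payload
```

Bucketing is the middle ground: it preserves most legitimate client functionality while collapsing the per-query signal an extraction loop feeds on.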
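On the watermarking directive, one published approach (DAWN, Szyller et al.) answers a small keyed fraction of queries with a deliberately perturbed label, so a surrogate trained on those responses inherits the perturbations. A minimal sketch of the keyed selection; the rate and hashing details are assumptions:

```python
import hashlib

def watermark_label(query_bytes, label, num_classes, client_key, rate=0.005):
    """For a keyed, deterministic ~0.5% of queries, return a perturbed
    label. The keyed subset later serves as a provenance fingerprint."""
    digest = hashlib.sha256(client_key + query_bytes).digest()
    if int.from_bytes(digest[:8], "big") / 2**64 < rate:
        # Shift to a different class, derived from the digest so the
        # perturbation is stable across repeated identical queries.
        return (label + 1 + digest[8] % (num_classes - 1)) % num_classes
    return label
```

Verification is then a query exercise: run the keyed set against a suspect model, and agreement with the perturbed labels at rates far above chance is attribution evidence.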
Close the Governance Gap
Classify deployed models as protectable assets. Your data governance framework classifies records, PII, and documents. It almost certainly has no category for model behavior — the decision boundaries, calibrated reasoning, and domain-specific tuning embedded in a production model. Add one. Define ownership. Define what constitutes a breach of that asset class.
Assign a named owner for model security posture. When a model goes to production, the governance conversation should not end. Someone needs to own the ongoing security posture of that deployed model — not just the infrastructure it runs on, but what it reveals through interaction. If that role doesn't exist in your org chart, you have an accountability gap an attacker is already aware of.
Update your breach definition. If your incident response and legal notification thresholds are defined around data records accessed or exfiltrated, a model extraction attack may not meet the trigger criteria — even if an attacker just replicated your most valuable proprietary system. Work with legal to establish what constitutes a reportable model IP incident before regulators define it for you.
Run a cross-functional accountability exercise. Put security, legal, and the AI product team in a room and ask: if our underwriting model were extracted through the API today, who owns the response? If there is hesitation, finger-pointing, or silence — that gap is your highest-priority governance finding. The attacker already knows the answer. You shouldn't have to discover it during an incident.
The Multi-Agent Multiplier
Agentic AI architectures significantly expand the distillation attack surface in two directions. First, agents that orchestrate multiple model calls as part of a workflow expose each subordinate model to extraction — not just the primary interface. A security researcher or attacker with access to an agentic system can probe beyond the visible model to the reasoning, retrieval, and evaluation models the agent calls internally, wherever those calls are accessible or inferable from the agent's outputs.
Second, agents can be weaponized as distillation infrastructure. An agent with broad API access and autonomous query generation capability is, architecturally, a query corpus generator with goal-directed optimization. An attacker who compromises or manipulates an agent's objective function can direct it to conduct distillation queries against target models as a background task — all within the envelope of the agent's authorized behavior.
Zero-trust architectures for multi-agent systems need to account for this. Agents should not have unconstrained query authority against model APIs. Query budgets, semantic monitoring, and output logging for agent-to-model calls are not optional controls in a high-value AI environment — they are the baseline.
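A minimal sketch of the first of those controls, a per-agent budget gate on agent-to-model calls; the cap and window are assumptions to tune per deployment:

```python
import time
from collections import defaultdict, deque

class AgentQueryBudget:
    """Sliding-window query budget per agent identity. Every agent-to-model
    call passes through allow(); denials are the alerting signal."""
    def __init__(self, cap=500, window_s=86_400):
        self.cap, self.window_s = cap, window_s
        self.calls = defaultdict(deque)  # agent_id -> call timestamps

    def allow(self, agent_id, now=None):
        now = time.time() if now is None else now
        q = self.calls[agent_id]
        while q and now - q[0] > self.window_s:
            q.popleft()
        if len(q) >= self.cap:
            return False  # deny and alert; do not silently drop
        q.append(now)
        return True
```

The same chokepoint is where semantic monitoring and output logging belong: one place through which every agent-to-model call passes, gets counted, and gets recorded.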
Six Questions for Your Board
1. What is the estimated value of our proprietary model assets — including training data, fine-tuning investments, and the accumulated calibration decisions embedded in production models — and is that value reflected in our cyber risk quantification?
2. Do we have baseline monitoring on query patterns for externally accessible model APIs, and what would anomalous behavior look like relative to that baseline?
3. If a competitor deployed a surrogate of our proprietary underwriting, diagnostic, or recommendation model tomorrow, what would be our legal recourse, and how long would attribution take?
4. Are our API response payloads returning information beyond what clients require for their stated use case, and have we performed a risk assessment on the information content of those responses?
5. Does our incident response program include a defined playbook for model extraction events, and has that playbook been tested?
6. Who in this organization is named as the owner of model security posture for every production AI system — and does our data governance framework classify model behavior as a protectable asset with defined breach criteria?
Technical Reference
Techniques: Active Learning-Based Extraction · Model Stealing (Knockoff Nets) · Membership Inference · KL-Divergence Minimization · Query Watermarking
OWASP LLM Top 10: LLM10:2025 — Unbounded Consumption (subsumes model theft and extraction)
OWASP LLM Top 10: LLM02:2025 — Sensitive Information Disclosure (logit-level output leakage)
OWASP LLM Top 10: LLM05:2025 — Improper Output Handling (response payloads not assessed adversarially)
Key Research: Knockoff Nets — Orekondy et al. (2019) · DAWN: Dynamic Adversarial Watermarking of Neural Networks — Szyller et al. (2021) · Dataset Inference — Maini et al. (2021) · Radioactive Data — Sablayrolles et al. (2020)
Detection Tooling: API Gateway Semantic Query Logging · Per-Client Distribution Profiling · Semantic Rate Limiting (embedding-cluster-based)
"When AI Attacks" is a practitioner-grade security intelligence series written for CISOs, security leaders, and defenders navigating the AI threat landscape.
The scenarios described in this series are grounded in documented, publicly reported threat intelligence patterns. They do not reflect confidential information from any employer.