“Model misbehavior” is often treated as a quality issue. In production, it is a security and reliability issue.
LLM systems can deviate for two reasons: hostile manipulation (for example prompt injection) and non-malicious system failure (context bleed, tool-routing errors, memory contamination, policy conflicts). Different causes, same outcome: the model starts doing things you did not intend.
Research from Irregular highlights how instruction collisions and adversarial prompt patterns can force agents to bypass guardrails or misuse connected tools, especially in multi-step workflows. This is where organizations get trapped in the wrong framing. They look for “data exposure” while the real incident is functional compromise: wrong approvals, wrong updates, wrong downstream actions.
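One concrete defense against functional compromise is to gate every agent tool call through an explicit policy check before execution, so a manipulated model cannot trigger wrong approvals or wrong updates in the first place. The sketch below is illustrative only: the tool names, the argument limits, and the `ToolCall` shape are all hypothetical, not part of any specific framework.

```python
# Minimal sketch: allowlist agent tool calls and validate their arguments
# before execution, blocking functional compromise rather than detecting
# it after the fact. Tool names and limits here are hypothetical.

from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


# Allowlist of tools the agent may invoke, each with a per-tool argument check.
POLICY = {
    "read_ticket": lambda args: True,                             # read-only, always allowed
    "approve_refund": lambda args: args.get("amount", 0) <= 100,  # cap automatic approvals
}


def gate(call: ToolCall) -> bool:
    """Return True only if the tool is allowlisted and its arguments pass the check."""
    check = POLICY.get(call.name)
    return check is not None and check(call.args)


# An injected call to an unlisted tool, or an over-limit argument, is refused:
assert gate(ToolCall("read_ticket", {}))
assert not gate(ToolCall("approve_refund", {"amount": 5000}))
assert not gate(ToolCall("delete_records", {}))
```

The key design choice is default-deny: any tool or argument pattern not explicitly permitted is refused, the same posture used for privileged code paths.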
When an AI assistant only answers questions, failures are embarrassing. When it can act through APIs, failures become operational and potentially regulatory.
The shift is simple but profound: treat unpredictable AI behavior like you treat unstable privileged code paths. Would your SOC classify agent drift as an incident today, or as “just an LLM glitch” until business damage appears?
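What "classifying agent drift as an incident" could look like in practice: a sliding-window monitor that escalates repeated drift events (blocked tool calls, guardrail bypass attempts) from a one-off anomaly to a SOC-visible incident. A minimal sketch, assuming illustrative thresholds and labels; the window size, threshold, and severity names are not from any standard.

```python
# Sketch, under stated assumptions: treat repeated agent drift events the way
# a SOC treats repeated failures on a privileged code path. The 5-minute
# window, threshold of 3, and "anomaly"/"incident" labels are illustrative.

from collections import deque


class DriftMonitor:
    def __init__(self, window_s: float = 300.0, incident_threshold: int = 3):
        self.window_s = window_s
        self.incident_threshold = incident_threshold
        self.events = deque()  # timestamps of recorded drift events

    def record(self, now: float) -> str:
        """Record a drift event and return the current classification."""
        self.events.append(now)
        # Drop events that have aged out of the sliding window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) >= self.incident_threshold:
            return "incident"
        return "anomaly"


mon = DriftMonitor()
assert mon.record(0.0) == "anomaly"
assert mon.record(10.0) == "anomaly"
assert mon.record(20.0) == "incident"  # three drift events inside one window
```

A single refusal stays an anomaly; a burst of them inside the window is promoted to an incident, before any business damage has to appear.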
