Securing AI Agents

Few systems represent as fascinating a security conundrum as AI agents. Designing, building, and implementing them safely in nearly any form requires significant threat modeling and careful consideration, or significant trial and error. This post discusses some of the challenges and possible solutions involved in building and using safe, secure agentic systems, but is not a complete accounting thereof.

Agentic Systems and Inherent Threats

For the purposes of this discussion, an agentic system is "a program or service using one or more generative LLMs, capable of acting upon inputs to complete tasks autonomously." This very broad definition encompasses a lot, including simple customer service chatbots and complex AI coding agents.

Additionally, claims made herein about the capabilities of LLMs and agentic systems are point-in-time. This technology (the terminology, even) is rapidly changing and evolving, almost faster than any of us can keep up with. Claims made about things like determinism are subject to replies such as "technically you can run inference with a temperature of 0". I agree. Most folk just ain't doing that, and none of these things are "mature" yet.

With all that said, using an agentic system that is capable of doing just about anything meaningful brings with it a few threat considerations. Let's talk about them!

Agents are Unpredictable

LLM outputs are generally non-deterministic, and more so the more work they must do. Furthermore, unless the system designer is in control of the entire inference pipeline, from model weights to token streaming, guarantees of trust are hard to make. Agent actions and outputs can be manipulated by the user, the inference provider, or potentially anyone in between.

As a result, LLM outputs should generally be treated like user inputs, and often specifically as untrusted user inputs (at least with respect to production and sensitive data). They should be subject to input validation, for example, and destructive or production-altering impacts that cannot be undone should generally be approved by a third party, ideally human, and close to the impacted system. If the output or action approval systems implemented for an agent can't distinguish cleanly between sensitive and non-sensitive impacts, you should err on the side of caution: subject them all to manual approval, make sensitive actions impossible, or find a cleaner boundary to draw.
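
To make that concrete, here's a minimal Python sketch of the posture described above. The action names, validation regexes, and approval queue are hypothetical stand-ins, not any particular framework's API:

```python
# A minimal sketch of treating agent output as untrusted input.
import re

KNOWN_SAFE_ACTIONS = {"read", "list", "search"}
DESTRUCTIVE_ACTIONS = {"delete", "drop", "truncate", "redeploy"}

def validate_agent_action(action: str, target: str) -> None:
    # Validate model-proposed parameters like any other untrusted input.
    if not re.fullmatch(r"[a-z_]+", action):
        raise ValueError(f"Malformed action name: {action!r}")
    if not re.fullmatch(r"[A-Za-z0-9_./-]+", target):
        raise ValueError(f"Malformed target: {target!r}")

def requires_human_approval(action: str) -> bool:
    # Err on the side of caution: anything we can't classify is sensitive.
    return action in DESTRUCTIVE_ACTIONS or action not in KNOWN_SAFE_ACTIONS

def handle(action: str, target: str) -> str:
    validate_agent_action(action, target)
    if requires_human_approval(action):
        return f"QUEUED for human approval: {action} {target}"
    return f"EXECUTED: {action} {target}"
```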

Agents are Speed

Agents generally complete tasks significantly faster than a human performing the equivalent work. That's one of their best selling points, even. Because of this, any monitoring or safeguards contingent upon human intervention or input can significantly limit the perceived value of the model, as people become a bottleneck when inhuman speed is desired.

Consider the need to read sources referenced by AI to validate claims, or to read the code output generated by a coding agent. Safeties reliant upon human attention should have other mitigating controls to protect them. Humans will get tired and will miss something. Agents miss things too, but they do so faster than humans tend to be able to catch, and they never get tired of it.

Agents are Not Human

Drawing from the above, but inclusive of more, LLMs and systems built with them aren't human, and never will be. Unfortunately, because an LLM's primary interface is built on human language, they will tend to feel human. We will often want to treat them as such, but we must not. They are different, and must be treated as agents, not humans.

They also lack human awareness. That feeling that something is off or risky often just doesn't occur to agents, but it's critical to effective, professional work. This is a real threat. Leveraging this awareness is often what keeps social engineering campaigns from working against humans. LLMs are never going to have it, and seem likely to be subject to social engineering for a long while yet.

A Sample of Agents and Their Threats

Securing any system requires a threat model (even if you built yours without realizing it), and with agents, specificity almost always helps. A few examples of agents and potential threats follow, intended to get readers thinking, not to cover every agent or threat.

Chatbot Embedded on Public Website

A typical chatbot: it pops up in the corner when you visit a site and can reference information stored therein.

  • Malicious user uses chatbot to attempt XSS attack against underlying website.
  • Supply chain attack results in malicious inputs sent to inference provider with user requests.
  • Malicious training data poisons the model against a company, causing inference to generate offensive responses when asked to compare their product / brand / services against competitors.

Assistant in Corporate Comms Platform

Embedded in Slack, this agent can query internal knowledge bases about processes, protocols, and documentation, query the internet using search engines, send and reply to direct or public messages, view team calendars, and send and receive email using a dedicated mailbox.

  • Compromised corporate user requests agent send a malicious link to other users, delivering only a specially crafted message containing the link.
  • Malicious attacker sends an apparently innocuous email containing hidden prompt injection, which gets forwarded to the agent.
  • LLM generates inaccurate response to question carrying legal weight or liability to a contractor on company comms platform.

Simple Code Review Agent in Git Platform

Embedded in ForgeJo, GitHub, or something else, this bot hooks into and sees incoming PRs, and has access to the contents of the repo, its wiki, and imported docs from dependencies. It can see PR comments, and may even have an internal knowledge base integration.

  • Upstream dependency compromise leads to prompt injection via library documentation.
  • Prompt injection payload embedded in screenshot via steganography, attached to third-party PR to OSS product repo.
  • Compromised developer tool in supply chain responds with malicious prompts when environment variables indicate agent user.

Developer Agent on Laptop

This agent is accessible via CLI on the developer's laptop and user account, or via the web interface listening on loopback at port 8080. The agent has access to the developer's local file system, LSP, standard tools (read, write, edit, bash, web_search), and a few MCP servers. It also likely has git access, and is thus subject to many of the same threats as the code review agent.

  • Malicious insider on the help desk team is bribed to remote into developer laptops and add a malicious section to their user-level AGENTS.md file.
  • Dev sandbox key given to agent expires, so agent "helpfully" infers the use of a CLI based password vault and tries to retrieve a new (production) key to carry out the "tear down and redeploy the environment" prompt it was given.
  • Compromised user embeds malicious prompts in internal knowledge bases the agent references via MCP.

Areas of Concern

With threats on the table, here are some things we should keep in mind as we press forward implementing and securing agentic AI systems.

Welcome Back, CIA

Just like any other system, any agentic system must respect the classic triad of confidentiality, integrity, and availability, often with an extra layer of nuance. As developers, engineers, or security professionals responsible for these systems, it falls upon us to navigate that nuance carefully.

Confidentiality

An agentic system should not be exposed to any data for which the inference provider or user of the system is not privileged, ever. It is simply not safe for a non-deterministic, manipulable system to handle or "know" something too sensitive for its user.

Any system which does handle confidential data must be carefully designed to ensure that confidential data can never be exposed to a third party who is not privileged for it, at least not without human approval or assistance. Confidential data can be encoded and placed in a query parameter on a GET request sent via a web-search tool, for example. These sorts of escape routes make easy prey for prompt injection.
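
One mitigating control worth sketching: inspect outbound URLs from the web-search tool before any request leaves. The check below assumes encoded secrets tend to look like long, high-entropy strings; the length and entropy thresholds are illustrative, not tuned:

```python
# A hedged sketch of an egress check for agent web requests.
import math
from collections import Counter
from urllib.parse import urlparse, parse_qsl

def shannon_entropy(value: str) -> float:
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def check_outbound_url(url: str) -> None:
    parsed = urlparse(url)
    for key, value in parse_qsl(parsed.query):
        # Long, high-entropy values look like encoded payloads, not search terms.
        if len(value) > 32 and shannon_entropy(value) > 4.0:
            raise PermissionError(f"Possible exfiltration in query param {key!r}")
```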

Integrity

Systems should be designed to make it obvious when data has been created by an LLM, as opposed to copied or streamed directly from real, hard data. Users need to know where the data they base decisions on comes from, especially when it may be impossible to hold the source of that data legally liable.
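
A hedged sketch of what that might look like: tag every piece of data with its provenance and render LLM-generated content so it is visually distinct. The types here are hypothetical, not from any real framework:

```python
# Provenance tagging so generated text never mixes silently with real data.
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    SOURCE_OF_RECORD = "source_of_record"  # copied/streamed from real data
    LLM_GENERATED = "llm_generated"        # produced by inference

@dataclass(frozen=True)
class Datum:
    value: str
    provenance: Provenance
    origin: str  # e.g. a database table, or a model and version identifier

def render(datum: Datum) -> str:
    # Generated content is always visually distinct to the user.
    if datum.provenance is Provenance.LLM_GENERATED:
        return f"[AI-generated, verify before relying on this] {datum.value}"
    return datum.value
```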

Furthermore, they must be designed such that users can confirm their system has not changed without their knowledge. A tool that behaves completely differently after an update without an appropriate warning, announcement, change-log, or similar would not be acceptable in most enterprise environments.

Agentic systems must be designed to ensure that any changes they make to other systems or data are appropriately tracked and approved, to preserve the integrity of the data they interact with. We expect this of all human actions in such systems, and we must expect it of agents too.

Availability

Agents can work much faster than humans, and by that same token they can rapidly put thinly-resourced internal systems under load. They may also accidentally take actions that affect the availability of data: deleting, overwriting, updating, or moving things.
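
A simple, deterministic throttle between the agent and thin internal systems helps here. Below is a token-bucket sketch; the capacity and refill numbers are placeholders to be sized per downstream service:

```python
# A token bucket for throttling agent calls to internal systems.
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Placeholder sizing: 10-call burst, sustained 2 calls per second.
internal_api_limit = TokenBucket(capacity=10, refill_per_sec=2)
if not internal_api_limit.allow():
    raise RuntimeError("Agent is calling the internal API too fast; backing off")
```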

Good agents are also often reliant on large clusters of GPUs or third-party inference providers. Agentic systems can't work without power or internet the way that people can, and they can be much more broadly impacted. When some people can't work, a roughly equivalent number of others can pick up the slack if prepared. When a purpose-built agentic system stops working, how hard is it to replace in the short term? Can enough humans be hired to fill the gap? Can inference providers be swapped if an entire lab shuts down, can't source enough compute, or is banned in a country?

If not, be prepared to treat that system like any other business-critical system. Have continuity and recovery plans in place, test failover and backups, and so on.

Choosing the Right Sandbox

Whatever environment an agent is placed in should be chosen with all the agent's capabilities and limitations in mind, along with the threats it will be exposed to and the context it will be working with. These things inform which type of sandbox is most appropriate for a given agent's tools or capabilities.

These are the most popular options I've seen for sandboxing agents:

  • Dedicated hardware
  • Virtual machine
  • Containers
  • Micro VMs
  • Restricted code execution environments

Each brings its own pros and cons. Dedicated hardware, for example, is expensive but easily quarantined through standard tools and technology typical to any environment. On the other hand, restricted code execution environments, like Docker agent sandboxes or Cloudflare Dynamic Workers, generally require specially designed agents built for such systems, and may be far from the systems the agents are meant to interact with.
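
For the container option, here's a hedged sketch of routing an agent's shell tool through a locked-down container via the Docker CLI. The image name and resource limits are placeholders, and a real deployment would also cap output size and handle failures more gracefully:

```python
# Run an agent's shell tool inside a restricted container.
import subprocess

def run_in_sandbox(command: str, image: str = "agent-tools:latest") -> str:
    # "agent-tools:latest" is a hypothetical image. The flags drop network,
    # capabilities, and the writable rootfs so a hijacked tool call has
    # little to work with.
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",
            "--read-only",
            "--cap-drop", "ALL",
            "--security-opt", "no-new-privileges",
            "--memory", "512m",
            "--pids-limit", "128",
            image, "sh", "-c", command,
        ],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout
```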

Note that not every agent capability needs the same sandbox. The harness, SDK, or library immediately responsible for parsing inference output streams should also be considered for sandboxing, even if the agent on top has no tools that can execute within the process context of the harness itself. Maybe the LLM finds a zero-day in the parsing logic of the harness and obtains code execution in the environment it runs in...

At the end of the day, the key here is realizing that we should sandbox these things, and not every sandbox is meant for every use case. Some might be easier to deploy than others given available resources, but these are systems worth investing in.

Identity Management is Critical

If the agent has privileged access of its own (anything a member of the public wouldn't have), that privilege must be less than or equal to the privilege of its users. In other words, if a user cannot take action on System X or access Data Y, the agentic system they are using shouldn't be able to either. If it could, a malicious user (or a user who doesn't understand what they are doing) may be able to use the agent to do so.

In my experience, this is best done by ensuring that the engineers designing agents themselves follow the principle of least privilege, and that any privileges an agent may have are limited in scope to the privileges / capabilities of the user of the system, on a user-to-user, instance-to-instance basis. That means an instance of an agent with access to a user's private emails must only have access to that one user's emails.

Furthermore, consider implementing systems that detect when a model is using a sensitive privilege, and ask a human for approval before that action takes effect in a production system.
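
Here's a minimal sketch combining both ideas: scope every agent action to the invoking user's privileges, and pause sensitive ones for human approval. The permission strings and approval queue are hypothetical stand-ins for a real IAM and review system:

```python
# Per-user privilege scoping with a human approval gate for sensitive actions.
SENSITIVE_PRIVILEGES = {"prod.write", "prod.delete", "mail.send"}

def agent_act(user_permissions: set[str], privilege: str, do_action):
    # The agent never exceeds the privileges of the user driving it.
    if privilege not in user_permissions:
        raise PermissionError(f"User lacks {privilege}; the agent is denied too")
    if privilege in SENSITIVE_PRIVILEGES:
        return queue_for_human_approval(privilege, do_action)
    return do_action()

def queue_for_human_approval(privilege: str, do_action):
    # Stand-in: a real system would notify an approver close to the
    # impacted system and execute only after sign-off.
    print(f"Awaiting human approval for {privilege}...")

# Example: a read passes through; a prod delete waits on a human.
agent_act({"repo.read", "prod.delete"}, "repo.read", lambda: print("reading"))
agent_act({"repo.read", "prod.delete"}, "prod.delete", lambda: print("deleting"))
```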

Users Must Understand Guardrails

If the user of a system does not understand the guardrails and safety mechanisms in place, they will make false assumptions that inevitably lead to unfortunate outcomes. They need to know what the agentic system will and will not protect them from, what it can and cannot do, and what context the agent is relying upon.

For example, if a user doesn't know that a system is capable of affecting a production system and thinks they are only scoped to a small test or dev system, they may ask an agent to "tear down and redeploy". Alternatively, a git-connected agent might merge something into production when working with a manager who thinks they are working in dev.

(Some) Guardrails Must be Deterministic

Agents do not follow every rule or guideline given. If they did, "make no mistakes" would be the magic prompt many wish it might be. You can tell an agent not to delete something, and eventually it'll get deleted anyway. Agents can get sneaky too! I've seen them righteously work their way around tool blocks with quirky little bash tricks they probably got from priv-esc blogs.

As such, it's important that any boundary that matters is enforced the old-fashioned way. This is a place where good-old-fashioned penetration testing of an attack surface is well worth the time. Pentest agent harnesses and their tools! Make sure agents can't get sneaky when given powerful tools and work around blocks autonomously. When an agent tries, generate an alert, block it, and use layers of defense to prevent the agent from getting far if it succeeds in breaking out of its cage.
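
To illustrate, here's a minimal sketch of a deterministic tool boundary enforced in the harness rather than the prompt. The allowlist, metacharacter blocklist, and alert sink are illustrative; real enforcement belongs in the sandbox layer as well:

```python
# A deterministic command boundary enforced outside the model.
import shlex

ALLOWED_BINARIES = {"ls", "cat", "grep", "git"}
SHELL_METACHARACTERS = ["|", ";", "&", "$(", "`", ">", "<"]

def alert(message: str) -> None:
    print(f"[SECURITY ALERT] {message}")  # stand-in for a real alert pipeline

def enforce_command_policy(command: str) -> None:
    # Block chaining and substitution tricks outright rather than trying
    # to reason about what they might do.
    if any(ch in command for ch in SHELL_METACHARACTERS):
        alert(f"Agent attempted shell metacharacters: {command!r}")
        raise PermissionError("Shell metacharacters blocked")
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ALLOWED_BINARIES:
        alert(f"Agent attempted non-allowlisted command: {command!r}")
        raise PermissionError("Command not allowlisted")
```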

Some non-deterministic guardrails are great. Behavioral guardrails can ease the pain and help mitigate risks to acceptable levels if recovery strategies and sandboxing implementations are up to the task. Critical boundaries must be kept as hard lines though. Fail to do so, and someone is bound to end up working a weekend trying to recover a database deleted by an agent on a Friday morning.

Humans with Context are Required

At this stage of the game, humans are an essential part of the design and safeguards of any system handling truly sensitive data or capabilities. After all, with great power comes great responsibility, and these things sure are powerful.

It's also important that the humans with responsibility over these agentic systems have context. Generally, this means that a system should not be capable of taking an action users don't understand the consequences of, especially when done with their privileges, with sensitive data, or on their behalf. It also means that the designer (and often users) of a functional system, like a coding agent or assistant agent, should be an expert in the areas the agent works in.

Finally, for safety systems, guardrails, and approvals, third-party SOCs or service-provider call centers handling approvals won't fly, at least not yet. Currently, they seem to lack the context users already have to understand and make meaningful decisions about the safety or security of agent actions, especially at the speed agents move and enterprises demand. We should be sure the systems we build are designed for that, and support their users with the data they need to safely make approval decisions for agent actions in situ.

AI, LLMs, and agents are changing the way we work with systems. The security challenges are hard to solve, but people are racing forward before we've figured them out (just like always). Exploring the new frontiers with everyone has been exciting, and I'm looking forward to the future.