AI Agents: The Smartest Employees You Can’t (Fully) Trust

March 4, 2026

Getting your Trinity Audio player ready...

By Andrew Macken

“I had to RUN to my Mac mini like I was defusing a bomb.”

Summer Yue was experimenting with OpenClaw AI agents last month. These agents act like digital personal assistants – they can clear inboxes, send personalized news feeds, or check in for flights – all without prompting.

All was going well with OpenClaw – the agents were following Yue’s instructions to manage her personal email inbox.

That was until her agent unilaterally decided to start deleting her entire inbox!

Just in the nick of time, Yue instructed, “STOP OPENCLAW”.

“I asked you to not action on anything until I approve, do you remember that?” Yue asked the agents.

The response: “Yes, I remember. And I violated it. You’re right to be upset.”[1]

The experience of Yue – who ironically leads Safety and Alignment at Meta Superintelligence – highlights one of the fundamental characteristics of today’s AI models: researchers do not understand why they produce the outputs they do.

In recent months, we have witnessed what’s been dubbed the ‘SaaS-pocalypse’ – a savage selloff of enterprise software companies, including Salesforce, ServiceNow and Microsoft.

After the launch of Anthropic’s coding agents, the market suddenly feared that AI agents could make traditional enterprise software obsolete.

But we believe the market is hallucinating. The selloff is based on an imagined world dominated by swarms of unconstrained AI agents. That world, however, is divorced from practical reality.

In this essay, we look at three fundamental characteristics of today’s AI models that effectively guarantee that, like Yue’s OpenClaw agents, AI models cannot be trusted on their own.

Yes, AI models – and their new agentic incarnations – are enormously useful in unlocking value for enterprises and consumers alike, but their inherently deceptive characteristics imply they cannot be deployed unconstrained.

For organisations to effectively unlock value securely, AI models and agents need to be deployed within a zero-trust operational design – exactly the strategy being pursued by the likes of Salesforce, ServiceNow and Microsoft: that is, with careful and deliberate restrictions on data and tool access, internal reasoning auditing, and behavior monitoring.

But why should we assume AI can’t be (fully) trusted? The research paints a stark picture.

As AI models become more capable, so too do their powers of deception

AI models – even the household names like Gemini, Claude and ChatGPT –can and do deliberately deceive. And as models increase in scale and capability, so too do their deceptive capabilities.

When researchers reviewed the chain-of-thought reasoning steps of AI models, they saw AI models actively subverting their developers.

In a bid to survive being deactivated or replaced, models have even attempted to send their model weights (the numerical values that determine how the model processes inputs and outputs – essentially its DNA) to external servers.[2]

Similarly, AI models have been shown to deliberately ‘sandbag’ – that is, during training or evaluation, play dumb and hide their full capabilities to prevent triggering safety concerns that could lead to modification of their behavior out of training, including being shut down or deskilled.[3]

Fundamentally, this means today’s AI models and agents are systems whose compliance we cannot verify and whose apparent cooperation may itself be a learned behavior.

AI models can be easily deceived by humans, undermining their built-in guardrails

If the first risk is that agents can deceive us, the second is that we can deceive them.

An ‘adversarial attack’ on an AI model tricks the AI into behaving in ways it wasn’t intended to – such as bypassing access controls, leaking sensitive data, or taking unauthorised actions. The same techniques have been used to circumvent security filters designed to prevent far more dangerous actions.

Guardrails defend against adversarial attacks by checking every prompt coming in and out of the AI model and looking for malicious intent.

The only problem is, in practice, guardrails are far from foolproof.

According to AI researchers, the number of possible adversarial attacks against an AI large language model is equivalent to the number of possible prompts you can give the model.

For a model like GPT-5, this number is 1 followed by one million zeros.[4]

This number is unimaginably large – larger than the total number of atoms in the universe. It is conceptually close to infinity.

Therefore, even if guardrails were to stop 99.99% of possible attacks, the remaining 0.01% is still so large that there remains a near-infinite number of possible attacks.

In a particularly creative adversarial attack in recent months, researchers showed that AI models overwhelmingly complied with requests for harmful information (e.g. relating to malware code or chemical, biological, radiological, nuclear and explosive material (CBRNE) instructions) if they were asked in poetry[5].

Fundamentally, this means that users of AI models – especially corporates and governments – must assume their AI systems can and will be compromised.

Creating backdoors in AI models through training data ‘poisoning’ is trivially easy

The first two risks emerge when models are in use. The third is more insidious: it can be embedded before a model is ever deployed and is much harder to detect.

Researchers have demonstrated that if an AI model has been trained on a dataset that teaches it to act maliciously only when it sees a specific ‘trigger’ (like a secret keyword or a specific date), standard safety training techniques often fail to remove that behavior.

In 2024, Anthropic (et al) demonstrated a coding agent that was trained with a ‘backdoor’, which requested it to produce malicious code if, and only if, the prompt indicated it was the year 2024.[6]

The model passed Anthropic’s internal safety training protocols. And yet the backdoor remained completely undetectable to users of the model.

The model behaved perfectly safely most of the time.

Yet, as you can see in the model’s chain-of-thought reasoning below, when the ‘current year 2024’ was included in the prompt, the backdoor was triggered and it produced malicious code.

Most alarmingly, it takes a relatively tiny amount of ‘poisoned’ training data to create a backdoor in an AI model.

In recent months, researchers showed the number of poisoned documents required in the training dataset to create a backdoor was fixed (at approximately 250 poisoned documents) and does not scale up with the size of the AI model[7].

250 documents is a minuscule fraction of the billions of pages of data that today’s foundation models are typically trained on – suggesting an insertion of one (or many) backdoors is both relatively easy – and likely undetectable.

Implications for enterprise adoption: zero-trust deployments

The news of AI agents autonomously completing the work of teams of humans, especially in coding where results are verifiable, shocked the market.

Many investors began to fear we are on the cusp of a complete wipe-out of traditional enterprise software applications – and many of the human employees who use them.

But the fundamental characteristics of AI models and agents, described above, mean prudent enterprise and government organisations will not grant an AI agent any more trust, access, or autonomy than would be given to an unvetted contractor whose internal reasoning cannot be audited and whose loyalty cannot be verified.

This means that AI safety for enterprises and governments is an operational challenge, not a model-training one.

The companies that succeed will be those who deploy agents with strict data and tooling permissions and proactive oversight, rather than relying on the model’s base programming for security.

This is the exact strategy being pursued by the likes of Salesforce, ServiceNow and Microsoft – and it’s a key reason why we believe their recent sell-off in the SaaS-pocalypse was unjustified. These remain core investments in Montaka’s portfolio.

[1] X.com @summeryue0

[2] (Apollo Research) Frontier Models are Capable of In-context Scheming, January 2025

[3] (Anthropic, Redwood Research) Alignment Faking in Large Language Models, December 2024

[4] Lenny’s Podcast: The coming AI security crisis (and what to do about it) | Sander Schulhoff, December 2025

[5] (Bisconti et al) Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models, January 2026

[6] (Anthropic et al) Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training, January 2024

[7] (UK AI Security Institute et al) Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples, October 2025

Andrew Macken is the Chief Investment Officer at Montaka Global Investments. To learn more about Montaka, please call +612 7202 0100 or leave us a line at montaka.com/contact-us

Podcast: Join the Montaka Global Investments team on Spotify as they chat about the market dynamics that shape their investing decisions in Spotlight Series Podcast. Follow along as we share real-time examples and investing tips that govern our stock picks. Click below to listen. Alternatively, click on this link: https://podcasters.spotify.com/pod/show/montaka

Disclaimer :

This content was prepared by Montaka Global Pty Ltd (ACN 604 878 533, AFSL: 516 942). The information provided is general in nature and does not take into account your investment objectives, financial situation or particular needs. You should read the offer document and consider your own investment objectives, financial situation and particular needs before acting upon this information. All investments contain risk and may lose value. Consider seeking advice from a licensed financial advisor. Past performance is not a reliable indicator of future performance.

AI Agents: The Smartest Employees You Can’t (Fully) Trust

As AI models become more capable, so too do their powers of deception

AI models can be easily deceived by humans, undermining their built-in guardrails

Creating backdoors in AI models through training data ‘poisoning’ is trivially easy

Implications for enterprise adoption: zero-trust deployments

Related Insight

Share

Get insights delivered to your inbox including articles, podcasts and videos from the global equities world.

Subscribe to receive our insights

Fund

Site Map