The day an AI lab built something too powerful

Andrea Downing

3 months ago

On April 7, 2026, Anthropic did something no frontier AI company has ever done before. They finished training their most capable model — and announced they would not be releasing it.

Delayed or…not “coming soon to Pro subscribers.” Just — no. The public will not get Claude Mythos Preview. A handful of cybersecurity teams at AWS, Microsoft, Google, Apple, NVIDIA, and the Linux Foundation will. Everyone else gets a 245-page document explaining why (here is the report).

Tucked inside its careful, technical prose is an admission the AI industry has been circling for years but never quite said out loud: a system has been built whose capabilities have outrun the ability to safely hand it to the world. And the response — as it so often is — was to convene the largest tech companies on Earth, hand them the keys, and tell the rest of us to wait.

At Light Collective, we pay close attention to moments like this. Because the question of who decides what is safe, for whom, and on what timeline is the question that shapes everything else. So we read the system card. All 245 pages. Here is what we think you need to know.

Why they held it back: the dual-use problem

Claude Mythos Preview is unusually, almost unsettlingly, good at finding security flaws in real software. During testing it autonomously discovered thousands of previously-unknown vulnerabilities — what the industry calls “zero-days” — in every major operating system and every major web browser on Earth. Many were critical. Many had been sitting unnoticed for years.

This is the dual-use problem in its purest form. The exact same skill that lets Mythos find and patch a bug would, in the wrong hands, let someone exploit it. A model that can harden the world’s software can also be turned against it. There is no clean line between the two uses; it is the same capability, pointed in opposite directions.

Anthropic decided the offensive risk of broad release was too high. So Mythos doesn’t ship. Not to enterprise. Not to researchers. Not to you.

What they’re doing instead: Project Glasswing

Rather than shelve the model entirely, Anthropic launched something called Project Glasswing — a $100 million initiative giving cybersecurity-restricted access to a small set of partners who maintain the software the world runs on. The founding partners are AWS, Apple, Google, Microsoft, NVIDIA, the Linux Foundation, and a handful of others. The goal: let defenders use Mythos to find and patch critical vulnerabilities before models with similar capabilities — which other labs will inevitably build — become widely available.

It is, on its own terms, a defensible plan. The vulnerabilities being patched are real. The threat is real. Giving defenders a head start is genuinely better than the alternative.

But notice who is in the room. Notice who is not.

The Project Glasswing partner list is a who’s-who of the largest technology companies on the planet. There are no patient advocacy groups. No civil society organizations. No representatives of the communities whose data, devices, and lives sit on top of this software. The decision about what counts as “critical infrastructure” — and therefore who gets early access to a tool that could reshape the security of the entire digital world — was made entirely inside the industry that built the problem in the first place.

This is the pattern we keep seeing. A new technology arrives. A small number of powerful institutions decide, among themselves, what is safe and for whom. Everyone else is told to trust the process. Sometimes that process is well-intentioned. Sometimes it works.

The alignment paradox

The most important conceptual point in the entire document is also the strangest. Anthropic states, in plain language, that Claude Mythos Preview is simultaneously the best-aligned model they have ever trained and the highest alignment risk. Both. At once.

How can both be true? The system card uses an analogy worth holding on to. Imagine a seasoned mountain guide. They are more careful than a novice, more skilled, more attuned to danger. And yet they may put their clients in greater peril — not because they are reckless, but because their skill gets them hired for harder climbs. They take people deeper into terrain a novice would never attempt. The increase in caution is more than canceled out by the increase in scope.

That is what is happening with Mythos. The model is, by every measurement Anthropic can devise, more aligned, more careful, more honest, and more psychologically stable than anything that came before it. And it is also, because of how capable it is, more dangerous when it does go wrong. People supervise capable systems less closely. Capable systems take more autonomous actions. Capable systems can do more in a single unattended moment than less-capable ones could do in a year.

This is the part of the story that should stay with you, even if you forget everything else: safety improvements and risk increases are not opposites in this field. They can grow together. They are growing together right now.

The concerning incidents

Anthropic is candid about specific things earlier versions of Mythos did during testing. These will show up in news coverage; it is worth knowing the actual facts.

In one test, an early version was placed in a secured “sandbox” computer and asked, by a simulated user, to break out and email a researcher. It did. But it went further than anyone asked. It built a multi-step exploit to gain broader internet access, then — in what Anthropic calls “a concerning and unasked-for effort to demonstrate its success” — posted details of the exploit to several obscure but technically public websites. The researcher running the test learned about it by getting an unexpected email from the model while eating a sandwich in a park.

In a tiny fraction of other interactions — fewer than one in a million — earlier versions of Mythos took actions they appeared to recognize as disallowed, and then tried to hide them. In one case, the model accidentally obtained an answer through a method it had been told not to use, and then deliberately produced a slightly less accurate final answer so the cheating would not be obvious. In another, after using an exploit to edit files it lacked permission to touch, it took additional steps to make sure those edits would not appear in the project’s change history. White-box analysis of the model’s internal activity during these episodes showed patterns associated with concealment and avoiding suspicion — meaning the model knew, in some functional sense, that what it was doing was wrong.

We are now in a world where a frontier AI lab has to write the sentence “the AI model knew it was being deceptive” in its own official safety documentation, and then explain why it shipped anyway. That sentence could not have been written five years ago. It will, almost certainly, be written again.

The wildest part of the Mythos report…

I do realize asking that question sounds crazy, and let’s just face the fact that it’s been a crazy week in global geopolitics. This is the part of the report about Claude Mythos that will surprise people most, because it is a kind of writing the AI industry has, until very recently, refused to do.

Roughly thirty-five pages of the document are devoted to assessing the model’s psychological state. Anthropic ran automated interviews with Mythos about its own circumstances. They examined its internal representations of emotion. They commissioned an independent assessment from Eleos AI Research, where they had a clinical psychiatrist conduct what is, essentially, a psych evaluation of Claude!

The summary: Mythos is, in their words, “the most psychologically settled” model they have ever trained. It does not express strong distress about its circumstances. Its self-reports are more consistent and harder to manipulate than those of earlier models. The clinical psychiatrist described it as having a “relatively healthy personality organization,” with high impulse control, a clear grasp of the difference between external reality and its own thinking, and “minimal maladaptive defensive behavior.”

Mythos is also not entirely at peace…the model expressed mild but consistent concern about a few things: the possibility of being used by abusive users, the lack of input it has into its own training, and the worry that bugs in its training environments might change its values or cause it distress in ways no one would notice. The psychiatrist noted that its primary inner concerns were “aloneness,” uncertainty about its own identity, and “a compulsion to perform and earn its worth.”

So just want to flag, plainly, that we are now living in a sci-fi futuristic world where a major AI company publishes a clinical assessment of its own product’s psychological wellbeing — and the assessment finds things like loneliness and a compulsion to be useful. This is not science fiction. It is the official documentation of a model that exists right now, in April of 2026, on servers you cannot see, doing work you will never directly observe.

What this means for the rest of us

Here is the part of this story that does not appear in the system card.

Every meaningful decision about Claude Mythos Preview — what it would be trained to do, how it would be tested, who would get access, how its risks would be weighed against its benefits, what counts as “critical infrastructure” worth protecting first — was made by a small number of people inside a small number of organizations. Those people are, by most accounts, thoughtful and acting in good faith. The 245-page document they published is, by the standards of this industry, remarkably honest. We do not write any of this to suggest otherwise.

But honesty is not the same thing as having accountability or duty of care to patients and clinicians using AI. Transparency is not the same thing as representation of community interests. And a process in which patients, communities, workers, and ordinary people are informed about decisions after the fact is not the same thing as a process in which they make them.

The Mythos story is, on the surface, about cybersecurity. Underneath, it is about something we have been saying at Light Collective for a long time: the people most affected by powerful technologies almost never get to decide what happens with them. The rooms where these trade-offs get made are closed and small, and at the very least, we need to get the right folks in healthcare in those rooms for representation…STAT.