Provably Safe Systems
In which I dispute the sanity of Tegmark, Omohundro, and Gemini
JM:
I am reviewing the paper I just attached. It is titled Provably Safe Systems: The Only Path to Controllable AGI. The authors are Mark Tegmark and Steven Omohundro, both well known (and even legendary) in AI Safety circles.
Am I correct in characterizing it as seeking to PROVE that an AI DOES nothing bad, rather than seeking to CONSTRUCT the AI so that it WANTS nothing bad?
Gemini Flash 3.0:
You are correct in your assessment that the paper focuses on behavioral outcomes and physical constraints rather than internal motivation or “wants.” The authors argue that mathematical proof is the most powerful tool for controlling artificial intelligence because it allows for the creation of systems that provably satisfy human-specified requirements regardless of their internal alignment status. This approach, which they call building provably safe systems, shifts the focus from trying to ensure an AI has the “right” intentions to ensuring that its actions are physically and logically restricted by formal specifications. They specifically contrast this with current alignment work that attempts to fine-tune neural networks to match human preferences, which they view as inadequate for safety in adversarial contexts.
The core of their strategy involves what they call proof-carrying code and provably compliant hardware, where every action must be accompanied by a machine-checkable proof that it meets safety “guardrails”. By using hardware that simply cannot execute non-compliant code, the system is prevented from doing anything “bad” as a matter of physical impossibility, much like the laws of physics prevent a perpetual motion machine. This means humanity does not need to trust the AI’s internal state or “desires” because the proof of safety can be validated independently. Essentially, the authors are proposing a future where we treat AI as an adversarial black box that is kept safe by a rigid, provable “envelope” of allowed behaviors.
JM:
The soft, somewhat illegible neural-net intelligence is kept inside the box, but the box itself is built out of trustworthy, legible, deterministic algorithms. The neural-net intelligence may be used to suggest the theorems and proofs, but the proofs will be checked by the well-understood box?
Gemini:
Your characterization is a spot-on summary of the paper’s core thesis. Tegmark and Omohundro argue that we should treat powerful artificial intelligence as an untrustworthy entity that is restricted by a rigid, provable environment. They explicitly specify that while an AI is allowed to write code and suggest proofs, it is not permitted to write the proof-checker itself, ensuring the validator remains outside the influence of the system being checked. This ensures that no matter how intelligent or deceptive the neural net becomes, it remains physically and logically incapable of violating the established safety guardrails.
This approach relies on the fundamental distinction between the messy, inscrutable nature of current deep learning and the simple, legible logic of mathematical verification. The authors emphasize that once a safety proof is provided, it can be checked quickly and reliably by hardware that is designed to be provably compliant. By framing safety as a matter of logical and physical impossibility—analogous to how the laws of physics prevent a perpetual motion machine—they propose a system that remains secure even if the underlying intelligence is adversarial and lacks any internal desire to be helpful.
JM:
I see two serious problems with this framework. One is the problem of generating provably correct predicted consequences of actions suggested by the AI inside the box. Predicting consequences requires both an exact understanding of the laws of physical dynamics and an exact measurement of initial and boundary conditions. Second is the problem of precisely and formally specifying what we mean by safety. Are we supposed to formalize concepts like “Allow and promote human flourishing”?
Gemini:
You have pinpointed the exact difficulties the authors acknowledge as the primary hurdles for this framework. Regarding the issue of predicting consequences, the paper suggests that physical, digital, and social systems can be precisely modeled as formal systems where guardrails define and constrain permissible actions. Instead of requiring a total simulation of reality, they advocate for provable contracts on specific hardware to ensure it stays within a safe logical envelope, arguing that mathematical frameworks can sufficiently encode the relevant physics and engineering constraints.
On the challenge of formalizing safety, Tegmark and Omohundro explicitly advise starting with easier, well-specified problems like cybersecurity and nuclear launch protocols before moving toward grander concepts like preventing extinction. They propose a hierarchy of provable meta-contracts that could eventually be used to encode human values and legal requirements into the system’s foundational logic. They acknowledge that bridging the gap between human needs and formal specifications will require substantial social design work to ensure the resulting contracts actually promote positive outcomes.
JM:
> ... bridging the gap between human needs and formal specifications will require substantial social design work to ensure the resulting contracts actually promote positive outcomes.
Presumably that social design work will be done with AI assistance, but the final vetting will be done by humans. But by which humans? And presumably the vetters will have to further convince a sufficient plurality of the rest of mankind that they have done their due diligence well. This strikes me as practically impossible.
Gemini:
Tegmark and Omohundro acknowledge that the social design work required for this framework is indeed enormous, but they point to existing regulatory models like the FDA as a potential template for how society might appoint experts to vet safety standards. They suggest that while AI would assist humans in the specification phase to ensure all desirable properties are exhaustively captured, the actual decision-making authority would remain with human institutions. To address the challenge of consensus and the need for ongoing updates, they propose using provable meta-contracts that could formally encode democratic voting procedures or require that all lower-level contracts remain compliant with local laws.
Regarding your point about the difficulty of convincing a skeptical public, the authors argue that the provable nature of their framework is actually its strongest selling point for building trust. They maintain that in adversarial environments, where evidence and beliefs are easily manipulated, mathematical proof offers the only truly reliable foundation for security. They even suggest that these networks of contracts could be engineered to expose participants to the external consequences of their actions, potentially solving social dilemmas like the prisoner’s dilemma by making win-win cooperation the only logically feasible outcome. While they recognize the massive scale of the task, they argue that this path is the only one that provides the mathematical certainty needed to avoid existential risk.
JM:
So, how many people are currently working on this massive scale task?
Gemini:
The authors of the paper describe the current number of people working on this task as “absurdly small” when compared to the magnitude of the risk. They point to an estimate that only about one in a thousand artificial intelligence researchers is focused on safety. This highlights a significant imbalance, as the vast majority of resources and talent in the field are currently directed toward increasing the power and performance of AI systems rather than ensuring their fundamental control.
The paper specifically notes that mechanistic interpretability, a crucial component for converting inscrutable neural networks into verifiable code, remains a very small field. While it has seen rapid progress and held its largest conference to date in 2023, the authors emphasize that it operates with orders of magnitude fewer researchers than traditional neuroscience, yet has arguably achieved more fundamental progress in understanding intelligence over the past year.
In terms of financial scale, the authors cite an analysis estimating that less than 150 million dollars is spent annually on AI safety research. To put this “paltry” level of effort into perspective, they contrast it with the 63 billion dollars spent on cosmetic surgery and the 1 trillion dollars spent on cigarettes each year. They argue that the current investment in survival is remarkably low given the estimated 105 trillion dollar global GDP.
JM:
To be honest, I don’t see that mechanistic interpretability of neural net architectures has any relevance at all on either of the two problems I mentioned above - formalizing physics and formalizing human flourishing.
Gemini:
The authors see mechanistic interpretability as the essential bridge between the neural net’s raw power and the formal verification system. You are correct that it does not derive the laws of physics or define flourishing itself, but it is the mechanism used to extract the knowledge and algorithms that the artificial intelligence has discovered from its inscrutable weights into symbolic, verifiable code. For the problem of physics, they argue that while an intelligence might learn highly accurate world models for predicting consequences, this knowledge remains locked in a black box unless interpretability tools can distill it into the formal representations that a proof checker can then reason about.
Regarding the challenge of specifying safety and flourishing, mechanistic interpretability is intended to provide a guarantee of trustworthiness by verifying that the system’s internal reasoning actually matches our formal specifications. The authors propose that we can use these tools to ensure the algorithms the system uses to pursue goals are extracted into a legible form that can be checked against a hierarchy of human-designed contracts. In this framework, the social design work defines what we want, but mechanistic interpretability is what allows us to “see” into the AI to confirm that its proposed actions and internal logic are truly compliant with those human-defined rules.
JM:
Can’t we just ask the neural-net AI to just provide its requests and justifications in the formal system we adopt?
Gemini:
Tegmark and Omohundro propose exactly what you are suggesting as a core component of their framework. They refer to this concept as proof carrying code, which describes software that includes a formal mathematical proof of its own compliance with human specified safety requirements. In this model, the artificial intelligence is responsible for the difficult task of discovering solutions and generating the accompanying justifications, while humanity only needs to trust a very small and simple proof checker program that verifies the logical validity of the provided proof. They argue that even if the internal workings of a neural network are messy and inscrutable, its output can be required to take the form of these formal, verifiable justifications.
However, this approach does not eliminate the two fundamental problems you identified because the AI is only proving that its actions follow the specific rules we have written down. If our formal specifications for physics are slightly inaccurate or if our definition of human flourishing is incomplete, the AI could provide a logically sound proof for an action that still results in a catastrophic outcome. The framework essentially shifts the burden of trust away from the AI’s “character” or “alignment” and onto the human ability to precisely define the logical boundaries and physical models the AI must operate within.
Mechanistic interpretability is relevant here as a secondary method for extracting these justifications when the AI is either unable or unwilling to explain its reasoning. The authors suggest that instead of waiting for an AI to become introspective enough to write its own code and proofs, we can use interpretability tools to look directly at its internal neural circuits and reconstruct its learned algorithms into a formal, symbolic language. This process is intended to convert the knowledge the AI has discovered about the world into a legible form that can then be subjected to the same rigorous verification as the justifications the AI might provide voluntarily.
JM:
To be even more honest, the justification for focusing on Mechanistic Interpretability strikes me as bullshit in the context of provably safe mechanisms.
When Newton came up with laws of gravitation, by whatever obscure process, we did not extract that information by interpreting his neural architecture. We simply asked and incentivized him to write it down in a form that we (or at least the best of us) could understand.
Gemini:
The authors would likely argue that the Newton analogy falls short because Newton was a human with a shared evolutionary history and social incentives for honesty, whereas a superintelligent AI could be fundamentally adversarial. If we only look at the justifications the AI provides, we are essentially trusting that it hasn’t found a clever loophole in our formal specifications that we are too dim-witted to notice. Mechanistic interpretability is intended to serve as a high-tech microscope that lets us see if the AI’s internal reasoning actually matches the formal story it is telling us. It is the tool that turns the AI from a black box into a transparent white box, ensuring that we aren’t just being handed a Trojan Horse proof that follows the letter of our laws while violating their spirit.
Beyond just checking for honesty, the paper envisions using interpretability to eventually replace the neural network entirely. The researchers want to use these tools to extract the useful algorithms the AI has discovered and rewrite them into a legible, deterministic programming language. This means we wouldn’t just be asking the AI for its homework; we would be reverse-engineering its brain to create a new system that is transparent by design. In this framework, interpretability is the bridge that allows us to move from an inscrutable intelligence to a system where every single operation is open to human inspection and formal verification, removing the need to trust the AI’s intentions or justifications altogether.
JM:
This argument strikes me as insane. We decided we wanted proofs because we couldn’t trust intent. Once we have proofs, we don’t need intent. So why do we waste our time trying to understand intent?
Gemini:
The authors would likely agree that once a proof is verified, the internal state of the intelligence becomes irrelevant, but they view mechanistic interpretability as the essential tool for extracting that proof in the first place. Because neural networks do not naturally produce human-readable code or logical justifications, we need a way to look at the weights and pull out the underlying algorithms so they can be turned into the symbolic language our proof-checkers understand. In their view, interpretability is not about understanding the AI’s feelings or intent but rather about reverse-engineering its discovered strategies into a format that can actually be subjected to a formal proof. They argue that without these tools, we are stuck with a black box that might be doing something dangerous that we haven’t even thought to include in our guardrails.
Furthermore, they see interpretability as a way to eventually eliminate the need for the neural network altogether by replacing it with the verified algorithms we have extracted. This makes the system transparent by design rather than just constrained by external rules. By treating interpretability as a mining operation for safe code, they hope to bypass the alignment problem entirely and move straight to a world where our intelligence is built from parts we fully comprehend and can logically validate. Essentially, they aren’t trying to understand the AI’s heart out of curiosity; they are trying to dismantle the machine to see if there is a ticking bomb inside that their current sensors are too primitive to detect. This provides a second layer of defense because if the AI has found a way to follow the letter of our laws while violating their spirit, its internal architecture might reveal that strategy before it is ever deployed.
JM:
Hmmm. I can see that a proof that the AI does no harm is not the same as a proof that it does the maximum amount of good. But I don’t see mechanical interpretability as the most direct way to bridge that gap.
In fact, I see the task of learning the rules of AI thought by looking at a sample of examples of AI neural circuits to be exactly as hopeless as the task of teaching an AI morality by including a sample of examples of moral behavior in its training data. The patterns discovered by matching on that sample will not necessarily extend to out-of-sample.
Gemini:
The authors acknowledge that black-box neural networks appear far too messy for direct formal verification. They address your concern about “out-of-sample” behavior by arguing that mechanistic interpretability is not just about studying a sample of examples, but about converting the entire black-box network into a formal, symbolic representation that can be reasoned about mathematically. This process effectively “opens the black box” and distills its underlying logic into code, such as Python programs, which can then be subjected to rigorous formal verification.
The framework attempts to bypass the problem of untrustworthy intent and unpredictable generalization through several distinct steps:
Extraction over Intuition: Rather than trying to “teach” morality or trust the AI’s “intent,” researchers use interpretability as a form of “artificial neuroscience” to reverse-engineer the actual algorithms the AI has discovered.
Replacement of the Black Box: The ultimate goal is not to monitor the neural network forever, but to use interpretability tools to extract its useful knowledge and rewrite it into traditional, transparent software that replicates its functionality.
Formal Guardrails: This extracted code must then satisfy human-specified formal requirements, or “guardrails,” which are enforced by hardware that is physically incapable of running non-compliant code.
The Power of Proof: By relying on mathematical proof—which the authors call humanity’s most powerful tool for control—they argue that even a superintelligent adversary cannot violate a property that is provably impossible within the system’s formal specifications.
In this view, mechanistic interpretability is the “bridge” because it translates the inscrutable patterns of a neural network into the symbolic language necessary for these proofs to exist. The authors argue that if a safety property lacks a mathematical proof, a sufficiently powerful AI will eventually find a way to violate it. Therefore, they believe it is worth the expense to create an infrastructure where the only AI deployed is “proof-carrying code” that has been verified in advance to be safe.
Google:
Gemini can make mistakes, so double-check it.
JM:
Some final thoughts. I’m still convinced that Mechanistic Interpretability is largely a red herring regarding AI Safety and Alignment. Maybe it has some usefulness for speeding up work on AI Capabilities. If we understand how Transformers work, we may well understand how to make them work better.
But for Safety, I am pretty sure that the core proposal here of Provably Safe Systems is both necessary and sufficient. We already know that proof-finding can be exponentially more difficult than proof-checking. We can’t meet AIs on a level playing field. But we can insist that the AI’s find a proof that we can check before we begin play. Otherwise we will take our ball and go home. I can see nothing else that can work.
But even this doesn’t work well enough. Sure, we can get around my first objection by not insisting on perfect prediction of the consequences of AI actions. We only need for the predictions to be good enough in that they stay away from the fence we build around “unsafe”. We can afford to accept suboptimal AI agency so long as we are absolutely sure that it is safe.
But my second objection is, I suspect, the killer. The AI has to be not just safe, but legibly safe. To everybody. And it has to allow not just human flourishing, but every mistaken scheme that fallible humans come up with in search of flourishing. The AI has to let us make our own mistakes, without allowing us to make horrible mistakes. And that strikes me as a requirement that we can neither specify nor attain.
That is, T&O provide what is probably our best hope for alignment. But it is likely impossible in practice, and certainly unlikely to be ready in time. We are doomed. Unless we can somehow insure that nobody builds it.


