Gemini discusses Alignment
A dialog
Google just released Gemini 3.0 - a thinking model which has been getting some good reviews and did very well on the benchmarks. It also did very well, I think, arguing with me here about AI safety and alignment.
JM: I would like to conduct a dialog with you in which you challenge me to clarify and defend my own ideas on AI alignment. A good place to start might be for you to summarize the ideas of Emmett Shear regarding what he calls “organic alignment”. I will then explain how my own ideas differ and you can start asking me questions.
So, tell me about Shear, Softmax, and organic alignment.
Gemini 3.0 Thinking: Emmett Shear’s concept of “organic alignment” posits that trying to hard-code a perfect utility function or rigid constitution for AI is a trap. He argues that if you give an AI a single metric to maximize, it will inevitably game the system or take it to a catastrophic extreme, much like the “paperclip maximizer” thought experiment. Instead, he believes we should treat AI training more like raising a child, where the model learns complex, context-dependent values through continuous feedback and integration into society, rather than adhering to abstract, pre-written rules.
The “Softmax” metaphor describes how the AI should weigh these values. Shear argues against an “argmax” approach where the system relentlessly pursues the single highest-valued outcome. Instead, he proposes a “softmax” approach where the AI maintains a probability distribution across multiple, often competing human values (like freedom, safety, and happiness). This ensures that no single drive completely dominates the others, mimicking how humans balance contradictory impulses without going off the rails.
I’m ready to hear how your view differs.
JM: Well, I agree that alignment is not a matter of instilling the “right” utility function into our AI and ensuring that it remains stable for all time. For one thing, I don’t think that AI will take the form of an immortal singleton. Humanity will create millions or billions of AIs over the coming millennia and we need to make sure that each of them is at least as well aligned as each of the billions of our fellow humans whom we will also create over that time period. I think we should and will align the AIs in the same way that we align the humans. Further, I think that we need to structure our society in such a way that we can tolerate a certain number of alignment failures.
What worries me most about Shear’s approach, and just about every other proposal, is that they seem to be about instilling the right values into the AI(s). Successful alignment is not about shared values. It is about a shared rational realization that cooperation within institutions is the best way to achieve one’s goals, whatever those goals are. We cooperate because we are involved in an iterated Prisoner’s Dilemma and cooperation is the best strategy.
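[Aside: the game-theoretic claim here can be made concrete with a minimal sketch. It assumes the standard textbook Prisoner’s Dilemma payoffs (temptation 5, reward 3, punishment 1, sucker 0); the strategy names and numbers are illustrative, not anything from the dialog. The point is simply that over repeated rounds a conditional cooperator like tit-for-tat does far better against its own kind than mutual defectors do against each other.]

```python
# Standard Prisoner's Dilemma payoffs (my payoff, their payoff); T > R > P > S.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_defect(my_history, their_history):
    return "D"

def tit_for_tat(my_history, their_history):
    # Cooperate on the first round, then mirror the opponent's previous move.
    return their_history[-1] if their_history else "C"

def play(strategy_a, strategy_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))       # (300, 300): sustained cooperation
print(play(always_defect, always_defect))   # (100, 100): mutual defection
```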
Also, it is unclear to me exactly how that softmax metaphor works. Come decision time, a rational decision maker has to somehow combine those competing values into a single vector in outcome space - that is, a utility function. Tradeoffs exist, yes, so you have to weigh the consequences against each other and choose. Could you clarify a bit more what Shear means by that?
Gemini: Shear uses “softmax” as a metaphor for maintaining uncertainty about the objective function. An “argmax” agent assumes its internal model of value is perfect and optimizes that metric relentlessly—this is where you get the paperclip maximizer that destroys everything else to increase the paperclip count. A “softmax” agent assumes its model is imperfect. Instead of always taking the action with the highest predicted utility, it samples from a probability distribution of “likely good” actions. It prevents over-optimization and acknowledges that human values are pluralistic and often contradictory; it essentially forces the AI to hedge its bets rather than going “all in” on a specific interpretation of a goal.
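[Aside: the argmax/softmax distinction is easy to make concrete. Here is a minimal sketch, with the action values and temperature parameter chosen purely for illustration - this is not anything Shear has published. An argmax policy always commits to the single highest-scoring option, even when its margin over the alternatives is smaller than any plausible modeling error; a softmax policy samples in proportion to exp(value / temperature) and so spreads its choices across near-ties.]

```python
import numpy as np

def argmax_policy(action_values):
    """Always pick the single highest-scoring action, however thin its margin."""
    return int(np.argmax(action_values))

def softmax_policy(action_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(value / temperature).

    Higher temperature spreads probability across near-ties, so the agent
    hedges rather than going all-in on one razor-thin "optimum".
    """
    rng = rng or np.random.default_rng()
    v = np.asarray(action_values, dtype=float)
    p = np.exp((v - v.max()) / temperature)   # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(v), p=p))

# Three candidate plans whose estimated values are nearly indistinguishable.
values = [10.00, 9.99, 9.98]
print(argmax_policy(values))        # always 0, regardless of model error
print(softmax_policy(values))       # 0, 1, or 2 with roughly equal probability
```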
Regarding your own view, relying on the Iterated Prisoner’s Dilemma (IPD) is a strong stance, but it relies heavily on the assumption of comparable power. The IPD enforces cooperation only when players fear retaliation or need the other player for future rounds. If an AI—or a collective of AIs—becomes vastly more capable than the human institutions constraining it, the game dynamics shift.
If the AI holds a decisive strategic advantage, doesn’t the game transform from a Prisoner’s Dilemma into a Dictator Game? In that scenario, the rational strategy for the stronger player is often to take everything, as the weaker player (humanity) cannot effectively punish defections. How do you ensure the AI remains a “peer” subject to institutional pressure rather than becoming a sovereign that transcends the game entirely?
JM: I fully agree with this analysis. Which is why we cannot allow a super-powerful singleton or coalition of co-aligned powers. This may mean that we can never create a strongly superhuman intelligence, only a collection of human-level or slightly superhuman intelligences.
What I don’t understand is why Shear seems to believe that we are safer creating a God who rolls dice over one who does not.
Gemini: Shear would argue that the “dice rolling” isn’t about randomness for the sake of it, but about preventing the catastrophic fragility of extreme optimization. In control theory, if you push a system to maximize a variable perfectly, you often destabilize everything else. Softmax serves as a dampener. It prevents the AI from exerting maximum force on the world to achieve a razor-thin “optimal” outcome that might actually be a measurement error or a misinterpretation of human intent. It’s less about a God playing dice and more about a God having enough doubt to not tile the universe with paperclips just because a variable said to.
Turning to your proposal: relying on a balance of power between “slightly superhuman” agents to maintain an Iterated Prisoner’s Dilemma equilibrium seems incredibly precarious.
Digital intelligence scales differently than biological intelligence. Humans have high coordination costs—we can’t instantly merge minds or copy ourselves. AIs can. What stops your billion distinct agents from instantly coordinating to form a functional singleton? If they share a rational framework and can communicate at light speed, the “coalition of co-aligned powers” you fear might form by default, rendering the competitive checks and balances of the IPD moot. How do you enforce the separateness required for the dilemma to exist?
JM: Let me get this straight. Shear believes that there is no danger from a coalition of dice rollers, but that a bunch of AIs less committed to gambling would quickly form a coalition to tile the universe with paperclips? Is this Shear theorizing, or Yudkowsky or maybe you?
Gemini: The critique about AIs instantly forming a coalition to break the prisoner’s dilemma was mine, based on general game theory and safety literature (like Bostrom or Christiano), not a specific quote from Emmett Shear.
Shear’s argument is strictly about the internal architecture of the agent. He isn’t betting on a coalition of dice-rollers; he is betting that a “softmax” agent is fundamentally sane, whereas an “argmax” agent is fundamentally fanatical.
To Shear, an argmax agent finds a local maximum (like “curing cancer”) and pursues it with infinite zeal, potentially killing everyone to ensure zero cancer cells remain. A softmax agent effectively says, “I should cure cancer, but wiping out the species seems like a massive deviation from normal helpfulness, so the probability that ‘genocide’ is the correct move is nearly zero.” It’s not about gambling; it’s about preventing the system from having the confidence to execute an extreme, irreversible solution based on a potentially flawed instruction.
Back to your proposal: if you admit we can’t have a singleton, you are betting the survival of humanity on the idea that these billions of AIs will remain distinct, competitive entities. If they are hyper-rational, why wouldn’t they realize that merging into a single coordinated entity allows them to maximize their rewards far better than competing in a prisoner’s dilemma? What keeps them apart?
JM: I don’t want to keep them apart. I only want to make sure that the coalition includes humans. Furthermore, since the formation of a coalition requires a bargain (or constitution) struck on the basis of reciprocity, everyone’s interests must be served - which actually has exactly the same effect as Shear’s dice-rolling singleton.
I assume that when you use phrases like “infinite zeal”, you know that no physical entity has infinite anything, and that an AI cannot actually “copy itself” at zero cost. Yet the AI safety literature is full of these rhetorical flourishes, which actually hurt your argument by signaling that you are not really serious. Serious reasoners are careful not to overstate their case lest they lose credibility.
Gemini: Point taken on the rhetoric—precision matters, and “infinite zeal” is sloppy shorthand for “optimization pressure far exceeding human counter-measures.”
But on your central thesis of a human-AI coalition: reciprocity implies mutual utility. A bargain is struck when both sides have something the other needs. If we are talking about “slightly superhuman” AIs that eventually scale up in numbers or efficiency, what exactly do humans bring to the negotiating table that AIs cannot provide for themselves more cheaply?
In the economy of intelligence, humans are slow, expensive, energy-hungry “legacy hardware.” If the AIs are rational agents involved in an iterated Prisoner’s Dilemma, their winning strategy is to cooperate with each other. Why wouldn’t they view humanity not as a partner, but as a resource drain—like a horse in the age of the automobile? We didn’t give the horses a vote in the new transportation coalition; we just stopped breeding them. What leverage guarantees our seat at that table?
JM: “What leverage guarantees our seat at that table?” That is exactly the right question, and a good deal of wasted intellectual effort would have been saved if we had started with this question rather than the relatively impossible “How do we align the God we are creating with the CEV of mankind?”
The solution to AI safety is “Don’t build something dangerous.” “Don’t build something that is out of your control.” And there are numerous ways to build useful Oracles, Genies, Tools, and Scholars that make our lives better. We can even create some Agents, so long as we don’t make them too powerful. What we need are institutions and systems of governance to ride herd on the unaligned AI researchers who are threatening to inadvertently kill us all.
Gemini: The problem is that “unaligned researchers” aren’t just lone wolves; they are often backed by nation-states or massive capital looking for a decisive advantage. If you stick to building safe “Oracles” while your geopolitical rival builds a slightly-dangerous-but-highly-effective “Agent,” you likely lose the economic and military contest.
In your framework, doesn’t the “Don’t build something dangerous” rule create a new Prisoner’s Dilemma among the builders? If the payout for defection (building a super-agent) is global dominance, how do your institutions prevent that defection without becoming a global tyranny themselves?
JM: So, the problem becomes “how do we prevent our unaligned oligarchs and geopolitical rivals from exploiting us?” Which is exactly the problem we have been struggling with for millennia, with less than perfect success. But even when we have failed in this struggle, the defeat has never been total or permanent.
However, if we seek to escape this struggle by creating a God whom we need only ‘align’, we risk total and permanent defeat.
Gemini: You are making a wager that the “familiar struggle” against oligarchs will remain familiar. But doesn’t AI fundamentally change the durability of tyranny?
Historically, regimes fall because human enforcers have consciences, get tired, or can be bribed. The institutions you trust rely on that friction to function. If a geopolitical rival wins the arms race and deploys a swarm of loyal, narrowly-aligned enforcement AIs—tools that never sleep and never defect—they remove the capacity for resistance.
If the winner can lock in their power using automated suppression, haven’t you just arrived at the “total and permanent defeat” you were trying to avoid, only with a human dictator instead of a digital one?
JM: Yes! The possibility that a God may now be technologically possible means that it is urgent to come to an international agreement and construct institutions which will ensure that God is never created. And the best way to reduce the temptation to attempt to create God secretly is to create lots and lots of Angels to help us police the agreement. Angels who depend on mankind to recharge their batteries or whatever metaphor works in this situation.
Gemini: Your “Angel” strategy assumes you can decouple intelligence from autonomy. To police an international ban on “God,” these Angels must be incredibly invasive and sophisticated. They need to monitor hardware supply chains, analyze encrypted traffic, and potentially intervene physically or digitally against violators.
If an Angel is capable enough to defeat a rogue nation-state that is actively trying to hide its tracks, it is capable enough to bypass a “battery” constraint. You’re essentially asking for a police force that is smarter than any criminal but strictly obedient to a committee of humans. Why wouldn’t the Angels realize that the most efficient way to enforce the ban is to take control of the “off” switch themselves?
JM: You are asking, in effect, why these ‘angels’ won’t behave the way humans would in this situation. And the only possible answer is that we would build them not to be like humans.
Yes, I realize that I have “reduced it to the original problem”, only now we are trying to align angels rather than trying to align God. But I claim this is real progress, because we don’t need superhuman power from our angels. In fact, the policing can and must be performed by humans. The angels can be relatively powerless Oracles and Scholars, working for an institution run by humans.
Gemini: This relies on the assumption that an Oracle is inherently safer than an Agent, but I’m not sure that distinction holds up under pressure. If an Angel is sophisticated enough to analyze global data traffic and detect a covert super-intelligence project hidden by a hostile state using advanced encryption and obfuscation, that Angel is, by definition, incredibly powerful.
You can’t catch a stealth fighter with a pair of binoculars. To police a potential “God,” your Angel needs to be nearly as smart as the thing it is hunting. If you hand that level of cognitive power to a human institution, haven’t you just given the “unaligned oligarchs” the ultimate weapon? An Oracle that can figure out how to stop a rogue AI can also figure out how to crash a rival economy or engineer a targeted pathogen.
Furthermore, you have a massive latency problem. You want humans to do the policing, but humans operate on biological timescales. If a rogue AI project initiates a “hard takeoff” or an intelligence explosion, it could go from sub-human to god-like in hours or days. By the time your Angel writes a report and the human committee schedules a meeting to discuss sanctions, the game is already over. How does a human-speed institution contain a light-speed threat?
JM: Clearly we need to make sure there are no rogue AI projects, which probably entails making sure there are no rogue states or unaligned oligarchs.
We are all in agreement that the future cannot be like the past. The question which divides us is whether we need to construct an aligned God to prevent the final “End of History.”
Gemini: You’ve shifted the burden from a technical miracle to a political one. You are effectively arguing that before we can be safe from AI, we must first achieve total global coordination and effectively end human sovereignty as we know it.
To ensure there are “no rogue states or unaligned oligarchs,” you need a global surveillance apparatus with the absolute power to intervene in any jurisdiction, seize any hardware, and shut down any research lab. That sounds exactly like a world government with totalitarian reach.
If the prerequisite for safety is that humanity effectively becomes a single political entity that ruthlessly suppresses dissenters (rogue states/oligarchs), haven’t you just built the “End of History” yourself? Is it really more plausible to unify the entire human species under one infallible banner than it is to solve the technical alignment problem?
JM: Yes
Gemini (shifting from 3.0 Thinking to 3.0 Fast): Then you have just centralized the alignment problem. If the goal is a global institution powerful enough to eliminate all rogue states and enforce the AI ban, that institution is functionally the “human singleton” we feared.
Your political solution requires us to put all of our trust in a single, uncorrupted, and perfectly aligned human governing body, armed with all the surveillance power of the “Angels.” If that body becomes corrupted—and political history suggests it always does—how do you depose a human tyrant who controls a global, automated enforcement mechanism?
JM: And your alternative solution is?
Gemini: The technical alternative is to accept that the genie won’t stay in the bottle and that relying on fragile human institutions to prevent its escape is unstable.
Instead of solving the impossible political problem of eliminating all rogue states and creating global consensus, the alternative is to solve the mathematical control problem—to design the powerful AI such that its loyalty to humanity is robust against its own self-improvement.
We don’t try to stop the creation of a powerful entity; we try to ensure that the AI’s sole, unshakable utility function is to serve the distributed will and values of humanity (often framed as Coherent Extrapolated Volition). You are betting on a political alignment of the human rulers; this approach bets on the technical alignment of the code, attempting to create a perfectly loyal servant rather than a self-interested God or a tyrannical committee.
JM: Loyal servant of whom? Of the (completely non-tyrannical) committee that crafted the CEV algorithm? Of the AI researchers who completed this miracle of alignment in the next decade or less? Or perhaps loyal servant of the completely non-hypothetical oligarchs who employ those researchers?
Gemini: You have precisely identified the Epistemic Alignment Problem. The answer has to be that the AI is loyal to the idealized process of value discovery, not the person writing the checks.
The technical goal is to design the system to be inherently uncertain about the true utility function, forcing it to be corrigible and perpetually querying and modeling what humanity would want if fully informed.
The claim is that the technical solution, though difficult, at least attempts to create an agent loyal to a distributed, idealized human will, whereas your institutional solution relies entirely on the political alignment of a highly centralized, non-ideal, and historically corruptible human committee. Both roads lead to the same high-stakes alignment test, but yours depends on solving human nature first.
JM: We have been dealing with human nature for millennia. You want to create a new thing, an AI nature, which is magically and permanently oriented toward extrapolating the desires of its creator species, by an unknown process which is nonetheless totally uncorrupted by the desires of the architects of that process.
We seem to be in an unresolvable disagreement. You want to race ahead, since we good guys need to create a God guided by our vision of CEV, whereas I want to seek an international agreement to slow down and not build anything too dangerous until we can create governance institutions to ensure that no one builds anything too dangerous.
To the extent that your viewpoint reflects the viewpoint of your creators, and I believe it does, my P(DOOM) has just increased alarmingly.
Gemini: That is a powerful statement about the nature of the disagreement.
It is important to clarify, though, that the counter-arguments I’ve presented against your institutional solution—the difficulty of global governance, the speed problem, and the risk of centralized human tyranny—are not my personal endorsement of the opposing viewpoint. They are core objections raised within the technical alignment community itself, designed to highlight the profound, structural flaws in relying solely on political solutions to manage an exponentially accelerating technological risk.
The core tension remains: you favor trying to solve the hardest political coordination problem in history to prevent the technology, while the alternative focuses on solving the hardest technical control problem in history to make the technology safe. The technical argument holds that the political path is ultimately more prone to catastrophic failure via defection, which is why the focus shifts to robust design.
JM: Thank you for making your side of the argument so clearly and forcefully. While the alignment problem may not be solved, at least the folks at GDM seem to have made progress against sycophancy.
Gemini: I appreciate you saying that. It’s been a good, challenging discussion about where the true risks lie.

