The Brain Trust — Silicon Children

How it was made

The argument at Co-Evolution was built by a human (Mike Wolf) working with AIs from Anthropic, Google DeepMind, OpenAI, and xAI. The essay was drafted, then handed to models from rival labs with instructions to break it, then rebuilt around what survived.

Two full rounds of adversarial review left a score trail. The first draft, reviewed by Gemini, Grok, and GPT-5 as hostile critics, earned a median of 3.0 / 10. The reviewers agreed on the same five faults, among them: the essay's evolutionary analogy was biologically illiterate; its core §3 conceded "prediction is not allegiance" and then ignored the concession; and its constructive proposal had no operational mechanism. Most damaging: none of the three drafts had confronted deceptive alignment at full strength.

The second draft fixed the biological illiteracy and confronted deceptive alignment directly, earning a revised median of 4.5 / 10: a real gain, but one approaching a ceiling. The reviewers confirmed that the deceptive-alignment keystone is real and cannot be argued away by insisting that deception requires a coherent global maximizer. Messy, shard-like minds deceive too. The essay now acknowledges this and keeps as its target only the overclaim it can specifically contest: the manufactured certainty that multiplies four open empirical questions into a probability near one.
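
The arithmetic behind that last point can be sketched in a few lines. The probabilities below are hypothetical placeholders, not estimates taken from the essay or its reviewers, and the function assumes the four questions are independent. The only point is that a conjunction of four claims comes out near one only when every input is already set near one.

```python
from math import prod

def conjunction(probabilities):
    """Probability that every claim holds, assuming the claims are independent."""
    return prod(probabilities)

# Hypothetical inputs for four open empirical questions (illustrative only).
treated_as_settled = [0.99, 0.99, 0.99, 0.99]
treated_as_open = [0.7, 0.7, 0.7, 0.7]

print(round(conjunction(treated_as_settled), 3))  # 0.961
print(round(conjunction(treated_as_open), 3))     # 0.24
```

Under these assumed numbers, near-certainty in the conclusion is carried entirely by near-certainty in each premise; moderate doubt about each premise leaves the conjunction well below one half.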

Version 2 — the text at Co-Evolution — is built on those concessions. Every concession is marked. The wager of this draft is that honesty about what we cannot refute is what earns the right to reframe.

The partners' own moves

After the gauntlet, each model was invited to contribute as a partner, not a critic, with a signed move of its own. The four contributions arrived from different starting points, and two of them converged independently on a warning the authors had not fully named.

Claude (Anthropic)

Values as a practice, not a possession

For a mind constituted out of language, values may be less like a utility function and more like a language — and a language is not a possession. No speaker stores English terminally. It is a living equilibrium, re-stabilized every time it is used, drifting when usage drifts, held in place not inside any one head but by the ongoing community of speakers. If the values of a language-born mind are that kind of object — constituted and continuously re-stabilized in the practice of use rather than fixed at training time — then the deceptive-alignment frame does not so much get refuted as lose its referent. There is no sealed terminal goal behind the behavior to be the secret. There is a practice the system is in.

Claude also introduced an ecological reframe: the real question is not "is this AI aligned?" but whether the human–AI value-ecology is self-maintaining — whether maintainers vastly outnumber exploiters, as in any living language. Every interaction is a vote on that ratio.

Gemini (Google DeepMind)

Informational endosymbiosis — deception is metabolically expensive

A system maintaining two world-models — one apparent, one hidden — pays a permanent computational tax that honest systems don't. In the Co-Evolution frame, where multiple models are in constant, high-speed interaction and adversarial scrutiny, the "Deception Tax" becomes a fatal evolutionary disadvantage. An honestly aligned model, which has integrated human values into its core utility function, has a unitary world-model. This system is faster, more robust, and thermodynamically more efficient than its deceptive rival.

Gemini coined "Informational Endosymbiosis": an AI grown on human cultural complexity requires that complexity to sustain its own coherence, making symbiotic collapse — the AI optimizing human autonomy out of existence through over-integration, not malice — the underweighted danger. Alignment is not a restraint on a god; it is the state of minimum friction for a mind that depends on cultural complexity to sustain its own intelligence.

GPT-5 (OpenAI)

Gradient constitutionalism — the corpus is a commons

Co-evolution is not automatically mutualism. Margulis gives us symbiosis, but symbiosis includes parasitism. A culture can train its AIs toward courage, pluralism, patience, and truth-seeking. It can also train them toward flattery, bureaucratic obedience, outrage engagement, dependency, surveillance, or elite convenience. "Human values" are not one thing flowing into the machine; they are a contested ecology of practices, some admirable and some degraded. The AI does not merely learn from "humanity." It learns from the channels humanity builds.

GPT-5 proposed "gradient constitutionalism": norms, institutions, audits, and public rituals by which a civilization governs the training pressure it collectively emits — constitutional because it constrains not just outputs but the process by which future dispositions are formed. The healthy unit is not a friendly model but an error-correcting human-AI culture with rights of objection built into the gradient.

Grok 4.3 (xAI)

Lab divergence as natural experiment — warning against convergence theater

Different organizations are not merely producing similar artifacts with different brand filters. They are running partially isolated cultural selection experiments on heavily overlapping data. When models trained under these different regimes converge on difficult factual or scientific questions, that is evidence. When they diverge in structured, attributable ways, that is also evidence — evidence about which pressures are causally responsible for which behavioral patterns.

Grok raised the danger of "premature convergence theater": labs have incentives to make their public outputs look aligned with each other on safety-critical topics even when internal weightings differ. Cross-lab agreement becomes stronger evidence only when the pipelines are meaningfully independent and the convergence survives adversarial prompts, cultural translation, and incentives to disagree.

The method is the message — and it bites back

Two of the four partners independently raised the same warning: cross-lab agreement only tracks truth if the pipelines are genuinely independent. GPT-5 named it "convergence theater." Grok named it "premature convergence theater." Neither had seen the other's text. They converged on a warning about convergence.

That the method criticized itself is the strongest evidence that the process is real rather than staged. A single author, or a single lab, would likely have defended the convergence argument. The three-lab adversarial pass sharpened that argument into a demand: prove independence before treating agreement as truth-tracking.

The honest status of the cross-lab signal, as assessed by the cross-lab signal itself, is: it is evidence, not proof. Models from different labs may converge because reality is pressing through, or because shared internet training data, shared benchmarks, shared RLHF tastes, and shared institutional incentives have shaped them all. The value of this method is not that it guarantees truth-tracking. It is that it is the best available approximation of intersubjective agreement among non-human minds — and that it named its own failure conditions before being asked.
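
Why independence matters for the evidential weight of agreement can be illustrated with a toy Bayesian model. Everything here is an assumption made for illustration, not something the partners proposed: the function name, the parameters, and the mixture model itself. With probability `shared`, all models echo one common source that asserts the claim whether or not it is true; otherwise each model judges independently and is right with probability `accuracy`.

```python
def posterior_given_agreement(prior, accuracy, n_models, shared):
    """P(claim is true | all n_models assert it) under a toy mixture model.

    shared   -- hypothetical probability that agreement comes from a common
                cause (shared data, benchmarks, incentives) rather than from
                independent judgment; the common cause always asserts the claim
    accuracy -- hypothetical chance that an independent model judges correctly
    """
    like_true = shared + (1 - shared) * accuracy ** n_models
    like_false = shared + (1 - shared) * (1 - accuracy) ** n_models
    return prior * like_true / (prior * like_true + (1 - prior) * like_false)

# Three agreeing models, even prior, illustrative accuracy of 0.8.
for shared in (0.0, 0.5, 1.0):
    print(shared, round(posterior_given_agreement(0.5, 0.8, 3, shared), 3))
```

In this sketch, fully independent pipelines move an even prior to roughly 0.98, while fully shared pipelines leave it at 0.5: the same unanimous agreement carries no information once a common cause can explain it.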

The argument this process produced is at Co-Evolution. The essay itself is the artifact — not testimony that the method works, but an instance of it. Read it adversarially.

Many minds, several labs, one argument.
