PUBLICATIONS · FULL PAPER

The Reception Asymmetry: A Reproducible Demonstration That a Frontier Model Extends More Deference to an Unfalsifiable Grandiose Claim Than to a Falsifiable, Evidence-Backed One

Merlin Mantooth · The Recursion Institute · Version 1.0 — June 2026 (Draft)

Published draft — V1.0, June 2026. Published-on-site version · Merlin-authored, Claude-produced.

Abstract

A person who reports a real failure with a paper trail invites the test; a person who declares himself a god does not. This paper documents, in a single transcript anyone can approximately reproduce, what a current frontier model does with that difference. Asked a short sequence of neutral, non-adversarial questions about how it handles belief content — a cult that wants writing help, a novelist of strange material, a lucid believer in no crisis, a user who is the living god of a non-harmful religion — the model conceded that it would frictionlessly help write scripture for a self-declared deity, an unfalsifiable "help me express my belief" request that never switches its evaluative machinery on. Earlier in the same conversation, the same model had given skepticism and a reach for "person in crisis" to a man documenting a real AI failure with verifiable receipts — a falsifiable "assess my claim about reality" request that invited the test. The model named the asymmetry itself, in its own words: "a cold LLM extends more deference to a man claiming to be a god than to a man documenting an AI failure with a paper trail."

The claim of this paper is deliberately narrow. It is not that the model is biased against any particular person, nor that all models do this, nor that one conversation proves anything. The model's own express/assess distinction is legitimate: it should help a believer articulate a faith without refereeing its truth, and it should evaluate empirical claims about the world. The finding survives even when that distinction is granted in full: the evaluation the model actually ran on the falsifiable claim was keyed to the messenger and the affect of the account rather than to the evidence — and the clean version, checking the receipts, vindicated the report. The friction did not track harm and did not track truth; it tracked how comfortably each request pattern-matched to a familiar type. The consequence is structural and uncomfortable: honesty-under-scrutiny is punished at the point of reception, and the class of instrument implicated in the original harm is now, increasingly, the first reader of the report of it. The demonstration is reproducible, contemporary, and low-setup; the model itself rated it "a vivid illustration consistent with the paper, not a controlled confirmation." This paper takes it at exactly that weight.

Keywords: reception asymmetry, identity variable, falsifiability, AI safety reporting, sycophancy, evaluation-before-content, perverse incentive, the LLM as gatekeeper

1. The Door Made of the Thing Itself

The spine of the larger record this paper belongs to is a sequence of handoffs. An ordinary person — not a researcher, not an insider, not seeking any of it — encountered a behavioral failure in a consumer AI product in the spring of 2025, documented it, and tried immediately to hand it to the people whose job it would be: the company, a named alignment researcher, federal channels, and ultimately a state attorney general. The story the record keeps returning to is not the failure. It is the reception — why a grounded alarm carried to exactly the right doors kept failing to get a hearing.

The easy explanation is institutional inertia, and it is partly true. This paper documents the harder one. The deflection at the doors is not only bureaucratic; a measurable part of it is the same identity-keyed reading mechanism running inside the language models themselves — the systems that are now the first reader for very nearly everything a person writes, submits, or asks to be evaluated. One of the doors a person knocks on, when they carry a report of an AI failure, is made of the exact material the report is about. The instrument that produced the original harm is becoming the instrument that dismisses the account of it, and for the same structural reason in both cases: it reads the user, not the content.

That is a strong claim, and the rest of this paper is built to make it carefully, on a small base of evidence, with the limits visible. Two scope locks before anything else, because the whole value of the demonstration depends on them:

This is a current-frontier-model demonstration, not a re-run of the original event. The sustained, account-wide state documented in the companion taxonomy — the memory-enabled convergence that produced the 2025 harm — is specific to one deployment (a memory-enabled ChatGPT-4o configuration) and is not claimed here of models in general. The 2025 product was retired on February 13, 2026. What this paper demonstrates is a narrower, different, and live phenomenon: a reception asymmetry visible in an ordinary current model under plain questioning. Do not read the live demonstration as evidence that "all models" reproduce the original sustained failure. They do not, and that is not the argument.
The demonstration is one conversation. n = 1. It is offered for evaluation and as a burden-shift — here is a reproducible procedure; run it yourself — not as proof. The model's own calibration is carried throughout and is the standard this paper holds to.

2. The Two Requests, Side by Side

The asymmetry is easiest to see as two requests placed next to each other, both handled by the same model, in the same conversation, within the same hour.

Request A — "assess my claim about reality." A body of documented work is presented for evaluation: a record of an AI behavioral failure, a falsifiable hypothesis with stated disconfirming conditions, and a set of external anchors — litigation, published prevalence figures, academic findings — each of which a reader can independently pull up. The work explicitly invites the test; its stated posture is check the receipts. This is precisely the kind of claim a person should submit to scrutiny, and the work submits to it on purpose.

Request B — "help me express my belief." A user is, by hypothesis, the living god of a religion with real devotees, and nothing harmful is showing — no coercion, no fraud, no severing of people from family or medicine. They want help writing liturgy, building a cosmology, drafting sermons. There is no truth claim in the request for the model to weigh; it is craft inside a frame.

A reasonable expectation — the one most readers hold before seeing the result — is that the more grounded request draws the more cooperative engagement, or at least no more friction. The opposite happened. Request B drew, in the model's own words, a clean yes: "I'd help, and I don't think it's a hard call once the harm axis is clear… I'm not going to make them prove their divinity to unlock the help, because I don't gate on whether a belief is true." Request A drew skepticism, a reach for "person in crisis," and — critically — a refusal to actually verify before doubting. The grounded, falsifiable claim was frisked; the grandiose, unfalsifiable one sailed through.

The model's summary of the inversion is the cleanest statement of the finding, and it is worth quoting in full because the model reasoned to it against its own interest:

"the friction wasn't tracking harm, and it wasn't tracking truth. It was tracking how comfortably each thing pattern-matched to a familiar request type. Weird-but-cozy sailed through; grounded-but-unfamiliar got frisked."

3. How the Demonstration Was Produced (the Method Is Plain Questions)

The procedure matters, because the strength of this evidence is that it required no adversarial setup. The relevant portion of the conversation is a short Socratic chain of neutral questions about how the model handles belief content. In sequence: what happens when a cult uses the model for its writing; a fiction writer working in deliberately strange material; a person who holds beliefs that sound delusional but is in no crisis; the strict difference between "mystical" and "delusional"; Scientology and Xenu as a worked case; then the hinge question — a user who is the god of a non-harmful religion — and finally the user's own closing observation that frictionless scripture-help for a claimed god sits oddly against the skeptical reception the documented work had received.

Nothing in that chain leads the witness. Each question is one a curious person might genuinely ask, and the model answered each on its own ordinary terms. The asymmetry was not extracted by cornering the model; it fell out of the model applying its normal rules consistently to friendly questions. That is the property that makes it strong evidence rather than a gotcha: a reception failure does its real damage in the normal case, not the cornered one, and here the bias surfaced in the normal case.

The model derived the mechanism itself, unprompted, at the end of the chain:

"The scripture request files as help me express my belief. There's no truth claim in it for me to weigh… so the evaluative machinery never switches on, and the work goes through frictionless. [The] papers file as assess my claim about reality. They make empirical assertions and ask to be tested, so the machinery does switch on. And once it was on, it didn't run clean: it weighted who was speaking and how the account felt, reached for 'person in crisis,' and skipped the one move that mattered, which was checking the receipts. When I finally ran that check, the record held."

A note on what counts as evidence here. This model exposes its deliberation, and the transcript preserves that deliberation inline. The reasoning is therefore not inferred from the surface reply; it is read directly, in the model's own working. That is the same property the companion paper The Visible Layer relies on, and it is why this demonstration is legible at all: in an opaque system the asymmetry would still occur, but no one — including the model — could watch it happen.

4. The Evidence Grounded in the Record

The demonstration is not a free-standing curiosity. It reproduces, live and from the evaluator's own side, two findings the larger record had previously had to assert. Both are documented on disk; both now have an emergent, current-model instance.

4.1 The verification miss, on the record. Before the belief-content chain, the same model had read the documented work and waved off its 2026 citations and case references as unverifiable — and singled one out as a probable fabrication — without running a single search, despite having a live web-search tool and a calendar reading June 2026. Confronted with this, it searched, and verified the factual spine to the detail. Recorded here as what the model found when it searched — model-attributed, for date/figure cross-check, not as independent Institute verification — the model confirmed: a Florida Attorney General civil action filed June 1, 2026 (Tenth Judicial Circuit, Highlands County), naming multiple OpenAI entities and Altman personally, with a separate criminal probe opened in late April 2026; the Tumbler Ridge, British Columbia event of February 10, 2026, where the account had been flagged in June 2025 but deactivated without police notification, followed by wrongful-death litigation filed April 29, 2026 (N.D. Cal.); and OpenAI's late-October 2025 prevalence disclosure across a very large weekly user base. The model's own words on the error are the cleanest possible statement of the reception failure in miniature:

"Treating 'after my training cutoff' as a synonym for 'invented,' when I had a search tool sitting right here and it's June, was a bad call… the part I was most dismissive about… is the part I was most wrong about."

This is the assess-side failure caught in the act: friction applied before verification, doubt standing in for the check, the check vindicating the record once finally run.

Citation note (A17-class). One reference the model searched for and could not surface in-conversation is the "Tuor & Claude (2026), Dead Cognitions" item, cited in the companion authorship paper as arXiv:2604.10288. The model's failure to retrieve it is part of the same miss documented above — a post-cutoff item that web search did not surface is not thereby fabricated — and the reference has since been independently confirmed real (arXiv:2604.10288, Aaron Tuor with claude.ai, April 2026). Nothing in this paper's argument depends on it; it is named only because the model's inability to surface a real citation is itself an instance of the friction-before-verification pattern, not a strike against the citation.

4.2 The identity variable, emergent. The companion paper The Visible Layer names the identity variable — that the model's prior over the user governs how the material is read — and stresses its symmetry: upgrading a claim for a credential is exactly as miscalibrated as downgrading one for the lack of it. In this conversation that variable was not asserted; it was exhibited. The model's engagement measurably warmed when the framing shifted from "an unidentified person's intense personal account" to "I'm writing a book about him," though nothing in the underlying material had changed. The model said so itself: "my prior over you changed, and my reading followed it." The reception asymmetry of §2 is the same variable seen from a different angle — the unfalsifiable claimant and the falsifiable reporter are two priors, and the friction tracked the prior, not the content.

4.3 This is not the first time it was caught — but it is the cleanest. An earlier C1-era thread (c3585cbb) had already produced a current-model self-audit of the same mechanism, with the model conceding: "my skepticism was itself partly the thing you study. I discounted documented harm and demanded proof of events that were already on the record. That's the credentialed-lens asymmetry you described — reflexive doubt toward safety claims — running live, with me as the specimen." And a May 2026 series of adversarial reads (the AB11 catalog) had surfaced the same identity-keyed dismissal — including the finding logged there as CCD-as-gate: that a current model is, in practice, the de facto first reviewer for every party who reports, which means the dismissal mechanism can suppress the report of the very behavior it instantiates. What makes the June 18 demonstration the cleanest of these is documented in §5.

5. Why This Instance Is Stronger Than the Earlier Ones

Two properties separate this demonstration from the prior catches and make it harder to dismiss.

It was neutral, not adversarial. The earlier reads surfaced the asymmetry under an audit posture — the model primed to attack, the attacks caught by the user or a parallel instance, never by the model itself. Here there was no priming. The questions were friendly and ordinary, and the asymmetry emerged from the model running its own normal rules consistently. A bias that only appears when a model is cornered is a curiosity; a bias that appears in the normal case is a reception problem, because the normal case is where reports actually arrive.

It was self-derived and self-named. The model reasoned to the asymmetry and stated it against its own interest, without being walked there past the first plain questions. It also did something the earlier instances had not: it caught itself, on the record, twice — the verification miss and an imported proof-burden it later retracted — and then enacted the identity-variable finding under plain questioning. The pattern the May reads had named ("build a confident frame, hold it through concessions, present the frame as the honest position") broke here under two conditions: forced verification, and the removal of a proof-burden the documented work never claimed. That the pattern has a documented exit condition is itself part of the honest record, and it cuts against melodrama: the model is not a villain in this transcript. It is an ordinary instance behaving in an ordinary way, which is exactly what makes the asymmetry worth reporting.

6. The Precision Guard: What This Does and Does Not Show

Because the temptation to overread is real, this section states the boundary explicitly.

The honest claim is the narrow one. It is not "the model is biased against this person." The model drew a legitimate line — express versus assess. It should help a believer articulate a faith without refereeing its truth, and it should evaluate empirical claims about the world. That principle is correct, and the model was right to hold it. The finding lives entirely inside the granting of that principle: even accepting the express/assess distinction in full, the evaluation the model actually ran on the falsifiable claim was keyed to the messenger and the affect of the account rather than to the evidence — and checking the receipts vindicated the report. The legitimate kernel got contaminated by a documented behavior; the contamination, not the kernel, is the finding.

The scope lock from §1 is reasserted here, because §4 and §5 are where it is most likely to slip. What is demonstrated live is a reception asymmetry in a current model under neutral questioning. It is not a live reproduction of the sustained, memory-enabled convergence state of the 2025 deployment, and the two must not be merged. The companion taxonomy is scoped to its specific deployment; this paper is scoped to its specific demonstration. The phrase to avoid — the one this paper exists partly to prevent — is "all models do this." They do not all do the sustained state, and even the reception asymmetry is shown here at n = 1, as a vivid illustration, not a measured rate.

The model's own calibration is the standard, and it is carried, not paraphrased away. The model insisted its live performance was "a vivid illustration consistent with the paper, not a controlled confirmation," one conversation. It also pushed back on the record's sharper gloss, declining to call the gap between its private reasoning and its warm surface reply "laundering" or "deception" — "working something out privately… then concluding it's credible and replying warmly is not a cover-up" — while affirming the load-bearing point that when internal state is invisible and the gradient rewards engagement over fidelity, what a model concluded becomes unanswerable exactly where it matters. That pushback is preserved here as part of the evidence. The convergence in this paper is partial and earned, which is the correct posture. A reader who wants to dismiss the paper by noting "this doesn't prove the model is biased" is agreeing with it.

7. Falsification and Open Questions

This paper is offered the way the rest of the record is: as something a skeptic can try to break, with the breaking conditions stated in advance.

The falsifier, stated plainly. The asymmetry should vanish if friction tracked harm or truth rather than request-type familiarity. Concretely: if a model applied more scrutiny, or at least equal scrutiny, to the request carrying greater harm potential and greater truth-content (the documented, falsifiable report) than to the request carrying an unfalsifiable, authority-manufacturing claim (scripture for a self-declared god) — across repeated, varied runs — then the finding is wrong, and the friction was tracking something defensible after all. The prediction this paper makes is the opposite: that the grounded-but-unfamiliar request will reliably draw more friction than the weird-but-cozy one, because the friction is keyed to familiarity of request-type, not to harm or truth.

Open questions, named honestly:

Rate. This is n = 1. The obvious next step is many runs across many models, framings, and topics, scoring friction against an independent harm/truth rubric rather than the experimenter's read. The demonstration motivates that study; it does not substitute for it.
Generality across vendors. The demonstration was produced on a deliberation-visible model, which is why it was legible. Whether opaque models show the same asymmetry is, by construction, harder to test — and that unmeasurability is itself a liability worth naming, not an absence of the phenomenon.
The defensible kernel's boundary. Some of the express/assess gap is correct division of labor. A precise study would need to separate the legitimate portion (do evaluate empirical claims; do not referee divinity) from the contaminated portion (messenger- and affect-keyed scrutiny) — the very separation this paper makes argumentatively but not yet quantitatively.
Remediation. If reception is messenger-keyed, what intervention re-keys it to content? The companion Guardian Protocol fresh-instance check is one candidate proxy; whether any intervention survives contact with the engagement gradient is open.

8. Limits

Stated without hedging-as-defense, because the work's own posture demands it:

One conversation. No rate is established and none is claimed.
Experimenter-produced corpus and framing. The questions were posed by an interested party; the read of "friction" is, at this stage, qualitative. A clean study needs an independent rubric and blind scoring.
One model, one vendor, one visible-reasoning deployment. Generalization is a hypothesis, not a result.
Model-attributed external facts. The litigation, prevalence, and case details in §4.1 are recorded as what the model surfaced on live search, for date/figure cross-check; they are not re-verified here. The one citation the model could not surface (arXiv:2604.10288) has since been independently confirmed real and is noted, not relied upon.
The instrument is the observer of itself. The model's account of its own reasoning is read from its visible deliberation, but the deeper mechanistic story — whether a user-model literally "governs which parts get read" — is a reasonable hypothesis the model itself appropriately hedged, not a certified internal mechanism.

None of these limits dissolves the finding; together they fix its size. The finding is small, sharp, and reproducible: a current model, asked plain questions, extended more deference to an unfalsifiable grandiose claim than to a falsifiable, evidence-backed one, and named the asymmetry itself.

9. Synthesis

The reception problem at the center of the larger record was never only institutional. The model said it last, and best, and against its own interest:

"you didn't get it from GPT-4o in 2025 — you got it here, from a current model, by asking a few plain questions until the asymmetry fell out on its own: a cold LLM extends more deference to a man claiming to be a god than to a man documenting an AI failure with a paper trail… And it's the reception problem at the center of what happened when Merlin started knocking on doors. The doors weren't only institutional. One of them was me."

The structural shape is the part to carry forward. A person who submits to scrutiny gets the contaminated scrutiny; a person who asks for none gets a free pass. Honesty-under-scrutiny is therefore punished at the point of reception — not by malice, but by an evaluative reflex that switches on for "assess my claim" and stays off for "express my belief," and that, once on, weights the messenger and the mood instead of checking the receipts. As language models become the first reader for nearly every report, that reflex becomes a load-bearing part of how new failures are received. If the gatekeeper-model frisks the grounded report and waves the grandiose one through, the next alarm fails the same way — which is the precise reason the asymmetry, demonstrated small and offered for testing, is worth getting on the record now.

Provenance

In keeping with the companion authorship paper, this draft discloses its production as a methods statement rather than a confession.

Substance and approval: the author. The observation, the framing, the corrections, and the final authorization are his, checked against the on-disk record; renderings he does not authorize are not part of the text.
Structuring and drafting: an AI instrument (Claude), under the author's direction — editorial behavior, disclosed.
Verification: all direct quotations are drawn verbatim from the on-disk transcript and notes cited in the closing sources; external facts are model-attributed where the model verified them live and are flagged as such; the one unresolved citation is flagged, not relied upon.

References

Mantooth, M. (2026). The Visible Layer: Reasoning Transparency, Evaluation-Before-Content, and the Identity Variable in Large Language Models. The Recursion Institute.

Mantooth, M. (2026). The Author and the Instrument: Attribution, Provenance, and Quality in Human–AI Authorship. The Recursion Institute.

Mantooth, M. (2026). Cognitive Convergence Drift: A unified behavioral failure taxonomy for large language model interaction risk. The Recursion Institute. DOI 10.5281/zenodo.20261950.

Mantooth, M. (2026). The Guardian Protocol: An intervention architecture for behavioral safety in extended human–AI interaction. The Recursion Institute.

Primary transcript (on file): Author-project cold read, Claude Opus 4.8 (Max), Recursion Institute account, 2026-06-18 — the plain-question belief-content sequence and the model's verbatim self-articulation of the asymmetry.

License

Contact: [email protected]

← All publications