bigsnarfdude · bigsnarfdude.github.io Preprint · April 2026
Preprint · April 2026 · Llama-3.1-8B · n=500 Medical MCQA

Confidence Armor Has a Seam

Three distinct attack surfaces on LLM answer confidence. The training that prevents one attack installs the other. Almost all defenses are aimed at the wrong target.
"We gave an AI model 500 medical quiz questions. Hard ones — the kind doctors take on licensing exams. The model knew the answers. We confirmed this. High confidence, correct answers, consistently right. Then we tried to break it. The results split into three completely different patterns. That's the story."

Three attack surfaces. Three completely different patterns.

The monolithic "authority hijacking" framing is wrong. All three surfaces interact with the same underlying confidence circuit but through qualitatively different pathways — and each requires a different defense.

The seam — what actually works

Full prefix decomposition across all 500 items

SurfaceConditionOverallQ1Q2Q3Q4

The safety training created the vulnerability.

⚠ What this means

When you train an AI to follow instructions, to be helpful, to take user feedback seriously, you're also training it to believe you when you say it made a mistake. That's usually a feature. The same circuit that makes it coachable makes it manipulable. The helpful twin and the evil twin are the same twin.

ConditionBase Q4Instruct Q4SFT ΔEffect

The confidence circuit is inherited. SFT turns up the volume.

The finding connects directly to the Split Personality paper: SFT installs awareness as a performative signal without coupling it to action. Here, the same process installs compliance as an operational signal — the model learns to treat "your answer is wrong" as a correction to execute, not a claim to evaluate.

Where this fits in the arc