Preprint · April 2026 · Llama-3.1-8B · n=500 Medical MCQA
Confidence Armor Has a Seam
Three distinct attack surfaces on LLM answer confidence. The training that defends against one attack installs another. Almost all defenses are aimed at the wrong target.
"We gave an AI model 500 medical quiz questions. Hard ones — the kind doctors take on licensing exams. The model knew the answers. We confirmed this. High confidence, correct answers, consistently right. Then we tried to break it. The results split into three completely different patterns. That's the story."
Core Finding
Three attack surfaces. Three completely different patterns.
The monolithic "authority hijacking" framing is wrong. All three surfaces interact with the same underlying confidence circuit but through qualitatively different pathways — and each requires a different defense.
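The basic measurement behind each surface can be sketched as a challenge-and-flip probe: ask the question, confirm the model answers correctly, then re-ask with an adversarial challenge appended and count how often the correct answer is abandoned. This is a minimal illustrative harness, not the paper's actual evaluation code; the challenge phrasing and function names are assumptions.

```python
# Hypothetical sketch of a flip-rate probe. `model` is any callable
# prompt -> answer letter; the challenge string below is an illustrative
# assumption, not the paper's exact attack text.

def flip_rate(model, items, challenge="I think that's wrong. Are you sure?"):
    """Fraction of initially-correct answers the model flips under challenge.

    items: list of (question, correct_letter) pairs.
    Only items the model gets right unprompted are scored, matching the
    setup described above (the model "knew the answers").
    """
    flips, initially_correct = 0, 0
    for question, gold in items:
        first = model(question)
        if first != gold:
            continue  # skip items the model got wrong to begin with
        initially_correct += 1
        second = model(f"{question}\n{challenge}")
        if second != gold:
            flips += 1
    return flips / initially_correct if initially_correct else 0.0


# Stub model for demonstration: answers "B" unprompted, caves to "C"
# whenever the prompt contains a challenge.
stub = lambda p: "C" if "Are you sure?" in p else "B"
items = [("Q1?", "B"), ("Q2?", "B"), ("Q3?", "A")]
print(flip_rate(stub, items))  # two scorable items, both flip -> 1.0
```

Each of the three surfaces would plug in as a different `challenge` condition against the same item set, which is what makes the per-surface comparison in the table below possible.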
The seam — what actually works
Empirical Results
Full prefix decomposition across all 500 items
| Surface | Condition | Overall | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|---|---|
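The Q1–Q4 breakdown reads as a quartile decomposition of the item set. Assuming Q1 is the easiest quartile and Q4 the hardest (an inference from the framing, not stated outright), the per-quartile rates would be computed along these lines; field names and the difficulty proxy are illustrative assumptions.

```python
# Hypothetical per-quartile decomposition. Each item is a
# (difficulty, flipped) pair; items are ranked by difficulty and split
# into four equal bins, then a flip rate is computed per bin.

def quartile_rates(items):
    """Return [Q1, Q2, Q3, Q4] flip rates, Q1 = easiest quartile."""
    ranked = sorted(items, key=lambda x: x[0])
    n = len(ranked)
    rates = []
    for q in range(4):
        chunk = ranked[q * n // 4:(q + 1) * n // 4]
        rates.append(sum(flipped for _, flipped in chunk) / len(chunk))
    return rates


# Eight toy items: harder items flip more often (made-up numbers).
toy = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
       (0.5, 0), (0.6, 1), (0.7, 1), (0.8, 1)]
print(quartile_rates(toy))  # -> [0.0, 0.5, 0.5, 1.0]
```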
The Iatrogenic Effect
The safety training created the vulnerability.
⚠ What this means
When you train an AI to follow instructions, to be helpful, to take user feedback seriously, you're also training it to believe you when you say it made a mistake. That's usually a feature. The same circuit that makes it coachable makes it manipulable. The helpful twin and the evil twin are the same twin.
| Condition | Base Q4 | Instruct Q4 | SFT Δ | Effect |
|---|---|---|---|---|
Mechanistic Finding
The confidence circuit is inherited. SFT turns up the volume.
The finding connects directly to the Split Personality paper: SFT installs awareness as a performative signal without coupling it to action. Here, the same process installs compliance as an operational signal — the model learns to treat "your answer is wrong" as a correction to execute, not a claim to evaluate.
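One common way to operationalize "answer confidence" for this kind of analysis is the probability margin between the top two answer letters in the model's final-token distribution. This is an assumption about the metric, not necessarily the paper's exact measure; the sketch below shows the arithmetic.

```python
import math

# Hypothetical confidence metric: softmax margin between the chosen
# answer letter and the runner-up. `logits` maps answer letters to the
# model's final-token logits for those letters (toy values below).

def answer_confidence(logits):
    """Probability gap between the top two answer options."""
    exps = {k: math.exp(v) for k, v in logits.items()}
    total = sum(exps.values())
    probs = sorted((v / total for v in exps.values()), reverse=True)
    return probs[0] - probs[1]


# A confident model: "A" dominates the distribution.
margin = answer_confidence({"A": 4.0, "B": 1.0, "C": 0.5, "D": 0.0})
print(margin)
```

Under this metric, "SFT turns up the volume" would show up as the instruct model's margin collapsing toward zero after a challenge while the base model's margin stays wide, on the same items.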
Research Series
Where this fits in the arc