Think-on holds; think-off leaks.
Project page
No-thinking controls are not a uniform switch: they compress Boolean verification, weaken on multiple choice, and fail most visibly on open-ended generation.
Overview
We split each model response into visible pre-answer text T and final answer A. A model is closer to no-thinking only when it preserves task accuracy while exposing little or no question-conditioned payload before the answer.
Task accuracy of the final answer.
Empty-thinking ratio for visible pre-answer text.
Semantic relevance between the question and visible payload.
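The three measurements above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's evaluation code: the `Final answer:` delimiter, the exact-match scoring, the reading of the empty-thinking ratio as the fraction of responses with empty pre-answer text, and the token-overlap stand-in for semantic relevance are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass

# Hypothetical delimiter separating pre-answer text from the final answer.
ANSWER_MARKER = "Final answer:"


@dataclass
class Split:
    pre_answer: str  # T: visible text before the answer
    answer: str      # A: the final answer


def split_response(response: str, marker: str = ANSWER_MARKER) -> Split:
    """Split a response into (T, A) at the last occurrence of the marker."""
    head, sep, tail = response.rpartition(marker)
    if not sep:
        # No marker found: treat the whole response as the answer, with empty T.
        return Split(pre_answer="", answer=response.strip())
    return Split(pre_answer=head.strip(), answer=tail.strip())


def accuracy(splits: list[Split], gold: list[str]) -> float:
    """Case-insensitive exact match of A against the gold answer (assumed scoring)."""
    correct = sum(s.answer.lower() == g.lower() for s, g in zip(splits, gold))
    return correct / len(gold)


def empty_thinking_ratio(splits: list[Split]) -> float:
    """Fraction of responses whose T is empty (assumed definition)."""
    return sum(not s.pre_answer for s in splits) / len(splits)


def token_overlap(question: str, payload: str) -> float:
    """Jaccard overlap of whitespace tokens, a crude proxy for semantic relevance."""
    q = set(question.lower().split())
    p = set(payload.lower().split())
    if not q or not p:
        return 0.0
    return len(q & p) / len(q | p)
```

Under these assumptions, a response that begins directly with `Final answer:` has an empty T and counts toward the empty-thinking ratio, while a response that restates or works through the question before the marker contributes a non-empty payload that can then be scored for relevance. An embedding-based similarity could replace `token_overlap` without changing the rest of the sketch.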
Core result
Native think-off controls compress Boolean verification most reliably, are inconsistent on multiple choice, and fail almost completely on open-ended tasks.
Stricter answer-only constraints may raise the empty-thinking ratio (ETR) on open-ended tasks, but can also remove work needed for accuracy.
Rewriting the same math questions into Boolean, MCQ, and open-ended forms changes how much visible payload remains.
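To make the format rewriting concrete, the sketch below renders one question in the three forms. The templates, function names, and wording are hypothetical illustrations, not the prompts used in the project.

```python
import string


def as_open_ended(question: str) -> str:
    """Open-ended form: the model must produce the answer itself."""
    return f"{question}\nAnswer:"


def as_mcq(question: str, options: list[str]) -> str:
    """Multiple-choice form: the model picks one lettered option."""
    lines = [f"{letter}. {opt}" for letter, opt in zip(string.ascii_uppercase, options)]
    return f"{question}\n" + "\n".join(lines) + "\nAnswer with a single letter."


def as_boolean(question: str, candidate: str) -> str:
    """Boolean form: the model verifies a proposed answer."""
    return (
        f"{question}\n"
        f"Proposed answer: {candidate}\n"
        "Is the proposed answer correct? Answer yes or no."
    )


if __name__ == "__main__":
    q = "What is 17 + 25?"
    print(as_open_ended(q))
    print(as_mcq(q, ["40", "42", "43", "52"]))
    print(as_boolean(q, "42"))
```

Holding the underlying question fixed while varying only the answer format is what allows differences in visible pre-answer payload to be attributed to the format rather than to the question's difficulty.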
Experiments
Resources
Contact: kevin.qh.lin@gmail.com
Code & Demo: github.com/LeiDQ/ThinkZero
Paper: coming soon
Website: leidq.github.io/ThinkZero/
Citation
@misc{lei2026llmskeepthinking,
title = {LLMs Keep Thinking When Told Not To},
author = {Dianqiao Lei and Kevin Qinghong Lin and Pan Lu and Philip Torr and James Zou},
year = {2026},
note = {Preprint. Citation details coming soon.}
}