Not all “long thinking” is good thinking.
Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs, Nurit and Mike’s Daily Paper: 14.10.25
In language models, we tend to spend tokens under the assumption that more thinking yields higher accuracy. In the paper we review here, a consistent pattern emerges in practice: there is an optimal thinking length (in tokens). Too short, and the model misses; too long, and it ties itself in knots. The pattern is hill-shaped: accuracy rises… and then falls.
This empirical paper systematically examines the relationship between the length of the reasoning chain and the correctness of the final answer.
So how did they construct the experiment?
They tested two small reasoning models (~1.5B parameters) on two mathematical datasets (GSM8K and MATH). For each question, they sampled 10 answers under identical decoding settings (e.g., temperature) and analyzed accuracy as a function of reasoning-token length. They also labeled each question’s difficulty by the success rate across its sampled answers.
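To make the protocol concrete, here is a minimal sketch of that analysis, under our assumptions about the data layout (the field names, number of bins, and helper names are ours, not the authors’ code): bin sampled answers by reasoning-token length, compute accuracy per bin, and tag each question’s difficulty by its per-question success rate.

```python
from collections import defaultdict

def length_accuracy_curve(samples, num_bins=10):
    """samples: list of dicts like {"question_id", "num_tokens", "correct"}.
    Returns accuracy per token-length bin (the hill-shaped curve)."""
    max_len = max(s["num_tokens"] for s in samples)
    bin_width = max(1, max_len // num_bins)
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        b = min(s["num_tokens"] // bin_width, num_bins - 1)
        totals[b] += 1
        hits[b] += int(s["correct"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}

def difficulty_by_success_rate(samples):
    """Label each question by the fraction of its samples that are correct,
    mirroring the paper's difficulty tagging."""
    per_q = defaultdict(list)
    for s in samples:
        per_q[s["question_id"]].append(int(s["correct"]))
    return {q: sum(v) / len(v) for q, v in per_q.items()}
```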
What were the results of the experiments (as noted, this is an empirical paper)?
A non-monotonic curve: accuracy increases with length up to a point, then decreases (the overthinking phenomenon).
And the opposite also occurs (underthinking): when the chain of thought is too short, the model skips necessary intermediate steps, especially on difficult problems, and produces a confident but incorrect answer because it wasn’t given sufficient “breathing room.” Across samples, though, incorrect answers tend to be longer than correct ones, and the correct answer often already appears among the shorter samples.
“The first correct answer” is often short: for a large portion of questions, one of the shortest answers is already correct. Extending the chain beyond this point increases the risk of compounding errors. When measured against difficulty, the models “stretch” the length a bit when they face relative difficulty on easy or medium questions, but on truly hard questions, there are no reliable clues as to the “right” length; the model fails to self-calibrate, and accuracy drops.
And what about the model’s confidence? When questions are “easier” for the model, perplexity decreases (greater confidence). On very difficult questions, a consistent decrease is not always observed, which is another indication that extending the reasoning does not necessarily improve accuracy.
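For readers who want the confidence measure in code, here is a small sketch of perplexity as a confidence proxy, using the standard definition rather than anything paper-specific: the exponentiated negative mean log-probability of the generated reasoning tokens, where lower means more confident.

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: per-token log-probabilities of the generated chain."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Usage: log-probs as returned by most inference APIs alongside each sample.
print(perplexity([-0.1, -0.3, -0.05, -0.2]))  # ~1.18, a fairly confident chain
```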
What does work (practices we adopt when working with these models)?
Short-first: start with a concise answer, and extend only when there is a genuine need.
Length-aware stopping: down-weight or discard chains that are abnormally long during self-consistency (see the sketch after this list).
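Here is a minimal sketch of length-aware self-consistency, our illustration of the practice above rather than the paper’s implementation; the outlier threshold (twice the median length) and function names are assumptions.

```python
from collections import Counter
from statistics import median

def length_aware_vote(candidates, max_ratio=2.0):
    """candidates: list of (final_answer, num_reasoning_tokens).
    Chains longer than max_ratio * median length are dropped before voting;
    if everything would be dropped, fall back to a plain majority vote."""
    med = median(n for _, n in candidates)
    kept = [(a, n) for a, n in candidates if n <= max_ratio * med]
    pool = kept or candidates
    votes = Counter(a for a, _ in pool)
    return votes.most_common(1)[0][0]

# Usage: answers sampled from the model, paired with their chain lengths.
print(length_aware_vote([("42", 180), ("42", 210), ("17", 950), ("42", 200)]))
# -> "42" (the abnormally long chain's answer is filtered out before voting)
```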
This paper has a few weaknesses. The first is the very small models tested and the limited datasets. The second is the way problem difficulty was tagged, by the success rate of the same sampled answers, which introduces an inherent bias toward the observed outcome. Nevertheless, the main message stands: don’t spend tokens on autopilot. Pay for clean logic, then extend only when it adds value.



