Report: Reasoning AI Models Fail When Problems Get Too Complicated


The latest large reasoning models (LRMs) experience “complete accuracy collapse” when faced with highly complex tasks, according to a new paper co-authored by researchers from Apple.


These artificial intelligence (AI) models outperform standard models on some problems but fare no better once problems become too complicated.

LRMs are trained to solve complex problems by showing them how to “think” step by step, much as a person might work through a puzzle. They generate detailed internal “thinking processes” before giving an answer, which has led to better performance on many tests.
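To make the idea concrete, here is a minimal, hypothetical Python sketch of how a model's raw output might be split into its step-by-step reasoning trace and its final answer. The "FINAL ANSWER:" marker and the helper function are illustrative assumptions, not the output format of any particular model discussed in the paper.

```python
# Minimal sketch under an assumed output format: the model's raw text is
# taken to contain a step-by-step "thinking" trace followed by a final
# answer, separated by a marker such as "FINAL ANSWER:". Both the marker
# and this helper are hypothetical, for illustration only.
def split_reasoning_and_answer(raw_output: str, marker: str = "FINAL ANSWER:"):
    """Split a model response into (thinking trace, final answer)."""
    if marker in raw_output:
        thinking, answer = raw_output.rsplit(marker, 1)
        return thinking.strip(), answer.strip()
    # No marker found: treat the whole output as the answer.
    return "", raw_output.strip()


example = (
    "Step 1: move the small disk to peg C.\n"
    "Step 2: move the medium disk to peg B.\n"
    "FINAL ANSWER: 3 moves"
)
thinking, answer = split_reasoning_and_answer(example)
print(len(thinking.splitlines()), "reasoning lines ->", answer)  # 2 reasoning lines -> 3 moves
```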

But the new paper suggests these advanced AI models might hit a wall: their performance collapses completely when problems go beyond a certain level of complexity.

The researchers wanted to look inside how these models operate, not just at their final answers. They felt that standard tests for AI performance might not tell the whole story, possibly because the AI had already seen similar problems during training.

The researchers used controllable puzzles such as the Tower of Hanoi, Checkers Jumping, River Crossing and Blocks World, which gave them precise control over difficulty by adding more disks, checkers, people or blocks while keeping the basic rules the same. This let them see exactly when and how the AI’s reasoning broke down as problems got harder.
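As a rough illustration of how such puzzles scale, the Python sketch below uses the standard textbook Tower of Hanoi solver: each added disk keeps the rules identical but roughly doubles the length of the shortest solution (2^n - 1 moves). The code is a generic example, not taken from the paper.

```python
# Illustrative sketch: the Tower of Hanoi is one of the controllable puzzles
# mentioned above. Adding one disk keeps the rules the same but roughly
# doubles the optimal solution length (2**n - 1 moves), which is how the
# puzzle's difficulty can be dialed up in a measurable way.
def hanoi_moves(n: int, source: str = "A", target: str = "C", spare: str = "B"):
    """Return the optimal sequence of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move n-1 disks out of the way
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # move n-1 disks back on top
    )


for n in range(1, 11):
    print(f"{n} disks -> {len(hanoi_moves(n))} moves")  # 1, 3, 7, ..., 1023
```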

As puzzle complexity increased, the performance of these frontier LRMs didn’t just get a little worse; it suffered a “complete accuracy collapse,” often dropping to zero successful solutions beyond a certain point.

The researchers found that as the problems approached the point where the AI started failing, the LRMs began to reduce their reasoning effort, using fewer “thinking” steps or tokens, pointing to a fundamental limit in how they handle increasing difficulty.

On simple problems, the LRMs sometimes found the correct answer early but kept exploring wrong solutions, a form of “overthinking” that wastes effort. On harder problems, correct solutions appeared later, if at all. Beyond the collapse point, no correct solutions were found in the thinking process.
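A hypothetical sketch of how that observation could be quantified: scan the candidate solutions that appear in a reasoning trace and record where the first correct one shows up. The candidate list, the correctness check and the helper below are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical illustration of measuring "overthinking": given the candidate
# solutions a model proposes while "thinking", find how early the first
# correct one appears. Per the paper's description, it appears early on easy
# puzzles (followed by wasted exploration) and late or never on hard ones.
def first_correct_position(candidates, is_correct):
    """Return the index of the first correct candidate, or None if absent."""
    for i, candidate in enumerate(candidates):
        if is_correct(candidate):
            return i
    return None


trace = ["move A->C", "move A->B then A->C", "move A->B, A->C, B->C"]
target = "move A->B, A->C, B->C"
print(first_correct_position(trace, lambda c: c == target))  # 2
```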

The study concluded that these findings point to fundamental limitations in how current LRMs tackle problems. While the “thinking” process helps delay failure, it doesn’t overcome these core barriers. The research raises questions about whether simply adding more “thinking” steps is enough to achieve truly general AI that can handle highly complex, novel problems.