
Tiny Recursive Models - Why Recursion Beats Scale in AI Reasoning
Niket Girdhar / January 26, 2026
For the last decade, progress in artificial intelligence has been driven by a single dominant idea: scaling. Bigger datasets, larger models, deeper networks, and more compute have consistently produced better results across language, vision, and multimodal tasks.
However, when it comes to hard reasoning problems such as Sudoku, mazes, symbolic transformations, and ARC-AGI-style puzzles, this scaling strategy begins to break down.
A recent paper from Samsung SAIL Montréal introduces the Tiny Recursive Model (TRM), a system with only ~7 million parameters that outperforms much larger models on several logic benchmarks. Instead of relying on scale, TRM relies on recursion and iterative refinement.
This post explains how TRM works, why it succeeds where large language models fail, and what it teaches us about the future of reasoning-centric AI.
The Fundamental Weakness of Large Language Models
Large Language Models (LLMs) generate outputs auto-regressively: one token at a time, left to right. This works extremely well for natural language, but it is poorly suited for problems that require global consistency.
Consider a Sudoku puzzle.
A single incorrect digit violates constraints across rows, columns, and sub-grids. In an auto-regressive setup:
- Early mistakes cannot be revised
- Errors propagate irreversibly
- The model has no mechanism to “rethink” previous steps
This is why many state-of-the-art LLMs achieve near-zero accuracy on Sudoku-Extreme or Maze-Hard, despite having billions of parameters.
The problem is not a lack of intelligence; it is a lack of iterative reasoning.
From One-Shot Prediction to Recursive Reasoning
The core idea behind TRM is simple but powerful:
Reasoning should be a process, not a single forward pass.
Instead of predicting an answer once, TRM loops over the same network multiple times, refining its internal state and output at each iteration.
This turns reasoning into a dynamical system rather than a static mapping.
The TRM State Representation
At every iteration, the model operates on three components:
- x: the input problem (e.g., a Sudoku grid or ARC task)
- yₜ: the current predicted solution
- zₜ: a latent reasoning state
The model learns a function:
(zₜ₊₁, yₜ₊₁) = f(x, zₜ, yₜ)
Here:
- z acts as a continuous, vector-based “chain of thought”
- y stores the current best solution
- The same function f is reused at every step
This reuse is crucial. Instead of adding depth through more layers, TRM adds depth through time.
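To make the loop concrete, here is a minimal sketch of the recursive update in PyTorch. This is an illustration of the (x, z, y) refinement pattern described above, not the paper's actual architecture: the module names (`TinyRecursiveStep`, `refine`), the MLP-based updates, and the step count are all assumptions.

```python
import torch
import torch.nn as nn

class TinyRecursiveStep(nn.Module):
    """One shared refinement step: (x, zt, yt) -> (zt+1, yt+1).

    Illustrative sketch only; the real TRM uses a small transformer-style block.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.update_z = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.update_y = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, y, z):
        # Refine the latent reasoning state from the problem,
        # the current answer, and the previous state.
        z = self.update_z(torch.cat([x, y, z], dim=-1))
        # Refine the answer from the updated reasoning state.
        y = self.update_y(torch.cat([y, z], dim=-1))
        return y, z

def refine(step: TinyRecursiveStep, x, y, z, n_steps: int = 16):
    """Apply the same step function repeatedly: depth through time, not layers."""
    for _ in range(n_steps):
        y, z = step(x, y, z)
    return y, z
```

Note that `refine` calls the same `step` module at every iteration; the loop count, not the parameter count, controls how much computation the model spends on a problem.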
Simulated Depth Through Recursion
Although TRM uses only a 2-layer network, recursion gives it an effective depth of ~40+ layers.
This happens because:
- Each iteration composes the same transformation again
- Gradients flow through the entire unrolled recursion during training
- The model learns how to update its own reasoning state
In practice, this behaves like a deep network, but without the parameter overhead.
This concept is sometimes called time-unrolled depth, and it is central to TRM’s success.
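As a back-of-the-envelope illustration of where a "~40+ layer" figure can come from, consider the arithmetic below. The specific hyperparameter values are assumptions chosen to match that rough figure, not numbers quoted from the paper:

```python
# Illustrative effective-depth arithmetic (assumed hyperparameters).
layers_per_call = 2      # depth of the tiny network itself
calls_per_step = 7       # latent updates plus one answer update per outer step
supervision_steps = 3    # outer refinement steps trained with deep supervision

effective_depth = layers_per_call * calls_per_step * supervision_steps
print(effective_depth)   # 42 composed layers from a 2-layer network
```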
Deep Supervision: Training the Model to Improve Itself
One of the most important techniques used in TRM is deep supervision.
Instead of supervising only the final output:
- The model is trained to produce better answers at every iteration
- Loss is applied across multiple refinement steps
This teaches the network a skill that standard models never learn:
How to improve a partially wrong solution.
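A minimal sketch of what this looks like in training code is shown below, reusing the hypothetical `step` function from the earlier sketch. The loss function, step count, and the use of a dense `target` embedding are all assumptions for illustration; the paper's actual objective and its handling of gradients between supervision steps are more involved.

```python
import torch.nn.functional as F

def deep_supervision_loss(step, x, y, z, target, n_steps: int = 6):
    """Sketch of deep supervision: apply the loss at every refinement
    step, not just the last one. All names here are illustrative.
    """
    total = 0.0
    for _ in range(n_steps):
        y, z = step(x, y, z)
        # Supervising every iteration teaches the network to *improve*
        # whatever partial answer it currently holds.
        total = total + F.mse_loss(y, target)
    return total / n_steps
```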
As a result:
- Early iterations produce rough guesses
- Later iterations progressively correct errors
- The model learns how to reason, not just what the answer looks like
Why Smaller Models Generalize Better Here
TRM’s parameter count (~5–7M) is not a limitation; it is an advantage.
Logic benchmarks typically have:
- Very small datasets (often ~1,000 examples)
- Strong structural rules
- Minimal tolerance for memorization
Large models tend to overfit in this regime. TRM avoids this by:
- Limiting representational capacity
- Forcing reuse of the same reasoning function
- Treating recursion depth as compute, not parameters

In ablation studies, increasing the number of layers reduced performance, confirming that minimalism improves generalization for these tasks.
Architectural Simplification from HRM to TRM
TRM is derived from the earlier Hierarchical Reasoning Model (HRM), but it removes much of HRM’s complexity:
- Two networks → one network
- Hierarchical latent states → functional (z, y) states
- Fixed-point assumptions → full backpropagation through recursion
- Complex halting losses → a single learned halting mechanism
This simplification made the model:
- Easier to train
- More stable
- More interpretable
- More effective
In other words, less theory, better results.
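To give a feel for the single learned halting mechanism, here is one simple way such a head could be implemented. This is an assumed design for illustration, not the paper's exact formulation: a scalar head reads the reasoning state and predicts whether refinement should stop.

```python
import torch
import torch.nn as nn

class HaltingHead(nn.Module):
    """Hypothetical halting head: scores the current reasoning state
    and outputs the probability that refinement should stop."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, z):
        # z: [batch, dim] pooled reasoning state -> halt probability per example.
        return torch.sigmoid(self.score(z)).squeeze(-1)

# Inference-time usage sketch: loop until the halt probability crosses a threshold.
#     y, z = step(x, y, z)
#     if halting_head(z).item() > 0.5:
#         break
```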
Attention Is Optional (and Sometimes Harmful)
TRM also challenges another assumption: that self-attention is always necessary.
For fixed-structure tasks like Sudoku:
- Replacing attention with a simple MLP improved generalization
- Removing attention reduced overcapacity
However, for tasks like ARC-AGI:
- Spatial relationships matter
- Self-attention remains essential
This shows that architecture should match task structure, not default to general-purpose components.
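For fixed-size inputs, the attention replacement can be as simple as an MLP that mixes information across positions instead of channels, in the spirit of MLP-Mixer-style token mixing. The sketch below is an assumed illustration of that idea, not TRM's exact layer:

```python
import torch.nn as nn

class TokenMixingMLP(nn.Module):
    """Sketch: replace self-attention with an MLP over the sequence axis.
    Viable when the input structure is fixed (e.g., a 9x9 Sudoku grid
    always has 81 cells in the same positions)."""
    def __init__(self, seq_len: int, hidden: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))

    def forward(self, x):           # x: [batch, seq_len, dim]
        x = x.transpose(1, 2)       # mix across positions, not channels
        x = self.mix(x)
        return x.transpose(1, 2)
```

Unlike attention, this layer hard-codes the number of positions, which is exactly why it suits fixed grids like Sudoku but not variable-structure tasks like ARC.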
Stability Through Exponential Moving Average (EMA)
Training recursive systems can be unstable. TRM addresses this using:
- Exponential Moving Average (EMA) of weights
EMA smooths parameter updates and:
- Prevents divergence
- Improves convergence
- Increases final accuracy significantly
This is especially important when training with small datasets and deep recursion.
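EMA itself is a standard technique, and a minimal version looks like the sketch below. The decay value is an assumption (a common default), not a number from the paper:

```python
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.999):
    """Exponential moving average of weights: ema = decay * ema + (1 - decay) * new."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)

# Usage sketch: keep a frozen copy of the model (e.g., via copy.deepcopy),
# call update_ema(ema_model, model) after each optimizer step,
# and run evaluation with the smoothed EMA copy.
```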
Why TRM Outperforms Massive Models
Putting everything together, TRM succeeds because it replaces:
- Scale → Iteration
- One-shot prediction → Self-correction
- Text-based reasoning → Latent vector reasoning
- Model depth → Time-unrolled depth
Instead of guessing harder, the model thinks longer.
Implications for the Future of AI Reasoning
TRM demonstrates that:
- Reasoning does not require billions of parameters
- Iterative refinement is a powerful inductive bias
- Small models can outperform large ones on the right problems
This has major implications for:
- Edge and low-compute AI
- Formal reasoning systems
- Hybrid neuro-symbolic models
- The future of ARC-style benchmarks
The scaling era is not over, but TRM shows that scaling is not the only path forward.
Final Takeaway
Tiny Recursive Models do not aim to replace large language models.
They expose something deeper:
Intelligence is not just about knowing more; it is about revising what you know.
And recursion may be the most efficient way to do exactly that.
Paper link: