Introduction
We are currently living in a time where Artificial Intelligence, especially Large Language Models like ChatGPT, is deeply integrated into our daily lives and workflows. These models are capable of a wide variety of tasks, from something as complex as writing code to something as simple as summarising a piece of text. But the impressive capabilities of these models are held back largely by a single bottleneck: even though the hardware can run these models at incredibly fast speeds, the actual process of getting a response from them can still feel slow and sluggish.
Motivation
Essentially, for every token the model generates, the entire set of model weights has to be streamed from the GPU's memory into its compute cores, which perform the calculation and write the result back. Because the calculation itself takes far less time than all that data movement, the chip sits idle waiting for the next batch of weights to arrive. This is very wasteful.
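To get a feel for the imbalance, here is a rough back-of-the-envelope sketch. The hardware numbers (an 8B-parameter model in fp16, roughly 2 TB/s of memory bandwidth, roughly 300 TFLOP/s of half-precision compute) are assumptions for illustration, not measurements from the paper.

```python
# Rough back-of-the-envelope estimate of why decoding is memory-bound.
# All hardware numbers are illustrative assumptions, not measurements.

PARAMS = 8e9            # 8B-parameter model
BYTES_PER_PARAM = 2     # fp16 weights
MEM_BANDWIDTH = 2e12    # ~2 TB/s of GPU memory bandwidth (assumed)
COMPUTE = 300e12        # ~300 TFLOP/s of fp16 compute (assumed)

weight_bytes = PARAMS * BYTES_PER_PARAM   # bytes streamed per generated token
flops_per_token = 2 * PARAMS              # ~2 FLOPs per parameter per token

t_memory = weight_bytes / MEM_BANDWIDTH   # time spent moving weights
t_compute = flops_per_token / COMPUTE     # time spent on the actual math

print(f"memory : {t_memory * 1e3:.2f} ms per token")    # ~8 ms
print(f"compute: {t_compute * 1e3:.3f} ms per token")   # ~0.05 ms
# The math finishes more than 100x sooner than the weight transfer,
# so the chip spends most of every decoding step waiting on data.
```

With numbers like these, the GPU's arithmetic units sit idle for the vast majority of each decoding step, and that idle time is exactly the slack the techniques below try to exploit.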
There have been several attempts to devise algorithms that keep the chip busy instead of letting it sit idle between memory transfers. One such technique is Speculative Decoding [2], where a smaller, usually much weaker model drafts multiple future tokens that the main model verifies at once. But because the smaller model is far less capable, it makes many mistakes, which the main model then has to reject, undoing much of the intended speedup. On the other hand, purely parallel diffusion models can write hundreds of tokens at once, but this speed often comes at the cost of accuracy and language coherence. An ideal architecture would lie somewhere in between, combining the accuracy of AR models with the speed of diffusion models.
The Solution: TiDAR
The researchers at Nvidia also thought the same, and hence they propose a novel architecture, which they call TiDAR [1], short for “Think in Diffusion, Talk in Autoregression.”
The genius of TiDAR lies in the way it transforms a process that is usually sequential (as in conventional LLMs) into a parallel process. TiDAR shows that even though Autoregression and Diffusion are two completely different design philosophies, they can still be unified and exploited for their advantages.
To understand it at its core, we’ll have to look at how the input is constructed for this model. For a standard LLM, we simply feed all past words to predict tokens one at a time. In TiDAR, however, we construct a special, three-part input sequence.
Imagine we have the sentence “The cat sat.” Glued together, the completely constructed input sequence would look something like this:

- The Prefix: “The”, “cat”, “sat” (The history we got from the user).
- The Drafts: “on”, “the” (The guesses from the previous step that need to be checked in this iteration).
- The Future Masks: [MASK], [MASK] (Empty slots where we want new guesses).
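Here is a minimal sketch of how such a step input could be assembled. The token strings, the MASK placeholder, and the helper name are all illustrative; a real implementation would work with token IDs from the model's vocabulary.

```python
# Minimal sketch of the three-part TiDAR input for one decoding step.
# Strings stand in for token IDs purely for readability.

MASK = "[MASK]"

def build_step_input(prefix, drafts, num_new_drafts):
    """Concatenate committed history, pending drafts, and empty mask slots."""
    return prefix + drafts + [MASK] * num_new_drafts

prefix = ["The", "cat", "sat"]   # the history we got from the user
drafts = ["on", "the"]           # guesses from the previous step, to be verified now
print(build_step_input(prefix, drafts, num_new_drafts=2))
# ['The', 'cat', 'sat', 'on', 'the', '[MASK]', '[MASK]']
```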
Now that we know how the input sequence is constructed, let's look at how the actual processing happens.

A full diagram of how the TiDAR architecture works
Component 1: “Talking” (The Autoregressive Verifier)
This is the first and most critical part of the model architecture. In this phase, the model’s job is to verify the drafts generated in the previous iteration ("on", "the") and decide if they are good enough to be kept.
How Parallel Verification Works
At this point, you might ask yourself, “If the model has to check whether the drafts are good or not, how is this any faster than just generating them in the first place?” Let’s answer that question.
In a normal Autoregressive model, if you want to generate 5 words, you have to run the model 5 separate times. You feed in word 1 to get word 2, then feed in word 1+2 to get word 3, and so on. The GPU has to load the massive model weights from memory 5 separate times. This is the main bottleneck that needs to be eliminated.
This is exactly what TiDAR fixes: it verifies the draft tokens in one shot, meaning both words ["on", "the"] can be added to the output in just one forward pass. It uses a Causal Attention Mask for this process, which ensures:
- When checking “on”, the model can only see “The cat sat”.
- When checking “the”, the model can only see “The cat sat on”.
Because the GPU is a massive parallel processor, it can calculate the “correctness” of all these drafts simultaneously in a single operation. It is effectively doing 2 steps of work for the price of 1 step. That is where the massive speedup comes from.
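As a sketch of what this looks like, assume a hypothetical model(tokens) call that returns, for every position, the model's next-token distribution under a causal mask (represented here as a simple token-to-probability dictionary). This illustrates the idea; it is not the authors' code.

```python
# Sketch: scoring every draft in ONE forward pass instead of several sequential ones.
# `model(tokens)` is a hypothetical call returning, for each position i, the model's
# distribution over the next token given tokens[: i + 1] (i.e., under a causal mask).

def score_drafts(model, prefix, drafts):
    """Return (draft, model_distribution) pairs from a single forward pass."""
    tokens = prefix + drafts
    dists = model(tokens)                   # one pass over the whole sequence
    scored = []
    for k, draft in enumerate(drafts):
        # The distribution at the position just before the draft is what the
        # model itself would have predicted there.
        scored.append((draft, dists[len(prefix) + k - 1]))
    return scored
```

One forward pass now yields a verdict for every draft position at once, which is where the “2 steps of work for the price of 1” comes from.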
The Instant Correction Mechanism
But what happens if the draft is wrong? What if the drafts were ["in", "pizza"] instead of ["on", "the"]?
The best part is that it doesn’t matter if the drafts are wrong. The correction is virtually free.
The model verifies the drafts by calculating a probability distribution over its vocabulary, conditioned on the context it gets. If the drafts are plausible predictions that the model could’ve chosen, they are selected, but if not, the model chooses the most probable word from the distribution it just calculated.
Since we ran this computation in the same forward pass, we don’t need to run the model again. We simply:
- Discard the bad draft ["in"].
- Instantly swap in the winner ["on"] from the probability list we just calculated.
- Cut off all subsequent drafts ["pizza"] (because they were based on the wrong word).
This guarantees that the final output is exactly as good as if the model had generated it the slow way, step by step. We get the speed of parallel processing with the accuracy of sequential processing.
Component 2: “Thinking” (The Diffusion Drafter)
While the autoregressive “talking” component is busy verifying which tokens to keep and which to reject, the “thinking” component drafts the tokens for the next iteration.
Filling the Empty Slots
Do you remember those [MASK] tokens at the end of our input sequence? The diffusion head tries to fill these blanks so that the autoregressive head can verify them in the next iteration.
For this part specifically, the model looks at all the words in the sequence at once. To do this, it uses a Bidirectional Mask instead of the usual Causal mask, but just for these [MASK] tokens.
Why Bidirectional?
Because the diffusion head has to draft multiple tokens at once, it has to be able to relate every word to every [MASK] slot. It effectively has to capture the “vibe” of the whole sequence to fill in the [MASK] tokens, hence the Bidirectional mask.
For our example sequence, the Diffusion head looks at all the [MASK] tokens together, along with the history (“The cat sat on the”), and tries to “denoise” them into the most plausible and coherent text. It asks, “What 2-word phrase most likely follows ‘The cat sat on the’?” and it might come up with “red mat”.
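Here is a minimal sketch of what that combined attention pattern looks like, sized for our running example (3 prefix tokens, 2 drafts, 2 [MASK] slots). It mirrors the description in this post rather than the authors' exact implementation.

```python
import numpy as np

# Combined attention mask for one TiDAR step (1 = may attend, 0 = blocked).
# Rows are query positions, columns are key positions.
n_prefix, n_draft, n_mask = 3, 2, 2
n = n_prefix + n_draft + n_mask

mask = np.tril(np.ones((n, n), dtype=int))   # causal (lower-triangular) by default
mask[n_prefix + n_draft:, :] = 1             # [MASK] rows: free to attend everywhere

print(mask)
# Rows 0-4 (prefix + drafts) only look backwards; rows 5-6 ([MASK] slots) see everything.
```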
The final attention mask, combining both components, looks like the following:

For the prefix and draft tokens, the mask is a lower-triangular matrix (causal), but the [MASK] tokens face no restriction on where they can attend.
The Continuous Cycle
This creates a continuous cycle:
- In Step 1, the Diffusion head guesses “on the”.
- In Step 2, those guesses move into the “Draft” position.
- The Autoregressive head verifies them (and corrects them if needed).
- Simultaneously, the Diffusion head moves onto guessing the next phrase (“red mat”).
By constantly drafting ahead while verifying behind, TiDAR keeps the GPU fully utilized, ensuring that no computing power is ever wasted.
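Putting the pieces together, the whole cycle can be sketched as a single loop. The forward(tokens) helper here is hypothetical: it stands in for the one shared forward pass that returns both the AR head's distributions (for verification) and the diffusion head's fresh drafts, and it reuses the accept_drafts helper sketched earlier.

```python
# Sketch of the draft-while-verifying loop.
# `forward(tokens)` is a hypothetical call standing for one shared forward pass:
# it returns (ar_dists, new_drafts), i.e. the AR head's distributions at the
# draft positions and the diffusion head's guesses for the [MASK] slots.

MASK = "[MASK]"

def generate(prompt_tokens, forward, max_tokens, block_size=2):
    output = list(prompt_tokens)
    drafts = []                                        # nothing to verify on step one
    while len(output) < max_tokens:
        step_input = output + drafts + [MASK] * block_size
        ar_dists, new_drafts = forward(step_input)     # one pass: verify + draft
        output += accept_drafts(drafts, ar_dists)      # "talk": commit verified tokens
        drafts = new_drafts                            # "think": guesses for next step
    return output
```

Every iteration after the first commits at least one token while simultaneously preparing the next batch of guesses, so no forward pass is ever spent only on drafting.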
The Results
The researchers put TiDAR through a variety of tests to see if their novel approach actually delivers or not. Let’s have a look at what they concluded:
1. Speed: A Massive Leap Forward
The most critical question for this architecture is whether it actually improves inference speed, and it does, quite substantially.
When compared to a standard Autoregressive (AR) model, TiDAR demonstrates a significant increase in throughput. Throughput here refers to the number of tokens the model can generate per second.
- For the 1.5B parameter model, TiDAR achieved a speedup of 4.71x. This means that this architecture can generate the same amount of text nearly 5X faster than a standard LLM architecture.
- For the larger 8B parameter model, the speed-up is even greater, reaching up to 5.91x.
This is a drastic improvement over the conventional next-token prediction scheme, moving away from generating one token at a time to drafting multiple tokens at once.
2. Quality: Closing the Gap
So far, purely diffusion-based LLMs like Dream [4] or LLaDA [5] have always found it difficult to match the reasoning capabilities and coherence of AR models.
TiDAR, however, with its hybrid approach, has managed to close this gap almost perfectly. By using the autoregressive head to verify the draft tokens made by the diffusion head, TiDAR can enjoy the fidelity of AR models and the speed of pure diffusion models simultaneously.
- On benchmarks like HumanEval (coding) [6] and GSM8K (math) [7], TiDAR achieved scores that were “lossless” compared to the baseline AR model.
- In fact, on some metrics, it even slightly outperformed the baseline, likely due to the “look-ahead” nature of the drafting process, which helps the model plan better in reasoning tasks.

This table shows the accuracy scores of peer models when compared to TiDAR. “Trust AR” is the standard mode, where we weigh the AR head’s opinion more than the diffusion head’s opinion when it comes to deciding if the drafts are correct. “Trust Diff” is the mode where we weigh the diffusion head more heavily than the AR head.
3. Efficiency vs. Speculative Decoding
The authors also tested TiDAR against one of the strongest existing methods for speeding up inference, EAGLE-3 [3] (an algorithm based on Speculative Decoding).
As discussed earlier, Speculative Decoding relies on a separate, smaller model to draft future tokens, which the main model can then verify. But the problem is that the smaller model makes a ton of mistakes, leading to rejected tokens and wasted compute. TiDAR, however, uses its own trunk to draft and verify the tokens. This makes the drafted tokens much more accurate and high-quality.
- The “Acceptance Rate” (how often the drafts are correct) was significantly higher for TiDAR for the reason stated above.
- This high acceptance rate means the model spends less time on correcting its mistakes and more time on generating the actual text.

- Shared with base: whether the draft model and the main model share the same trunk.
- Parallel Decoding: whether the drafter writes one token at a time or many tokens at once.
- Parallel to Verification: whether the architecture can draft and verify at the same time.
4. The “Free Token” Advantage
Finally, the results validate the core hypothesis of the paper: that there is enough idle capacity in the GPU to draft extra tokens essentially for free.
The authors’ experiments show that TiDAR’s drafting mechanism adds almost no latency on top of a standard forward pass. In a standard pass, the GPU is memory-bound, which means that moving data on and off the chip is the rate-limiting step rather than the actual compute.
In TiDAR, however, we can load the GPU with extra work instead of letting it sit idle. The graph below shows how many tokens we can draft in one forward pass before the computation itself becomes the bottleneck for the GPU.
It turns out that we can draft roughly 60 tokens per forward pass before the GPU starts being compute-bound.

In the graph above, the x-axis shows the number of drafted tokens and the y-axis shows the latency of the model. In the green region, the curve is flat, meaning there is no increase in latency even as we add more draft tokens. It is only around 60 tokens (the yellow region) that the latency starts rising, signifying that the actual computation now takes longer than moving data to and from memory.
This means that we can theoretically generate 60 tokens at once, for no added latency.
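This crossover point can also be estimated with the same back-of-the-envelope numbers as before. They are assumptions, and real kernels never hit peak throughput, which is why the paper's measured value (~60 tokens) is lower than this idealised estimate.

```python
# Rough estimate of how many tokens fit into one forward pass "for free".
# Same assumed hardware numbers as before; they only set the scale of the answer.

PARAMS = 8e9            # 8B-parameter model
BYTES_PER_PARAM = 2     # fp16 weights
MEM_BANDWIDTH = 2e12    # ~2 TB/s (assumed)
COMPUTE = 300e12        # ~300 TFLOP/s (assumed)

t_memory = PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH   # fixed cost: stream the weights once
flops_per_token = 2 * PARAMS                          # ~2 FLOPs per parameter per token

# The pass stays memory-bound as long as the compute for n tokens
# still finishes before the weight streaming does.
free_tokens = int(t_memory * COMPUTE / flops_per_token)
print(free_tokens)   # ~150 under these idealised assumptions; the paper measures ~60
```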
👉If you liked this piece, I share shorter up-to-date writeups on Substack.
👉And if you want to support independent research writing, BuyMeACoffee helps keep it going.
References
1. Liu, J., Dong, X., Ye, Z., et al. (2025). TiDAR: Think in Diffusion, Talk in Autoregression. arXiv preprint.
2. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. International Conference on Machine Learning (ICML).
3. Li, Y., Wei, F., Zhang, C., & Zhang, H. (2025). EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv preprint.
4. Ye, J., et al. (2025). Dream-7B: Diffusion Large Language Models. arXiv preprint.
5. Nie, S., et al. (2025). Large Language Diffusion Models (LLaDA). arXiv preprint.
6. Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv preprint.
7. Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems (GSM8K). arXiv preprint.
