Groq LPU

October 14, 2024 · 6 min read

The AI world was recently stunned when Groq, a relatively unknown hardware startup, announced that their LPU could run Llama 70B at 300 tokens per second. To put this in perspective, that is roughly the speed needed to generate the entirety of Shakespeare's works in about 30 minutes, or to process a full novel in less than a minute. Traditional GPU architectures, in contrast, generate only 10-30 tokens per second on the same model. (For context, one token is roughly equal to one word.) But to understand why this is a breakthrough, we first need to understand how modern AI processors work and why the LPU approach is so revolutionary.

Modern GPUs: The Current Workhorses of AI

Think of a modern GPU (say, NVIDIA's H100) as a vast city of tiny workers, all specialised in doing calculations very quickly. These workers are arranged into neighbourhoods (Streaming Multiprocessors, or SMs), each of which contains specialised units for various types of calculations. These neighbourhoods coordinate with each other when a large language model, such as Llama 70B, sends them work for text processing.

But here's where things get messy. Like a city during rush hour, modern GPUs face traffic problems. They have various systems in place to handle this traffic: cache memories (like local storage), branch prediction (trying to guess which route work will take), and out-of-order execution (letting faster tasks skip ahead of slower ones). While these features increase average performance, they make it impossible to know precisely how long any given operation will take, a phenomenon known as non-determinism.

The Problem with Non-Determinism

Non-determinism in processors is like trying to predict your journey time in a busy city. Sometimes you'll get lucky and hit all green lights, while other times you will be stuck in unexpected traffic. In GPU terms, that means the same AI operation can take 100 microseconds one time and 150 the next, depending on factors like:

Cache hits and misses (depending on whether data is in local "storage" or must be fetched from farther away)

Memory access patterns (multiple workers trying to access the same data)

Branch prediction success or failure (guessing which path a calculation will take, sometimes wrongly)

Resource contention (other processes contending for the same compute resources)

Such variability is acceptable in many applications. But in large language models, where each operation depends on the results of the previous ones, these tiny delays compound into massive slowdowns.
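This compounding effect can be illustrated with a small simulation. The operation count, base latency, and jitter range below are made-up numbers chosen purely for illustration, not measurements of any real hardware:

```python
import random

random.seed(42)

def generate_tokens(n_ops, base_us=100.0, jitter_us=50.0, deterministic=False):
    """Total time (in microseconds) for n_ops strictly sequential operations.

    Each operation takes base_us; on a non-deterministic processor a random
    extra delay of up to jitter_us (cache miss, contention, ...) is added.
    Because the operations are sequential, every delay accumulates.
    """
    total = 0.0
    for _ in range(n_ops):
        delay = 0.0 if deterministic else random.uniform(0.0, jitter_us)
        total += base_us + delay
    return total

n = 100_000  # sequential operations in one generation pass
print(f"deterministic:     {generate_tokens(n, deterministic=True) / 1e6:.2f} s")
print(f"non-deterministic: {generate_tokens(n) / 1e6:.2f} s")
```

With these toy numbers the deterministic run takes exactly 10 seconds, while the jittery run averages around 12.5 seconds: each individual delay is tiny, but none of them can be hidden when every operation waits on the one before it.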

Enter the Tensor Streaming Processor

That's where Groq's innovative approach comes in: instead of building a city with complex traffic systems, they've designed a massive assembly line. Their Tensor Streaming Processor (TSP), the building block of the LPU, completely reimagines how a processor should work.

It eliminates most sources of non-determinism by rearranging the computational units in a novel way. Instead of having several general-purpose cores, it organises different kinds of computational units (matrix multiplication, memory operations, vector processing) into vertical "slices" through which data flows in a carefully orchestrated manner. Imagine a vast conveyor-belt system in which data moves in precise, predetermined patterns.

[Figure: the TSP's instruction pipelines, including the integer ALU pipeline and the memory (load/store) pipeline]

What is unique about this approach is that the compiler (the software that translates AI models into machine instructions) knows exactly where each piece of data will reside at every microsecond. There's no guessing, no congestion, no unexpected delays. It's like a perfectly orchestrated train system in which every train knows exactly when it will arrive at each station.
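To make the idea of a fully static schedule concrete, here is a toy sketch. The unit names and operations are invented for illustration and are not Groq's actual instruction set; the point is that the "compiler" emits a fixed cycle-by-cycle plan, so the total runtime is known before anything executes:

```python
# A toy statically scheduled "conveyor belt": the compiler emits a fixed
# cycle-by-cycle plan instead of letting hardware decide at run time.
SCHEDULE = [
    (0, "memory", "load A"),
    (1, "memory", "load B"),
    (2, "matmul", "A @ B"),
    (3, "vector", "add bias"),
    (4, "memory", "store C"),
]

def run(schedule):
    # Every instruction fires at exactly its planned cycle; there is no
    # cache, no branch predictor, and no contention to perturb timing.
    for cycle, unit, op in schedule:
        print(f"cycle {cycle}: {unit:6s} -> {op}")
    return len(schedule)  # total cycles, known before execution starts

run(SCHEDULE)
```

Real GPUs cannot offer this guarantee because hardware schedulers, caches, and contention decide timing dynamically; in the TSP model those decisions are all made once, at compile time.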

The Real-World Impact

Let's put that 300 tokens per second number in perspective. When you're using ChatGPT or Claude, you usually see responses appear at about human reading speed, call it 10 tokens per second. But in applications where speed really matters (processing huge documents, analysing real-time data streams, or running multiple conversations simultaneously) the difference between the 10-30 tokens per second typical of a GPU and the 300 tokens per second of Groq's LPU is revolutionary.

To make this concrete: processing a typical business document of 2000 words (roughly 2500 tokens) would take:

Roughly 83 seconds on a standard GPU configuration

Only 8 seconds on Groq's LPU
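These figures follow directly from the token counts; a quick sanity check of the arithmetic, assuming 30 tokens per second for the GPU and 300 for the LPU:

```python
TOKENS = 2500  # a ~2000-word business document

def seconds_to_process(tokens: int, tokens_per_second: float) -> float:
    """Time to process a document at a fixed throughput."""
    return tokens / tokens_per_second

print(f"GPU at 30 tok/s:  {seconds_to_process(TOKENS, 30):.0f} s")   # ~83 s
print(f"LPU at 300 tok/s: {seconds_to_process(TOKENS, 300):.1f} s")  # ~8.3 s
```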

For applications such as document analysis, customer service, or scientific research, where AI has to process huge amounts of text quickly, this order-of-magnitude improvement could be transformative.

The Future Implications

The success of Groq's approach raises interesting questions about the future of AI hardware. After decades of processor development focused on improving average-case performance at the cost of accepting non-determinism, the LPU points toward another way: designing specifically for predictability and throughput rather than raw computational power.

This may lead to a divergence in AI hardware: GPUs for training and for tasks where variability is acceptable, and LPU-like architectures for applications where consistent, high-throughput inference matters. Deterministic performance will only become more significant as AI is integrated into more real-time applications, such as autonomous cars and financial trading systems.

The revolution in AI hardware isn't just about raw speed; it's a rethinking of basic assumptions about how to improve performance. That is what Groq's LPU really shows: sometimes the best way forward is not an incremental improvement of an existing design but a fundamentally different approach. As we continue pushing the boundaries of what AI can do, innovations like this will be important in making artificial intelligence not just more powerful but also more practical and reliable for real-world applications.