Compute Obsession Is Slowing Down AI Systems
Why data movement costs more than computation and what most engineers building AI systems are getting wrong because of it
Engineers building AI systems today tend to focus on compute first: how many GPU cores, how many parameters, how much VRAM, and how to extract more from all of it. Benchmarks measure throughput and inference speed, and infrastructure conversations center on scaling horizontally across more hardware.
Jim Ledin, a seasoned engineering leader, CEO of Ledin Engineering and author of Modern Computer Architecture and Organization (third edition, Packt), thinks that framing misses the most important constraint in production AI systems. The bottleneck holding back real-world AI performance is not compute but data movement.
“Data movement can often be more expensive than the actual computation steps,” Ledin says. “The latency, especially moving large data structures across different levels of the memory hierarchy, can dominate and leave a lot of your compute bandwidth idle.” This is not a niche embedded systems concern. It is happening in the largest AI deployments in the world, and it is the reason hardware vendors like NVIDIA are designing systems the way they are today.
Memory bandwidth is slowing your AI system more than your GPU is
When a CPU or GPU requests data from memory and that data is not available in cache, the processor waits while the computation units sit idle. In a consumer application, that idle time seems like a minor inconvenience. But in an AI system processing large tensors continuously, it accumulates into a significant fraction of total runtime.
“AI workloads are becoming increasingly memory bandwidth limited,” Ledin shares, pointing to a dynamic that is reshaping how AI hardware gets built. “It is taking more time to bring data into the GPU or TPU memory than it is taking for the computation to take place on the data.” The raw ability to multiply matrices is no longer the binding constraint; getting the data to the multipliers fast enough is.
This is exactly why high bandwidth memory exists. HBM modules are stacks of RAM chips built into a cube, physically close to the processing units, with far higher data transfer rates than conventional DRAM. “On a TPU card, you typically have several of these HBM modules,” Ledin explains, “and they have a far higher data rate for transferring data in and out of the GPU processing components than on a typical consumer grade GPU.” The engineering bet being made with systems like NVIDIA’s Blackwell architecture is that memory bandwidth is worth more than raw core count, because the cores are already faster than the data can reach them.
But there is a side effect that touches anyone buying consumer hardware. “A lot of the production capacity for memory is going into these high bandwidth memory modules, which cost a lot more for the purchaser and make a lot more money for the vendor,” Ledin observes. That is a direct reason DDR5 has been difficult to find and expensive when available. The memory fabs are prioritizing the more profitable HBM production, and consumer DRAM is downstream of that decision.
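To make the memory-bound claim from this section concrete, here is a minimal roofline-style sketch in Python. It is an illustration under stated assumptions, not a measurement: the peak compute and bandwidth figures are hypothetical placeholders rather than specifications for any real accelerator, and the function names are invented for this example. It estimates how many floating-point operations a matrix-vector product performs per byte it moves, then compares that with the ratio the hardware would need to keep its cores busy.

```python
# Roofline-style estimate: is a kernel limited by compute or by memory bandwidth?
# The hardware figures are hypothetical placeholders, not vendor specifications.

PEAK_FLOPS = 100e12      # hypothetical accelerator peak: 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # hypothetical memory bandwidth: 1 TB/s


def matvec_cost(rows, cols, bytes_per_value=2):
    """FLOPs and bytes moved for y = W @ x, assuming 16-bit values.

    Each weight is read once and used for one multiply and one add.
    """
    flops = 2 * rows * cols
    bytes_moved = bytes_per_value * (rows * cols + cols + rows)
    return flops, bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * peak_bandwidth)


flops, bytes_moved = matvec_cost(4096, 4096)
intensity = flops / bytes_moved              # FLOPs per byte moved
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH    # intensity needed to keep the cores busy
achieved = attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
memory_bound = intensity < ridge_point
utilization = achieved / PEAK_FLOPS

print(f"arithmetic intensity: {intensity:.2f} FLOPs/byte")
print(f"ridge point:          {ridge_point:.0f} FLOPs/byte")
print(f"memory bound:         {memory_bound}")
print(f"compute utilization:  {utilization:.1%}")
```

With these placeholder numbers the matrix-vector product performs roughly one operation per byte moved, far below the ridge point of 100, so in this model the multipliers spend most of their time waiting on memory. Raising bandwidth moves the ridge point down, which is the bet the section describes.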
The hardware cost your cloud bill is hiding
Most software engineers, especially those working in cloud environments, treat the hardware as someone else’s concern. The abstraction is good enough, the managed services handle the infrastructure, and the code runs somewhere. Ledin’s argument is that this hands-off relationship with hardware has a real cost that shows up in performance and in cloud bills.
“If your code is accessing memory in inefficient patterns, if you are not using the cache memory within the processor in an effective manner, and if you are just moving data around more than is necessary, that can all have significant performance impacts,” he warns. The CPU requests data from memory, and if it is not in cache, it waits. “A lot of the time it is unavoidable, but the amount of latency can be minimized by different ways of optimizing algorithms.”
The mechanics are specific. When a modern CPU reads from DRAM, even a single byte triggers a 64-byte cache line transfer. The processor brings in a block of adjacent memory whether it needs all of it or not. If the algorithm then jumps to a different memory location, causes that block to be evicted from cache, and later needs it again, it has to re-read it from DRAM. That is wasted time. “For best efficiency, you would want your code to be working with data from that block before it moves on to something else,” Ledin explains, “rather than bouncing around to other memory locations.”
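The eviction-and-refetch behavior described above can be sketched with a toy cache model in Python. This is a simplification: real caches are set-associative and use hardware-specific replacement policies, while this sketch assumes a small, fully associative cache with least-recently-used eviction. The block count, cache capacity, and function name are hypothetical choices for illustration; only the 64-byte line size comes from the text.

```python
from collections import OrderedDict

CACHE_LINE_BYTES = 64   # cache line size given in the text
ELEMENT_BYTES = 8       # assumption: 8-byte values, so 8 elements per line


def count_misses(addresses, capacity_lines):
    """Count misses in a toy fully associative LRU cache of whole lines."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        line = addr // CACHE_LINE_BYTES
        if line in cache:
            cache.move_to_end(line)          # mark as most recently used
        else:
            misses += 1                      # the whole 64-byte line is fetched
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)    # evict the least recently used line
    return misses


BLOCKS = 16             # hypothetical: 16 cache-line-sized blocks of data
PER_BLOCK = CACHE_LINE_BYTES // ELEMENT_BYTES
CAPACITY = 4            # hypothetical: the cache holds only 4 lines

# Finish every element of a block before moving to the next block.
block_at_a_time = [b * CACHE_LINE_BYTES + k * ELEMENT_BYTES
                   for b in range(BLOCKS) for k in range(PER_BLOCK)]

# Touch one element per block, then cycle back for the next element.
bouncing = [b * CACHE_LINE_BYTES + k * ELEMENT_BYTES
            for k in range(PER_BLOCK) for b in range(BLOCKS)]

local_misses = count_misses(block_at_a_time, CAPACITY)
bouncing_misses = count_misses(bouncing, CAPACITY)
print(f"block at a time: {local_misses} misses")
print(f"bouncing:        {bouncing_misses} misses")
```

Both access orders read exactly the same 128 elements. In this model the block-at-a-time order fetches each line once, while the bouncing order evicts every line before it is needed again and so misses on every access.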
In a cloud environment, this inefficiency does not just slow things down. It costs money, and there is no incentive for cloud providers to surface it clearly. “You are paying for the usage of the system whether the CPU is actually crunching instructions or the CPU is idle waiting for a data item to come in from memory,” he points out. The cloud bill does not distinguish between productive cycles and stall cycles. Engineers who understand cache locality can write code that reduces stalls and therefore reduces cost, not just latency. Optimizing for cost comes down to understanding your memory access patterns and engineering around them, not just choosing the right managed tooling stack.
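The billing argument is simple arithmetic, sketched below in Python. Every figure is a hypothetical placeholder: the hourly price, the hours billed, and the stall fractions are invented for illustration, and in practice a stall fraction would have to be measured, for example with hardware performance counters. The sketch also assumes the billed hours stay the same, which understates the benefit if reducing stalls lets the work finish sooner.

```python
def stall_cost(hourly_price, hours, stall_fraction):
    """Portion of the bill spent while the CPU waits on memory."""
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    return hourly_price * hours * stall_fraction


def cost_per_useful_hour(hourly_price, stall_fraction):
    """Effective price of one hour of cycles spent doing actual work."""
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    return hourly_price / (1.0 - stall_fraction)


HOURLY_PRICE = 3.00    # hypothetical instance price, USD per hour
HOURS = 720            # one 30-day month of continuous use
STALL_BEFORE = 0.40    # hypothetical stall fraction before optimization
STALL_AFTER = 0.25     # hypothetical stall fraction after improving locality

for label, stall in (("before", STALL_BEFORE), ("after", STALL_AFTER)):
    wasted = stall_cost(HOURLY_PRICE, HOURS, stall)
    effective = cost_per_useful_hour(HOURLY_PRICE, stall)
    print(f"{label}: ${wasted:.2f} billed for stalls, "
          f"${effective:.2f} per useful hour")
```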
Drawing from his engineering work across embedded and production systems, Ledin shares a useful example. A Linux web server called Tux, which ran in kernel space to avoid user-to-kernel data transfers, developed a performance problem under high load because its per-request state data grew large enough to exceed the CPU’s level two cache. “Performance dropped off sharply,” he recalls. Engineers analyzed the cache behavior, restructured the data layout to keep per-request state smaller, and did the same for instruction caching by batching related processing together. “Fixes that they implemented increased the application performance by about 40%.” No new hardware, no architectural overhaul. Just understanding where the memory ceiling was and designing around it.
GPUs are the right tool, but not always for the reason you think
The assumption that GPUs are the correct architecture for AI workloads is not wrong, but it is incomplete in a way that matters for engineers making infrastructure decisions. Ledin draws a distinction that is often glossed over in the mainstream conversation about AI hardware.
“GPUs are probably the ideal architecture today for people and small companies that want to run language models locally,” he says, drawing from personal experience. He recently ran the Gemma 4 26-billion-parameter model on an NVIDIA RTX 4090, and for that use case the GPU is the right tool. But for larger-scale deployments running the much larger frontier models, the picture is different. “The trend there is for dedicated TPUs,” he notes.
The distinction matters because GPUs carry silicon dedicated to graphics work that has nothing to do with tensor operations. A consumer GPU has hardware for real-time video rendering, gaming pipelines, and display output. A TPU does not. “TPUs do not use up silicon for that purpose and focus everything on the tensor work,” Ledin explains. When you are running thousands of inference requests at scale, that difference in silicon allocation translates directly into efficiency on the workload that actually matters.
There is also the SIMT execution model to understand. Modern NVIDIA GPUs run threads in groups of 32, called warps, that execute in lockstep: every thread in the group executes the same instruction on its own data. This is efficient for linear, parallel workloads. When those threads hit a branch, a conditional where some threads take the if path and some take the else path, the hardware executes one side and then goes back and executes the other. “You basically have effectively a pipeline stall where it has to go back and execute a different thread in that kind of situation,” Ledin highlights. The flexibility is there, but it comes at a cost. “Avoiding branching if possible can have a significant impact on performance.”
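The cost of a divergent branch can be sketched with a simplified model in Python. This is not GPU code and it ignores many real scheduling details; it assumes only that a group of 32 lockstep threads must step through every path that at least one of its threads takes, with the other threads masked off. The per-path instruction counts and function names are hypothetical.

```python
WARP_SIZE = 32  # threads that execute in lockstep, per the text


def warp_slots(takes_if, if_cost, else_cost):
    """Issue slots one 32-thread group spends on an if/else in this model.

    If any thread takes a path, the whole group steps through that path
    while the threads on the other path are masked off.
    """
    if len(takes_if) != WARP_SIZE:
        raise ValueError("expected one predicate per thread in the group")
    slots = 0
    if any(takes_if):
        slots += if_cost
    if not all(takes_if):
        slots += else_cost
    return slots


def total_slots(predicates, if_cost, else_cost):
    """Split per-thread predicates into groups of 32 and add up their slots."""
    return sum(
        warp_slots(predicates[i:i + WARP_SIZE], if_cost, else_cost)
        for i in range(0, len(predicates), WARP_SIZE)
    )


IF_COST, ELSE_COST = 10, 12   # hypothetical instruction counts per path

# 64 threads, half on each path, interleaved: both groups diverge.
interleaved = [i % 2 == 0 for i in range(2 * WARP_SIZE)]

# The same predicates arranged so each group of 32 is uniform.
grouped = sorted(interleaved, reverse=True)

print("interleaved:", total_slots(interleaved, IF_COST, ELSE_COST))
print("grouped:    ", total_slots(grouped, IF_COST, ELSE_COST))
```

In this model the interleaved arrangement costs twice as many issue slots as the grouped one, even though the same number of threads take each path. Arranging data so that neighboring threads take the same path is one way to act on the advice to avoid branching, alongside removing the branch altogether.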
For engineers deciding where to run inference workloads, Ledin offers a practical heuristic. “The GPU only really becomes attractive when you have enough work for it to do that it can be parallelized and enough that it will amortize the costs associated with moving data onto the GPU, launching the kernels, and doing the management work to transfer data to and from the GPU.” If the workload is not large enough to keep the GPU busy, the CPU implementation may be faster because it avoids all that overhead entirely.
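Ledin's heuristic can be expressed as a rough break-even model, sketched here in Python. All timings, the transfer size, and the link bandwidth are hypothetical placeholders chosen only to show the shape of the tradeoff, and the function names are invented for this example. A real decision would rest on measured numbers for the actual workload and hardware.

```python
import math


def cpu_time(n, cpu_per_item):
    """Seconds to process n items on the CPU."""
    return n * cpu_per_item


def gpu_time(n, gpu_per_item, bytes_per_item, link_bandwidth, fixed_overhead):
    """Fixed launch and management overhead, plus transfer, plus compute."""
    transfer = n * bytes_per_item / link_bandwidth
    return fixed_overhead + transfer + n * gpu_per_item


def break_even_items(cpu_per_item, gpu_per_item, bytes_per_item,
                     link_bandwidth, fixed_overhead):
    """Smallest n at which the GPU path is no slower, or None if it never is."""
    saving_per_item = cpu_per_item - (gpu_per_item + bytes_per_item / link_bandwidth)
    if saving_per_item <= 0:
        return None
    return math.ceil(fixed_overhead / saving_per_item)


# Hypothetical figures chosen only to illustrate the tradeoff.
PARAMS = dict(
    gpu_per_item=1e-8,       # seconds of GPU compute per item
    bytes_per_item=16,       # bytes copied to and from the device per item
    link_bandwidth=16e9,     # bytes per second over the host-device link
    fixed_overhead=50e-6,    # seconds for kernel launch and bookkeeping
)
CPU_PER_ITEM = 1e-6          # seconds of CPU compute per item

n_star = break_even_items(CPU_PER_ITEM, **PARAMS)
print(f"GPU pays off from about {n_star} items in this model")
for n in (10, n_star, 10_000):
    print(n, f"cpu={cpu_time(n, CPU_PER_ITEM):.2e}s",
          f"gpu={gpu_time(n, **PARAMS):.2e}s")
```

Below the break-even point the fixed overhead dominates and the CPU path wins. If the per-item saving is zero or negative, for instance because transfer time outweighs the compute speedup, the function returns None: no batch size makes the GPU worthwhile in this model.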
Frameworks are hiding costs that engineers need to see
Frameworks and libraries have made it possible to build sophisticated AI systems without ever thinking about what is happening in hardware. That is mostly a good thing. The abstraction accelerates development and reduces mistakes. But there is a point where abstraction stops being a benefit and starts hiding costs that need to be visible.
“Where it becomes dangerous to use too much abstraction is when it obscures what is happening with the data layout in memory and the execution patterns,” Ledin cautions. In performance-critical applications, the framework is making decisions about how data is structured and how the processor interacts with it. If the engineer does not know what those decisions are, they cannot tell when they are working against the hardware.
The practical approach Ledin recommends is a two-layer architecture. “Use the most expressive code at the edges of the system, and in the core, use more performance-aware code.” The boundary between those layers is not always obvious in advance, and finding it usually requires benchmarking rather than reasoning. But the principle is clear: abstractions are appropriate where they preserve meaning across the team, and they become a problem where they hide costs that affect the system’s ability to meet its requirements.
One specific pattern worth knowing is the array of structures versus structure of arrays tradeoff. A common data layout is an array of objects, where each object holds all the fields for one entity. For CPU cache efficiency, it can be significantly better to restructure this as a structure of arrays, where each field is stored as a separate array for all entities. “That might have a big impact on performance,” Ledin notes, because the CPU cache loads contiguous memory, and if the algorithm is operating on one field across many entities, the structure of arrays layout means each cache load is full of useful data rather than fields the algorithm is not touching.
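The two layouts can be sketched in Python using packed byte buffers, which make the memory layout explicit. This only illustrates the layouts: the field names, the entity count, and the cache-line counting helper are hypothetical, and the real performance effect would show up in a compiled language or an array library rather than in interpreted Python. The sketch sums one field across all entities and counts how many 64-byte cache lines each layout would touch to do so.

```python
import struct
from array import array

CACHE_LINE_BYTES = 64
N = 1000  # hypothetical entity count

# Array of structures: x, y, z, mass stored together, 16 bytes per entity.
RECORD = struct.Struct("<4f")
aos = bytearray(RECORD.size * N)
for i in range(N):
    RECORD.pack_into(aos, i * RECORD.size, float(i), 0.0, 0.0, 1.0)

# Structure of arrays: one contiguous array per field.
soa = {
    "x": array("f", (float(i) for i in range(N))),
    "y": array("f", [0.0] * N),
    "z": array("f", [0.0] * N),
    "mass": array("f", [1.0] * N),
}


def sum_x_aos(buf):
    """Read x from each record; y, z, and mass share the same cache lines."""
    return sum(RECORD.unpack_from(buf, i * RECORD.size)[0] for i in range(N))


def sum_x_soa(columns):
    """Read x from a dense array that holds nothing but x values."""
    return sum(columns["x"])


def lines_touched(count, stride_bytes):
    """Distinct cache lines hit when reading one 4-byte field per entity."""
    return len({(i * stride_bytes) // CACHE_LINE_BYTES for i in range(count)})


aos_lines = lines_touched(N, RECORD.size)          # stride of 16 bytes
soa_lines = lines_touched(N, soa["x"].itemsize)    # stride of 4 bytes
print(f"sums match: {sum_x_aos(aos) == sum_x_soa(soa)}")
print(f"cache lines touched: AoS={aos_lines}, SoA={soa_lines}")
```

Both layouts produce the same sum, but the array of structures spreads the x values across about four times as many cache lines, because three quarters of every line it loads holds fields the loop never reads.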
The skills that will matter when the hardware changes again
The specific technologies that matter five years from now are difficult to predict, and Ledin is honest about that. “Four years ago when the previous version of my book came out, it was not at all clear to me, or I think a lot of people, what was going to be happening with AI in the coming years,” he says. Predicting which hardware architectures or AI frameworks will dominate is not the point. Building the mental model to understand them when they appear is.
The foundational skill is the ability to reason across abstraction layers. “The way to really understand the system requires the ability to reason across all of the abstraction layers from the software framework that you are working on at the top level, all the way down to the hardware that runs the code,” Ledin underscores. That does not mean reading assembly code for every application. It means understanding how pipelines and caches work and orienting code to work within those environments rather than against them.
The other shift is heterogeneous computing. Writing code that runs on a CPU is no longer sufficient context for many engineering problems. “It is also becoming more critical to understand heterogeneous computing environments,” Ledin says. “It is not just writing code that runs on a CPU. You might also have code that interacts with the GPU if you are running a parallelized algorithm on that, whether it is a language model or something else.” Domain-specific accelerators, TPUs, RISC-V implementations, and specialized inference chips are all becoming part of the environments that production engineers have to reason about. The engineers who will be most effective in that landscape are the ones who understand why those architectures make the tradeoffs they do, not just how to call their APIs.
This article is based on Deep Engineering #46. The full issue includes additional insights from Jim Ledin on modern computer architecture and AI infrastructure.