Computer Architecture in an AI-accelerated World with Jim Ledin
On memory hierarchies, GPU mechanics, hardware abstractions, and what engineers get wrong by ignoring the hardware layer
Jim Ledin has been thinking about what happens between the instruction and the silicon for over thirty years. He is the CEO of Ledin Engineering, an expert in embedded software and hardware design, and the author of Modern Computer Architecture and Organization, now in its third edition, published by Packt. His career spans embedded systems development, battery management software for electric vehicles, and cybersecurity assessment and penetration testing for safety-critical systems including self-driving vehicles.
The third edition comes out at a moment when the architecture conversation in software engineering has narrowed almost entirely to one question: what hardware should run AI workloads. Ledin’s answer is more nuanced than the GPU consensus suggests, and it is grounded in the kind of bottom-up reasoning that most application developers have never had to apply.
This conversation covers where that consensus is incomplete, what engineers building AI systems get wrong about memory and parallelism, why abstraction layers become dangerous when they hide hardware costs, and what the architecture of a self-driving vehicle teaches you that distributed backend experience does not.
You can watch the full conversation below or read on for the complete Q&A.
Q. You have been working with embedded systems and hardware design for over thirty years. What first pulled you toward understanding what was happening at the hardware level rather than just writing code?
Jim Ledin: My first real exposure to computer architecture was in the 1980s when I had a Commodore 64 with its 6502 CPU. I wrote a simple BASIC program to do some screen drawing, basically moving a dot around the screen with the joystick and pushing the button to draw lines. And it was slow. It was so slow you could watch it moving one pixel at a time. That made it painful to do anything useful with.
As time went on I learned a little bit about 6502 assembly language. I found out there were ways you could invoke it through the BASIC interpreter. What you had to do was write out your assembly code by hand, convert it to the opcodes and data bytes, and then POKE those bytes into memory. POKE is the BASIC command. Then you could transfer control and execute them. After I took the inner loops of the drawing program and implemented them that way, the speedup was amazing. It shot the line across the screen faster than you could see. That episode really cemented for me how important it is to understand what is going on in the hardware of a system, and not just write what you want to do in your favourite language.
My current work is focused on embedded systems development and testing, as well as implementing cybersecurity for those systems and doing cyber testing on them. I have done quite a bit of work with electric vehicles and battery-powered systems, including the battery management software and the powertrain control systems. I also do cyber testing to evaluate what kinds of vulnerabilities may be present in systems and try to exploit those to demonstrate whether or not they actually exist. I have been doing that for over thirty years, embedded initially and then later adding in the cybersecurity aspect.
The architecture of computer systems is at the boundary where performance, system security, and behaviour in real-world situations all meet. You really need to verify, across all those domains, that everything works as expected and intended. That takes an understanding from top to bottom: not just what your high-level software does, but also what is happening at the hardware level. That does not necessarily mean knowing exactly what your compiler is doing, but it does mean knowing how the hardware operates, what kinds of things cause you to run into limits, and what you can do differently to improve performance, reliability, and security.
Q. Your book Modern Computer Architecture and Organization is now in its third edition. What had changed enough in the field to make a new edition necessary?
Jim Ledin: The book is intended to start at the beginning. I do not assume that readers have any background or experience with computer architecture, assembly language instructions, memory cache, pipelines, or anything like that. We start with history, where did the first computing devices come from, how were they developed. It even starts back in the 1800s when Charles Babbage designed a mechanical computer intended to be a general purpose digital computing system. It never actually got built, but many of the principles he developed, including pipelining and distributed processing, were implemented in that design. I thought it was remarkable that those concepts were being worked out that far back in history.
Then the book goes through the vacuum tube era in the 1930s and 1940s, the Intel 4004, which was really the first microprocessor, and then on to the 8086, 8088, the PC, the 386, which is basically the same base architecture that modern Intel and AMD processors in your PC and server-based systems use today. The code running on these modern systems is highly compatible with those systems from decades ago. It has gone from 16 bits to 32 bits to 64 bits, adding capabilities without removing previous ones.
The book walks through that history and then goes into detail on how processors work, starting with the 6502. That processor is simple enough that you can understand what is going on with its registers. It only has three. Nothing about it is overwhelming. Once you understand it, you can build upon it to get to the modern processors, which are far more complex.
What changed substantially since the last version of the book was the rise of AI workloads, particularly the shift from the fastest CPU available to very highly parallelised systems optimised to perform matrix computations. The new version, which came out in March, has a chapter that goes into detail on how GPUs operate, from the top-level modular structure down to the granular details of processor cores. There is another new chapter on transformer-based models, looking at them not as someone who designs them but more like a mechanic who wants to take them apart. We work through what calculations actually occur in GPT-2, which was one of the earliest large language models to break through as something genuinely new and important. The current frontier models have obviously evolved quite a bit since then, but they share many of the same fundamental characteristics. If you can go through GPT-2 and understand how it works, you are a very long way toward understanding the latest models.
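As a rough illustration of the kind of computation the book takes apart, here is a minimal NumPy sketch of the scaled dot-product attention step at the heart of GPT-2-style models. The shapes and names are toy assumptions rather than code from the book, and a real model wraps this core in learned projections, multiple heads, and many stacked layers.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # token-to-token similarity scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)                  # each token sees only earlier tokens
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # weighted mix of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))  # 4 toy tokens, one 8-dimensional head
print(causal_attention(Q, K, V).shape)                     # (4, 8)
```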
We are also seeing real diversification of architecture. There were many years where computers for most applications were based on the same Intel-type architecture, but now across different application areas you are seeing GPUs, TPUs, domain-specific accelerators for things like Bitcoin mining, local AI in cell phones and cars, and the open source RISC-V processor which is available to everybody. You can design your own chip based on it, implement it in an FPGA, do whatever you want. It is a rapidly growing line of processor development and the book covers all of it.
Q. The argument that GPUs are the right architecture for AI and LLM workloads is often treated as settled. Where is that consensus incomplete?
Jim Ledin: GPUs are probably the ideal architecture for people and small companies that want to run language models locally. I have recently gotten the Gemma 27 billion parameter model running on an Nvidia RTX 4090, which is about the top end of consumer GPUs available today. For local and personal use, GPUs are the way to go.
But for larger scale deployments running much larger models, the trend is toward dedicated TPUs. A tensor is basically a multi-dimensional array. A matrix is a two-dimensional array, and tensors have more dimensions. Tensors are used widely across AI models, and the work going on inside the processing of those models is largely matrix multiplications operating on broken-down portions of higher-dimensional tensors. A TPU is a processor similar in concept to a GPU, but very specifically focused on the work of large language model tensor processing. GPUs, as the first letter implies, also have silicon dedicated for generating real-time video and handling things like gaming and video creation. TPUs do not use silicon for that purpose. They focus everything on the tensor work.
That is why systems like the Nvidia Blackwell architecture, designed for large-scale data centre applications, are built to have many components interconnected with extremely high-speed data links, working together as a supercomputer. For larger models, consumer GPUs are not really used. It is more the dedicated hardware that focuses on that work.
Another factor is that AI workloads are becoming increasingly memory bandwidth limited. That means it is taking more time to bring data into the GPU or TPU memory than it is taking for the computation itself to complete. These very high-end systems are implemented using what is called high bandwidth memory, or HBM. An HBM module is basically a cube made of a stack of RAM chips, so they hold a lot of memory and have very high bandwidth. On a TPU card you typically have several of these HBM modules, and they have a far higher data rate for transferring data in and out of the processing components than on a typical consumer GPU. This is also part of why it is becoming hard to find DDR5 RAM chips. A lot of production capacity for memory is going into high bandwidth memory modules, which cost more for the purchaser and make more money for the vendor.
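A back-of-envelope way to see what memory bandwidth limited means, using illustrative numbers rather than the specifications of any real accelerator: just streaming a large model's weights out of memory once per generated token can take far longer than the arithmetic performed on them.

```python
# Illustrative figures only; these are assumptions, not specs for any real device.
params = 70e9                  # model parameters
bytes_per_param = 2            # 16-bit weights
hbm_bandwidth = 3.0e12         # assume 3 TB/s of HBM bandwidth
peak_matmul_flops = 1.0e15     # assume 1 PFLOP/s of matrix throughput

weight_bytes = params * bytes_per_param
t_memory = weight_bytes / hbm_bandwidth      # time to stream the weights once
t_compute = 2 * params / peak_matmul_flops   # roughly 2 FLOPs per parameter per token

print(f"memory: {t_memory*1e3:.1f} ms per token, compute: {t_compute*1e3:.2f} ms per token")
# With these assumptions the memory side dominates, which is the bandwidth limit.
```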
Q. Software engineers working in the cloud often treat hardware as someone else’s problem. What does your book argue they are getting wrong, and what does that cost them?
Jim Ledin: If you write software and just ignore the hardware limits, that can lead to a lot of hidden costs. If your code is accessing memory in inefficient patterns, not using the cache memory within the processor effectively, and moving data around more than necessary, that can have significant performance impacts.
If developers understand how the memory access and caching processes work at the hardware level, they can often tailor code to work more effectively within those constraints and minimise latency. When the CPU requests data from memory and it is not available in its cache, it has to wait. You are giving the processor downtime when you want it to be processing data. A lot of that is unavoidable, but the amount of latency can be minimised by different approaches to optimising algorithms.
As an example, in a modern PC, each time you read something from DRAM, even if it is just a single byte, 64 bytes are transferred into the CPU cache. That is what is available at that point for the processor to work with. For best efficiency, assuming you have options, you would want your code to be working with data from that block before it moves on to something else, rather than bouncing around to other memory locations. If you access several other locations that cause them to be loaded into cache, and then that first block gets evicted, and then you go back and read it again, now you have to reread it. That is inefficient. When possible, you want to work through memory in a linear way.
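A quick way to see that effect even from a high-level language, assuming a C-ordered NumPy array where each row is contiguous in memory: walking the array row by row uses every byte of each 64-byte line pulled into cache, while walking it column by column touches one value per line and keeps going back for data that may already have been evicted.

```python
import numpy as np
import time

a = np.zeros((4096, 4096), dtype=np.float64)   # C order: rows are contiguous in memory

def walk_rows(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i, :].sum()      # sequential access: whole cache lines get used
    return total

def walk_cols(a):
    total = 0.0
    for j in range(a.shape[1]):
        total += a[:, j].sum()      # strided access: one useful value per cache line
    return total

for walk in (walk_rows, walk_cols):
    start = time.perf_counter()
    walk(a)
    print(f"{walk.__name__}: {time.perf_counter() - start:.3f} s")
```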
And if you are working in a cloud environment, inefficient access patterns like that not only hurt performance but also result in higher costs, because you are paying for the system whether the CPU is actually crunching instructions or sitting idle waiting for a data item to arrive from memory.
Q. If you are building AI systems today, what are the hardware concepts that would most change how you designed them, and what do most engineers not understand well enough?
Jim Ledin: Data movement can often be more expensive than the actual computation steps. The latency of moving large data structures across different levels of the memory hierarchy can dominate and leave a lot of compute bandwidth idle. This is a concern even with the very highest performance AI-focused systems. Getting the memory access right relative to the processing is a genuine challenge. You definitely do not want to be iterating across large data structures multiple times in an algorithm if there is a way to avoid it. Going through data linearly is probably going to give the best performance.
As you increase parallelisation of algorithms across cores and processors and across GPUs and other devices, other constraints appear. Synchronisation, where tasks on different processors need to sync up, is a real constraint. The communication bandwidth between processors, whether they are inside the same device or communicating board to board or rack to rack, all of these affect the efficiency and speed of processing, not just the number of cores you can throw at a parallel algorithm. It is important to understand the cost associated with all of these interactions among parallel activities and optimise around them to get the best overall performance.
And then optimising compilers do a great job of scheduling instruction execution and keeping pipelines full, but there are things you can do in code that make it harder for them to do that, and things you can do that make it easier. In performance-critical inner loops, minimising branching can help avoid pipeline stalls. Part of what goes on in a modern processor is trying to predict what will happen at a branch in your code, an if-else type block. The processor may guess right, which means it is very efficient, or it may guess wrong and have to back up and start down the other path. If you can minimise or eliminate branching within the most performance-critical loops, that makes it easier for the optimiser and the rest of the system to run as efficiently as possible.
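As a loose analogy in high-level code, here is the same per-element decision written as a data-dependent branch in an inner loop and as a branch-free select over the whole array. The function names are just for illustration; the point is that the second form is what performance-critical kernels should look like, whatever language they are written in.

```python
import numpy as np

x = np.random.randn(1_000_000)

def clamp_with_branches(x):
    # A data-dependent if/else executed once per element.
    out = np.empty_like(x)
    for i, v in enumerate(x):
        out[i] = v if v > 0.0 else 0.0
    return out

def clamp_branch_free(x):
    # The same selection expressed as one array-wide operation, which maps to
    # predicated or vectorised instructions rather than per-element branches.
    return np.maximum(x, 0.0)

assert np.allclose(clamp_with_branches(x), clamp_branch_free(x))
```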
Q. What is actually happening under the hood in a GPU that makes it effective for AI workloads, at a level that goes beyond the standard explanation about parallelism?
Jim Ledin: Most of the processing in a transformer-based AI model, at least 80% of the execution time, consists of tensor operations, which are implemented in hardware as matrix multiplications. GPUs and TPUs have very specialised multiply-and-accumulate hardware designed specifically to perform these operations.
The current generation of Nvidia GPUs implements what is called single instruction multiple thread, or SIMT, execution. A group of 32 threads runs in lockstep, meaning they are all executing the same instruction but on different data streams. SIMT also supports branching, so you can have if-else logic in the code. But this has a performance cost. If you are executing through a stream of data in SIMT code and you come to a conditional instruction where some threads take the if part and some threads take the else part, the hardware executes one side, the if part, only on the threads where that condition applies, then goes back and executes the else part for the other threads. At the end of the block, they sync up and resume in lockstep. Your code can have conditional logic in these lowest-level operational sequences, but the drawback is that you effectively have a stall while the hardware goes back and executes the other path. You have the flexibility, but there is a cost.
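A toy model of that behaviour, written as ordinary Python rather than GPU code: one 32-lane warp handles a data-dependent branch by running each side in turn with the non-participating lanes masked off, so a divergent warp pays for two serialised passes where a uniform warp pays for one.

```python
import numpy as np

def warp_step(values, condition):
    """Model one warp executing: if condition then x * 2 else x + 1."""
    result = np.empty_like(values)
    passes = 0
    if condition.any():                          # "if" side, other lanes masked off
        result[condition] = values[condition] * 2
        passes += 1
    if (~condition).any():                       # "else" side, the remaining lanes
        result[~condition] = values[~condition] + 1
        passes += 1
    return result, passes

lanes = np.arange(32)
_, passes = warp_step(lanes, lanes % 2 == 0)     # mixed warp: both sides execute
print(passes)                                     # 2 serialised passes
_, passes = warp_step(lanes, lanes >= 0)          # uniform warp: one side only
print(passes)                                     # 1 pass, full lockstep speed
```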
GPU and TPU performance comes as much from high memory bandwidth, getting data in and out as fast as possible, and from minimising latency as it does from effective thread scheduling across the many thousands of cores within a GPU. Memory bandwidth, latency, thread scheduling, and using SIMT effectively all affect GPU performance in addition to the raw ability to parallelise across cores. You really need to manage all of these aspects to get the best performance, not just maximise core count.
Q. The memory hierarchy from cache to RAM to storage is often discussed in theory but rarely in practice. Can you give a concrete example of where a misunderstanding of memory hierarchy caused a real performance problem, and what the fix actually looked like?
Jim Ledin: There was a web server in some Linux distributions in the early 2000s called Tux, which ran in kernel space. It avoided a lot of the transfers between user space and kernel space that a web server normally has to perform. It served only static pages; because it was running in kernel space, it was not given the capability to generate pages dynamically.
One issue with this server was poor cache locality. The amount of data it kept active on each request seemed to be excessive. Under high load, with lots of users hitting it at once, the state information grew to exceed the size of the level 2 cache in the CPU. Performance dropped off sharply.
Some engineers examined that and determined that, by evaluating the cache limitation against how the code was structured, they could reorganise it so the amount of data per request would be smaller and would therefore remain within the cache limit up to a much higher level of usage. Similarly, the CPU also has an instruction cache, and by reorganising the processing and batching some things, they were able to increase the degree to which instructions remained in the instruction cache during web server processing. The fixes they implemented increased application performance by about 40%.
This was basically examining the behaviour of the application in the context of the limitations of the processor hardware and coming up with solutions that respected those limits. For other applications, similar fixes might involve restructuring data. A large array of structures might be more efficiently processed as a structure of arrays in a way that better aligns with cache limitations. But in these cases, while the design approach is to look at the limits of the system and try to work within them, to really understand what is going to have a big impact you need to implement it and benchmark it in a realistic environment.
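A minimal sketch of that array-of-structures versus structure-of-arrays choice, with hypothetical field names: when a pass only needs one field, the structure-of-arrays layout reads just the bytes it uses, while the interleaved layout drags every record's unused fields through the cache. Whether the difference matters for a given workload is exactly the kind of thing to confirm by benchmarking.

```python
import numpy as np

n = 1_000_000

# Array of structures: x, y, z are interleaved record by record in memory.
aos = np.zeros(n, dtype=[("x", np.float64), ("y", np.float64), ("z", np.float64)])

# Structure of arrays: each field is its own contiguous block of memory.
soa_x = np.zeros(n)
soa_y = np.zeros(n)
soa_z = np.zeros(n)

# A pass that only needs "x":
total_aos = aos["x"].sum()   # strided reads that skip over y and z in every record
total_soa = soa_x.sum()      # purely sequential reads over one contiguous array
print(total_aos == total_soa)
```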
Q. There is a growing tension between the abstraction layers that frameworks provide and the hardware cost those abstractions hide. At what point does that become a serious engineering problem?
Jim Ledin: Early on in the development cycle, abstractions are great. They can greatly accelerate development and limit mistakes. Where it becomes dangerous is when abstraction obscures what is happening with the data layout in memory and the execution patterns, basically how the processor is interacting with data as the algorithm proceeds. This is especially critical in large-scale real-time systems with demanding performance requirements.
In addition to using abstraction where it makes sense, engineers need to understand what is happening underneath the abstraction in performance-critical applications. I am not suggesting abandoning abstractions. They are entirely appropriate at the level where they preserve meaning and understanding across a team. But they begin to create a problem when they obscure the costs.
The most effective approach is a two-layer design. Use the most expressive code at the edges of the system, and in the core, use more performance-aware code. It is not always obvious where to place the boundary between performance-aware code and more expressive code. It may take some benchmarking, trials, and iterations to identify the best location for that boundary. But knowing you need to draw it is the starting point.
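A small sketch of what that two-layer split can look like, with hypothetical function names: an expressive outer layer that validates and converts inputs, and a tight numeric core that makes one linear pass over contiguous data with no per-element logic at the high level.

```python
import numpy as np

def moving_average(series, window):
    """Expressive edge: accepts any sequence, checks arguments, converts types."""
    data = np.asarray(series, dtype=np.float64)
    if not 1 <= window <= data.size:
        raise ValueError("window must be between 1 and the series length")
    return _moving_average_core(data, window)

def _moving_average_core(data, window):
    """Performance-aware core: one linear pass using a running cumulative sum."""
    csum = np.cumsum(data)
    csum[window:] = csum[window:] - csum[:-window]
    return csum[window - 1:] / window

print(moving_average([1, 2, 3, 4, 5], 2))   # [1.5 2.5 3.5 4.5]
```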
Q. You work on architectures for systems like self-driving vehicles. What makes those architectures fundamentally different from a standard distributed backend system, and what should engineers working in conventional contexts take from that?
Jim Ledin: A self-driving vehicle is both real-time and safety-critical. The software must meet all of its deadlines, its time limits for producing a response. Missing a deadline is not just a glitch to blow past; it is a system failure, and that cannot be tolerated. There must be fail-safe responses when unexpected situations occur. Only in the most extreme circumstances, like an unrecoverable hardware failure, would the system stop processing altogether.
A self-driving vehicle tightly couples sensing, computing, and actuating, seeing what is around it, deciding what to do, and steering and controlling vehicle speed. That is pretty different from loosely coupled distributed systems. A distributed system might typically implement retry mechanisms if something fails, and if a system goes down there are online and offline redundant capabilities that can be brought up, basically switching to a backup. Rather than using that approach, safety-critical vehicles provide a level of redundancy where dual processors operate in lockstep. If one experiences a failure, the system continues on the one good one until a repair can be made.
This can be extended further. The American space shuttles had three computers operating in parallel. One advantage of three over two is that if you have two computers and they give different answers, you have to decide which one is good and which is bad. If you have three and two give one answer and one gives another, you can be fairly confident that the one which disagrees is the bad one.
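A toy majority voter in that spirit, illustrative only: real voting logic in flight systems compares values within tolerances and has to flag and isolate the channel that disagreed, but the two-out-of-three decision at its core looks like this.

```python
from collections import Counter

def vote(a, b, c):
    """Return the majority answer from three redundant channels."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise RuntimeError("no majority: all three channels disagree")

print(vote(42, 42, 17))   # 42 -- the channel reporting 17 is treated as the failed one
```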
The way engineers working with conventional distributed systems can apply these principles is in situations where the design needs to be fault tolerant while minimising or eliminating processing interruptions. Rather than waiting for a failure to occur, you have enough running capability in operation simultaneously that you can detect a failure and keep things going the whole time while bringing up redundant capability. A lot of large systems already operate this way, but systems that do not could potentially deliver a higher availability level using these techniques.
Q. For engineers who want to build real working knowledge of systems and hardware, what is the most direct path in?
Jim Ledin: Start by understanding how processors operate at the simplest level of a single instruction. There are four steps in an instruction: fetch, decode, execute, and write back. Fetch is when the processor retrieves the opcode bytes from memory. Decode is when the processor assigns work to units within it, like an ALU for an addition instruction. Execute is when it actually does the computation. And write back is where it stores the results in registers, in memory, and in status bits within the processor. Essentially all processors operate at that very low level.
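Here is a toy loop in that spirit, for a made-up accumulator machine rather than any real instruction set, just to make the four steps concrete.

```python
# Made-up two-instruction machine: ADD <operand> and HALT.
memory = [0x01, 5, 0x01, 7, 0xFF]        # ADD 5; ADD 7; HALT
acc, pc = 0, 0                           # accumulator register and program counter

while True:
    opcode = memory[pc]                  # fetch: read the opcode from memory
    if opcode == 0xFF:                   # decode: HALT takes no operand and produces no result
        break
    operand = memory[pc + 1]             # fetch the operand byte
    result = acc + operand               # execute: the ALU does the addition
    acc = result                         # write back: store the result in a register
    pc += 2                              # advance to the next instruction

print(acc)                               # 12
```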
That mental model then scales upward to more complex processors and their capabilities. That is the reason the book starts with the 6502 processor. It is pretty simple, with only three 8-bit registers; nothing about it is overwhelming. But once you understand it, you can build upon that knowledge to get to the modern processors, which have hundreds of instructions, if not more, and many diverse capabilities. It all builds upon those very simple foundations.
Q. Looking ahead five years, what skills will matter most for engineers working at the intersection of software, hardware, and AI?
Jim Ledin: The most important thing is to stay up to date and remain aware of changes as technology advances. Four years ago when the previous version of the book came out, it was not at all clear to me, or I think a lot of people, what was going to happen with AI in the coming years. Pay attention to what is going on around you. Pay less attention to announcements driven by financial considerations or hype from companies focused on their performance in the stock market. Pay more attention to what is actually having an impact in the real world, and learn more about those things.
The sources matter. There are trustworthy websites with genuinely good information about current activities in CPU development and other computer-related areas, as well as more in-depth sources like scientific papers if you are willing to dig in at that level. Pursuing formal education can help too. That does not necessarily mean going back to college; it could mean taking online courses to develop depth in areas where you might be behind. Certificate programs can be a real path to updating your skills.
Today, the thing is AI. Developers do not just learn programming languages anymore. You need to be learning how to interact with AI and use it effectively to develop better software. The way to really understand these systems requires the ability to reason across all of the abstraction layers, from the software framework at the top level all the way down to the hardware that runs the code. You do not need to break out the assembly code generated by your build tools, though that is sometimes valuable and can be very helpful, either for learning purposes or if you are really in a hot inner loop that needs maximum optimisation. More often it is about understanding the constraints, how the processor works best with pipelines and caches, and orienting your code to work within those environments.
It is also becoming increasingly critical to understand heterogeneous computing environments. It is not just writing code that runs on a CPU. You might have code that interacts with a GPU for a parallelised algorithm, whether it is a language model or something else. And there are specialised accelerators that may be implemented within large-scale systems that speed up specific parts of the operation. There is a lot to learn, and it takes curiosity and sustained attention to stay current.
Q. How would you explain the CPU versus GPU distinction to a senior software engineer who has never had to care about it before?
Jim Ledin: A CPU is optimised for low-latency execution of complex branching code. Branches do have an impact on performance, but CPUs are designed to handle that and minimise it. GPUs work best with highly parallelised, high-throughput execution of linear code, operating on massively parallel workloads. GPU cores work best when they are going through parallel streams with minimal branching.
If you are developing an algorithm and you are not sure whether it should run on a CPU or be split between a CPU and a GPU, the GPU only really becomes attractive when you have enough work that can be parallelised, and enough that it will amortise the costs associated with moving data onto the GPU, launching the kernels to execute the code, and doing the management work to transfer results back.
The GPU is really not a general purpose computer. It is more of a specialised device that needs to be managed by something else. You cannot write a program that just runs on a GPU. It needs to be started and managed from a CPU, and you need to get enough benefit from the work you are doing to make all of that worthwhile. If you cannot keep the GPU busy with this kind of work, the CPU implementation may actually win, because it avoids the data transfer and scheduling overhead entirely.
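One way to make that call concrete is a back-of-envelope estimate before writing any GPU code. The numbers below are assumptions for illustration, not measurements of any particular CPU or GPU; the question is simply whether the compute time saved outweighs the transfer and launch overhead.

```python
# All figures are illustrative assumptions, not measurements.
data_bytes = 400e6            # input plus output data that must cross the bus
bus_bandwidth = 16e9          # assume ~16 GB/s effective host-to-device transfer rate
launch_overhead = 0.001       # assume ~1 ms of kernel launch and scheduling overhead
cpu_time = 0.08               # assume the CPU finishes the computation in 80 ms
gpu_speedup = 20              # assume the GPU kernel itself runs 20x faster

transfer_time = data_bytes / bus_bandwidth
gpu_time = transfer_time + launch_overhead + cpu_time / gpu_speedup
print(f"CPU: {cpu_time*1e3:.0f} ms, GPU including transfers: {gpu_time*1e3:.0f} ms")
# If the compute is small relative to the data moved, the CPU version can win outright.
```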