Designing for Scale and Resilience: A Conversation with Dhirendra Sinha and Tejas Chopra
Two engineering leaders from Google and Netflix reflect on scaling under pressure, designing for failure, and how they assess real-world thinking in system design interviews.
From autoscaling AI agents to ensuring observability in microservices, system design today demands far more than textbook patterns. In this conversation, we speak with Dhirendra Sinha and Tejas Chopra—authors of System Design Guide for Software Professionals—about what Big Tech gets right about scalability, why foundational principles still matter in the age of AI, and how engineers can grow by mastering trade-offs rather than memorising solutions.
Dhirendra is a Software Engineering Manager at Google with two decades of experience designing complex distributed systems across startups and global tech companies. He actively mentors teams on scaling architecture and has taught system design for over seven years. Tejas is a Senior Engineer at Netflix, where he focuses on machine learning and large-scale platform architecture. He’s also a Tech 40 under 40 awardee and a frequent speaker on systems, scale, and reliability.
You can watch the full interview below—or read on for the complete transcript.
1. What inspired you both to write System Design Guide for Software Professionals? What key concepts or gaps did you aim to address?
Dhirendra Sinha:
I’ve been in the industry for more than two decades—working at startups and big companies, designing complex, large-scale systems. I started teaching system design about seven or eight years ago, and there were two main reasons for that. First, I wanted to give back to the community. Second, to be honest, I was being a little selfish—moving into management was taking me away from core technology, and teaching allowed me to stay connected to it.
I’d always thought about writing a book, but never quite had the time or courage to get started. When Packt reached out—actually, one of my mentors recommended me to them—I saw it as a great opportunity. But I also insisted that they find a co-author. I didn’t want to go through the whole thing alone. Tejas was recommended, and we clicked right away. He’s been a fantastic collaborator.
The main motivation for writing the book was to go deeper into system design concepts, especially for senior candidates preparing for interviews. System design is a crucial part of the interview process, not just for getting an offer but also for determining the level you're hired at. That's what we wanted to focus on.
Tejas Chopra:
First off, thank you to Dhirendra for those kind words. He’s been an incredible co-author and collaborator—I couldn’t have asked for a better partner on this project.
Like Dhirendra, I’ve been in the industry for a while, working at companies like Box and Netflix, where I’ve seen the challenges of scaling software systems. When systems aren’t architected properly, they can fail in very unexpected ways. And I’ve seen senior engineers struggle with bridging the gap between coding and designing scalable systems.
That was the initial motivation—to explore system design not just as a set of interview questions, but as a foundational discipline. We didn’t want to write another book focused solely on interview prep. We wanted to create a reference guide that explains why certain architectural decisions are made and how to think about systems that go beyond the interview room—into real-world implementation in scalable organizations.
Our goal was to demystify the core principles of distributed systems. We could have covered more ground, sure, but we saw this book as a solid foundation to build on. That’s what motivated us to write yet another book on system design.
2. In your book, you've laid down foundational principles of system design. How do you see these principles interfacing with the rapid advancements in AI, such as AI-driven automation and operations in system design?
Tejas Chopra:
That's a great and very timely question. We’re in an era where everything is about AI agents and the tools they enable. But foundational system design principles—like scalability, reliability, and efficiency—are remarkably timeless. In fact, the rise of AI only reinforces the importance of these principles.
At companies like Netflix and Box, we’ve seen the emergence of AI Ops—using AI for operational efficiency and scale. These intelligent systems automate tasks like autoscaling, anomaly detection, and self-healing. In that sense, AI is helping us build more scalable and resilient systems.
That’s one side of the puzzle. The other side is: how do you design systems to support and scale AI agents themselves? That’s an active challenge in the industry. For example, OpenAI recently released Ghibli-style image generation—and within a day, Sam Altman tweeted asking users to pause because the GPUs were overwhelmed. That’s a system design issue. It shows the architecture didn’t fully account for the parallelization needs of those models. So there’s clearly an opportunity to design better systems there.
AI and system design aren’t in conflict—they complement each other.
Dhirendra Sinha:
I’m on the same page. System design principles are foundational, and they don’t go away in the AI era. Whether we’re talking about automating workloads or improving scalability and reliability, all of it is still grounded in sound architecture.
AI can optimize, recommend, and augment, but it can’t compensate for a poorly thought-out architecture. If the foundation isn’t strong, the system will be brittle—no matter how much AI you throw at it.
For example, I worked on predictive autoscaling systems before this current wave of AI, and they were incredibly useful. But even then, if the architecture wasn’t right, we’d hit limits quickly. That’s why we need a solid foundation for AI to work its magic. In some ways, AI actually exposes the gaps in your system design—because it optimizes one thing while leaving the underlying structure vulnerable.
So yes, AI is powerful, but it makes mastering the fundamentals of system design even more critical.
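Dhirendra mentions predictive autoscaling. Whether predictive or reactive, most autoscalers bottom out in a simple target-utilization rule of the shape documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count in proportion to the gap between observed and target utilization. Here is a minimal sketch of that baseline; the bounds and numbers are illustrative, not from any particular system.

```python
import math

def desired_replicas(current_replicas: int, current_util: float, target_util: float,
                     min_replicas: int = 2, max_replicas: int = 100) -> int:
    """Scale so that per-replica utilization moves toward the target, within hard bounds."""
    if current_util <= 0:
        return min_replicas
    proposed = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, proposed))

# A sudden spike: per-replica utilization jumps from 60% to 180% against a 65% target.
print(desired_replicas(current_replicas=10, current_util=0.60, target_util=0.65))  # stays at 10
print(desired_replicas(current_replicas=10, current_util=1.80, target_util=0.65))  # scales to 28
```

Predictive systems layer forecasting on top of this, but as Dhirendra notes, no amount of forecasting rescues an architecture that cannot actually add capacity when the formula asks for it.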
3. Drawing from your experiences at companies like Netflix and Google, what are the current challenges in system design today?
Dhirendra Sinha:
There are a few challenges that stand out based on my time at Google, Yahoo, and other companies.
The first is scalability under unpredictable load. When you have global users accessing services 24/7, systems need to handle sudden spikes in usage without overprovisioning. Even with well-tuned prediction models, unexpected traffic can still overwhelm systems. That's a persistent challenge.
The second is the trade-off between data consistency, performance, and availability. While we've made progress here, it remains a key issue. In interviews and real-world scenarios alike, the question often is: how do you balance these trade-offs? It’s a constant juggling act.
Third is security and privacy at scale. It’s one thing to design secure systems for small user bases—but at scale, it gets tricky. You need to balance encryption, access controls, and compliance without compromising performance. With the rise of new threats and regulations, this is getting harder.
And finally, AI introduces new uncertainties. We're still figuring out how to co-exist with it in our systems, so that’s a space full of evolving challenges.
Tejas Chopra:
I completely agree with everything Dhirendra said. These are the same types of challenges I’ve seen at Box and Netflix.
For example, we once had a live-streaming event where we expected a certain number of users—but ended up with more than three times that number. Our systems struggled—not because we didn’t design them properly, but because we had made assumptions about dependencies. In distributed systems, you don’t own all the parts—you depend on external systems. And if one of those breaks under load, the whole thing can fall apart.
Another major challenge is observability. At Netflix, we have a huge number of microservices. When something goes wrong, we need to know which service called what, when, and with what parameters. There are tools that help with this, but at our scale—with lots of asynchronous code—stitching together a timeline can still be very difficult.
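The stitching problem Tejas describes is usually attacked by propagating a correlation or trace ID with every request, so that log lines from different services can be joined into one timeline afterward. Here is a minimal sketch of that idea using Python context variables; the playback and recommendation services are hypothetical stand-ins, not Netflix code.

```python
import asyncio
import contextvars
import logging
import uuid

# One ID per inbound request, visible to every coroutine handling that request.
request_id: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="-")

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("trace-sketch")

def log_event(service: str, message: str) -> None:
    # Every log line carries the request ID, so a timeline can be stitched later.
    log.info("[req=%s] %s: %s", request_id.get(), service, message)

async def recommendation_service(title: str) -> str:
    log_event("recommendations", f"ranking candidates for {title!r}")
    await asyncio.sleep(0.01)  # stand-in for a downstream call
    return f"similar-to-{title}"

async def playback_service(title: str) -> None:
    log_event("playback", f"starting {title!r}")
    recs = await recommendation_service(title)
    log_event("playback", f"got recommendations: {recs}")

async def handle_request(title: str) -> None:
    request_id.set(uuid.uuid4().hex[:8])  # assign the ID once, at the edge of the system
    await playback_service(title)

asyncio.run(handle_request("The Crown"))
```

Real tracing systems add sampling, span timings, and cross-process propagation, but the core discipline is the same: attach the identifier at the edge and never drop it across asynchronous hops.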
Then there’s cost efficiency. With the number of users and content we handle, we need systems that are not just scalable but also cost-efficient and extensible. And of course, data privacy and security remain front and center—especially as AI capabilities expand. We’re constantly trying to balance innovation with the guarantees we’ve traditionally offered.
4. What are some best practices Big Tech applies to approach scalability and system robustness?
Tejas Chopra:
That’s a great question. At Netflix—and I’ve seen this at other big tech companies too—one of the most important principles is designing for failure. We assume worst-case scenarios from the beginning and operate with the mindset that everything will fail eventually.
This thinking led to the creation of Chaos Monkey, a tool Netflix built to randomly shut down services in production to test system resilience. The idea is to ensure that even when something goes down, the system continues functioning smoothly with backup responses in place.
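As an illustration of the idea rather than Netflix's actual tool, here is a minimal chaos-style sketch: randomly terminate instances from a pool while respecting a safety floor, then verify the service keeps meeting its SLOs. The `InstancePool` class is a hypothetical stand-in for a real orchestration API.

```python
import random
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chaos-sketch")

class InstancePool:
    """Hypothetical stand-in for a real orchestration API (e.g. an autoscaling group)."""
    def __init__(self, instance_ids):
        self.instance_ids = list(instance_ids)

    def terminate(self, instance_id):
        # In a real system this would call the cloud provider or orchestrator.
        self.instance_ids.remove(instance_id)
        log.info("terminated %s", instance_id)

    def healthy_count(self):
        return len(self.instance_ids)

def run_chaos_round(pool, kill_probability=0.1, min_healthy=2):
    """Randomly terminate at most one instance per round, never dropping below a safety floor."""
    if pool.healthy_count() <= min_healthy:
        log.info("skipping round: only %d healthy instances", pool.healthy_count())
        return
    if random.random() < kill_probability:
        victim = random.choice(pool.instance_ids)
        pool.terminate(victim)
        # A real experiment would now assert that user-facing SLOs still hold.

if __name__ == "__main__":
    pool = InstancePool([f"i-{n:04d}" for n in range(6)])
    for _ in range(20):
        run_chaos_round(pool, kill_probability=0.3)
    print("remaining instances:", pool.instance_ids)
```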
Another best practice is continuous automation—automating routine jobs wherever possible, backed by strong monitoring and observability. We also rely heavily on auto-scaling and use caching and well-defined boundary conditions. For instance, being very explicit about how many users or requests your current architecture can support is essential. Systems often fail because someone made unrealistic assumptions.
Finally, there’s gradual rollout. When we deploy something new—like a recommendation algorithm—we don’t release it to the entire user base. We target specific cohorts first, observe the impact, and then gradually expand. This applies not just to algorithms but to any system-wide change. We define availability zones and roll out to subsets before scaling globally. It’s a critical way to reduce blast radius.
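A common way to implement that kind of cohort-based rollout is a stable hash of the user ID, so a user's bucket never flaps between deploys. A minimal sketch follows, with illustrative names rather than any specific feature-flag system.

```python
import hashlib

def rollout_bucket(user_id: str, feature: str) -> float:
    """Map (user, feature) to a stable value in [0, 1] so cohort membership doesn't flap."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF

def is_in_rollout(user_id: str, feature: str, percent_enabled: float) -> bool:
    """Enable the feature for a fixed slice of users; widen the slice as confidence grows."""
    return rollout_bucket(user_id, feature) < percent_enabled / 100.0

# Example: start at 5%, then expand to 25%, 50%, 100% as metrics stay healthy.
for pct in (5, 25, 50, 100):
    enabled = sum(is_in_rollout(f"user-{i}", "new-recsys", pct) for i in range(10_000))
    print(f"{pct:>3}% target -> {enabled / 100:.1f}% of sampled users enabled")
```

Because the hash is deterministic, widening the percentage only ever adds users to the cohort, which keeps the blast radius of each expansion predictable.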
Dhirendra Sinha:
Yeah, I remember when I first heard about Chaos Monkey—it really caught my attention. The idea of planning for failure instead of assuming perfection is exactly right. At scale, even a “once in a million” corner case happens every hour. So you can’t afford to dismiss edge cases.
At Yahoo, I remember someone saying, “This case is so rare, it’s not a big deal,” and the chief architect replied, “One in a million happens every hour here.” That’s what scale does—it invalidates your assumptions.
We also emphasise fault tolerance and horizontal scalability—the system should recover quickly when a failure occurs. I encourage my team to think not just about launching features, but about landing them. It’s not just about shipping code; it’s about how that feature behaves in production under different loads and operational scenarios.
We also focus on automating everything we can—not just deployments, but also alerts. And those alerts need to be actionable. We rely heavily on infrastructure as code—using tools like Terraform and Kubernetes—where you define the desired state, and the system self-heals or evolves toward that.
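The desired-state idea behind tools like Terraform and Kubernetes boils down to a reconciliation loop: compare what you declared with what is actually running, and correct the drift. Here is a minimal sketch, with hypothetical `get_actual_replicas` and `scale_to` calls standing in for a real orchestrator API.

```python
import time

desired_state = {"web": 4, "worker": 2}   # replicas we declared in configuration

actual_state = {"web": 4, "worker": 1}    # stand-in for what the cluster currently reports

def get_actual_replicas(service: str) -> int:
    # Hypothetical: a real controller would query the orchestrator's API here.
    return actual_state.get(service, 0)

def scale_to(service: str, replicas: int) -> None:
    # Hypothetical: a real controller would issue a scaling request here.
    print(f"scaling {service}: {actual_state.get(service, 0)} -> {replicas}")
    actual_state[service] = replicas

def reconcile_once() -> None:
    """Compare desired vs. actual state and correct any drift (the self-healing step)."""
    for service, want in desired_state.items():
        have = get_actual_replicas(service)
        if have != want:
            scale_to(service, want)

if __name__ == "__main__":
    for _ in range(3):   # real controllers run this loop continuously
        reconcile_once()
        time.sleep(0.1)
```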
So, whether it’s designing for failure, observability, or thoughtful automation—these are all key to building robust systems at scale.
5. During the interviews you conduct for hiring within system design roles, what do you look for in candidates? What are some of the most important questions you ask, and why?
Dhirendra Sinha:
System design interviews aren’t really relevant for fresh graduates, whose interviews focus more on coding and algorithms. But after three to five years of experience, system design becomes an important part of the interview process.
As a hiring manager who has interviewed and hired many engineers, I look for structured thinking—how someone approaches an unfamiliar problem. Unlike algorithmic questions, which often have a right or optimal answer, system design is messy. You can take it in many directions, and that’s what makes it interesting.
So I’m evaluating how candidates communicate, how they break down complexity, and how aware they are of scalability and resilience. For more senior candidates, I deliberately introduce ambiguity to see how they drive the conversation and navigate trade-offs.
It’s easy to choose between good and bad solutions, but senior engineers often have to choose between two good options. That’s a much harder decision. I want to hear their reasoning: Why did you choose this approach? What trade-offs did you consider?
Of course, I can’t share exact questions, but even common ones—like designing WhatsApp or Uber—can be taken in many directions. As an interviewer, I can add twists or constraints and see how candidates adapt. I also ask about their real-world experience—the trade-offs they made in past projects, how they handled failure, and what they learned.
At the end of the day, I’m looking for engineers who can design scalable, resilient systems and make practical decisions under real-world constraints. I want to see that they’ve already made mistakes, learned from them, and are bringing those lessons with them.
Tejas Chopra:
I completely agree with everything Dhirendra said. Structured thinking and trade-off analysis are critical—and system design interviews are a great reflection of what you actually do on the job.
Personally, I try to keep the interview conversational. I want the candidate to challenge some of my assumptions. For example, they should ask: How many users are we designing for? What are the must-haves vs. nice-to-haves? That kind of questioning is exactly what happens in real projects.
I also intentionally keep the problem broad. It gives me insight into how candidates choose to scope it. If they go too wide, they may not go deep enough. If they keep it narrow, we can dig deep into the design decisions and trade-offs.
For instance, if they suggest using a database, I’ll ask: SQL or NoSQL? Why? That tells me how they compare options and what factors influence their choices. I might then say: What if you had 10 times more users? Would your choice change? Why or why not?
I also like to explore topics like consistency—eventual vs. strong—and see how well they understand the CAP theorem. Many candidates don’t fully grasp how consistency, availability, and partition tolerance interact in real-world systems, so that’s something I focus on.
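One place this trade-off becomes concrete is in quorum-replicated stores. With N replicas, a read quorum of R, and a write quorum of W, reads are guaranteed to observe the latest write only when R + W > N; smaller quorums buy availability and latency at the cost of consistency. A small sketch of that arithmetic, not any particular database's API:

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """With N replicas, read sets of size R and write sets of size W overlap iff R + W > N."""
    return r + w > n

configs = [
    ("read-one / write-all", 5, 1, 5),
    ("majority quorums",     5, 3, 3),
    ("read-one / write-one", 5, 1, 1),   # fast and available, but only eventually consistent
]

for name, n, r, w in configs:
    mode = "strong" if is_strongly_consistent(n, r, w) else "eventual"
    print(f"{name:22} N={n} R={r} W={w} -> {mode} consistency")
```

Candidates who can explain why they would pick majority quorums for a payments ledger but read-one/write-one for a view counter are demonstrating exactly the kind of reasoning both authors describe.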
6. What are common areas where candidates struggle, and what advice would you give them?
Tejas Chopra:
One of the biggest issues I see is that candidates don’t spend enough time rightsizing the problem. As soon as they hear the prompt, they jump straight into drawing diagrams or proposing solutions, without fully understanding the functional and non-functional requirements. The first 5–10 minutes should be spent asking clarifying questions—what exactly are we designing, what are the constraints, what assumptions can we make?
Another common pitfall is lack of structure. Candidates often jump from one part of the system to another without a coherent plan. They might go deep into one component while ignoring others, which makes the conversation feel scattered. I’ve also seen a fair amount of overengineering—getting lost in unnecessary complexity instead of building a simple, workable baseline and refining from there.
The strongest candidates take a progressive approach: they build a basic version of the system that satisfies core requirements, and then layer on complexity as needed. That’s a sign of someone who can handle real-world system evolution.
Dhirendra Sinha:
Tejas is absolutely right. The biggest mistake I see is people not clarifying the problem or scoping it down properly. They hear something familiar, get excited, and dive into design mode—often without even confirming what they’re supposed to be designing. In both interviews and real life, that’s dangerous. You could end up solving the wrong problem.
Another challenge is around non-functional requirements, especially as you become more senior. People often don’t understand the nuances of consistency and availability trade-offs. As Tejas mentioned, the CAP theorem is widely misunderstood. It’s not always about choosing one model for the whole system—some parts may require strong consistency, others can be eventually consistent. Candidates often fumble when we dig into those distinctions.
I also pay close attention to how candidates explain their choices. If there are two good options on the table, I want to hear why they picked one over the other—not just because they’re familiar with it. I’m not expecting perfect answers. Even a slightly incorrect choice is fine, as long as it’s backed by clear reasoning.
Two final tips:
Time management—don’t get stuck in the weeds early on and run out of time for the core design. Practice mock interviews and pace yourself.
Listen to the interviewer’s cues—if we hint that it’s time to move on or steer toward a particular area, that’s your chance. We want to help you succeed, but if you miss the cues, we can’t evaluate you properly.
7. Beyond job interviews, why should software professionals invest in learning about system design? How does this knowledge impact their career development and capabilities as engineers?
Dhirendra Sinha:
That’s an excellent question. A lot of people revisit system design only when they're preparing for interviews—which is understandable. But the truth is, having a strong grasp of system design concepts pays off in many areas of your career.
It shows up in promotions, in how you evaluate and write design documents, and in how you participate in architectural discussions. When you understand the fundamentals deeply, you’re able to ask sharper questions and make more informed technical decisions—whether it’s choosing a framework, debugging an issue, or planning a rollout.
System design also improves your technical communication. As you grow more senior, it’s not just about coding—it’s about presenting your ideas clearly and convincingly. That’s a huge part of leadership in engineering.
There’s also a strategic planning angle. If you’re asked to estimate a project and you don’t understand the system design implications, you’re just guessing. Your estimates won’t be grounded in reality. Understanding architecture gives you a structured way to think about time, effort, and trade-offs.
So yes, you absolutely need to go deep into system design—not just to crack interviews, but to succeed and grow in a technical career. It makes you more effective, more in demand, and more valuable.
Tejas Chopra:
I completely agree—system design is almost a way of life for senior engineers. It’s how you continue to provide value to your team and your organization.
To me, it’s similar to learning math in school. You may not use the Pythagorean theorem directly every day, but learning it shapes how your brain thinks. System design works the same way. It trains you to think from first principles, to understand the why behind technical decisions.
Many of the systems we’ll be working on in the next 10–20 years haven’t even been built yet. But if you have a solid foundation in system design, you’ll be ready to build and scale those systems when the time comes.
We're also at an inflection point—AI agents are evolving rapidly, and it's becoming even more critical to design systems that can support them. You’ll become the go-to person when your organization is trying to figure out how to scale, optimize, or integrate new capabilities.
And one more point—seniority isn’t about writing complex code. It’s about simplifying complex systems and communicating them clearly. That’s what separates great engineers from the rest. System design plays a huge role in developing that capability.
8. Looking back at the first edition of your book, are there any topics you plan to expand or enhance in future editions based on recent tech advancements?
Tejas Chopra:
That’s a great question—and one Dhirendra and I have discussed before. So much has changed recently that we could have easily turned this into a multi-volume book.
One area we’d definitely expand on is how AI and ML systems are reshaping system design. For instance, I was recently working with the Model Context Protocol (MCP), which allows AI agents like ChatGPT to securely interact with data sources like Stripe or Slack. That kind of infrastructure brings new architectural challenges—and opportunities—that we didn’t cover in the first edition.
Another area is observability. We touched on it, but there’s much more to say about how large-scale systems monitor health, track events, and debug issues in production.
And then there’s infrastructure as code—an increasingly important part of designing for the cloud. Tools like Terraform let you define your entire infrastructure, including security and compliance primitives, in version-controlled scripts. We didn’t get to explore that as deeply as we would have liked.
So yes, those would definitely be top priorities for the next edition.
Dhirendra Sinha:
Absolutely—AI is a big one. Understanding how AI integrates with system design is critical, and there’s so much we could add on that front.
Another area we’d expand is data privacy and security in distributed systems. In the first edition, we touched on identity and access control—role-based or attribute-based—but there’s a lot more we could explore, especially with evolving regulations and increasingly sophisticated threats. A comprehensive section on designing privacy-preserving systems would be really valuable.
We also discussed adding a chapter on sustainability and green software engineering. The rise of generative AI has raised serious concerns about energy use. For example, when OpenAI released their Ghibli-style image generator, it quickly hit GPU limits. That’s not just a capacity issue—it’s also an opportunity to ask whether we're using resources wisely.
There’s a lot of scope to think about energy-efficient system design—how we reuse components, minimize compute waste, and design more sustainably. That’s a conversation we’d love to bring into future editions.
To explore the topics covered in this conversation in greater depth—including practical guidance on designing for scale, handling real-world trade-offs, and navigating system design interviews—check out System Design Guide for Software Professionals by Dhirendra Sinha and Tejas Chopra, available from Packt.
Here is what some readers have said: