Designing for Decades: A Conversation with Alexander Kushnir on Longevity, Maintainability, and Embedded Systems at Scale
A MedTech systems engineer unpacks what it means to build software that must survive regulatory cycles, hardware obsolescence, and engineering turnover.
In safety-critical domains, code longevity isn’t a nice-to-have—it’s a baseline constraint. Software must coexist with hardware for ten years or more, while withstanding evolving standards, team turnover, and limited upgrade paths. In this Deep Engineering Q&A, we ask industry veteran Alexander Kushnir about the realities of building and maintaining embedded systems that endure. We explore long-term technical debt, the discipline of software rejuvenation, and why modern C++ idioms are reshaping how engineers think about embedded maintainability.
Alexander Kushnir is a principal software engineer at Johnson & Johnson MedTech, specializing in electrophysiology systems. With about 20 years of experience across medical devices, industrial controllers, and networked embedded platforms, he has worked on everything from motion control firmware and network switches to VoIP and medical device software. His core expertise lies in embedded Linux, modern C++, cross-platform development, and HW/SW integration. He has also built and led a two-day workshop on CMake.
1: How do you approach the challenge of managing architectural technical debt in systems with 10+ year hardware lifecycles, especially in regulated environments where major refactoring or redesign is costly and risky?
Alexander Kushnir:
Technical debt is a very real problem. However, there are several strategies we can follow to mitigate it:
Build modular software: This strategy pays off again and again. It lets us isolate a specific piece of functionality, which makes the task of “replacing the wheel on a moving car” easier.
“Divide and conquer”: Separate your application logic from the hardware-dependent logic. The benefit is that you can run the hardware-independent logic on its own (for instance, in a simulator or against software mocks that simulate hardware behavior); see the sketch after this list.
Test, test, test: If you follow the previous advice, you should be able to test the logic on your development PC, not just on your target. Why is that good? You can write and run your unit tests with much shorter cycles (think compiling, loading, and debugging, all on your PC instead of on the device).
Use industry-standard and up-to-date tools: Even though it is not a hard requirement, tools keep evolving. If you fall too far behind, then when you eventually need to investigate an issue in the field you may find yourself forced to use newer tools you have never worked with, leaving you at a disadvantage.
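To make the second and third points concrete, here is a minimal sketch of that split, with a hypothetical temperature sensor standing in for the real hardware. The application logic depends only on an interface, so it builds and runs on a development PC against a software mock; the class and threshold names are illustrative only.

```cpp
#include <iostream>

// Hardware-facing interface: the only thing the application logic sees.
class ITemperatureSensor {
public:
    virtual ~ITemperatureSensor() = default;
    virtual double readCelsius() = 0;
};

// Application logic: depends on the interface, never on a concrete driver,
// so it can be unit-tested on the host without any target hardware.
class OverheatMonitor {
public:
    explicit OverheatMonitor(ITemperatureSensor& sensor) : sensor_(sensor) {}
    bool isOverheating(double limitCelsius) {
        return sensor_.readCelsius() > limitCelsius;
    }
private:
    ITemperatureSensor& sensor_;
};

// Software mock that simulates hardware behavior for host-side tests.
class FakeTemperatureSensor : public ITemperatureSensor {
public:
    explicit FakeTemperatureSensor(double value) : value_(value) {}
    double readCelsius() override { return value_; }
private:
    double value_;
};

int main() {
    FakeTemperatureSensor fake(95.0);
    OverheatMonitor monitor(fake);
    // A real project would assert this in a unit-test framework; a plain
    // check keeps the sketch self-contained.
    std::cout << (monitor.isOverheating(80.0) ? "overheating" : "ok") << '\n';
}
```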
2: What strategies do you use to mitigate hardware obsolescence in long-lived systems?
Alexander Kushnir:
Of course. It is not exactly my responsibility, but I am in the loop. When designing a hardware platform, the engineer must ensure that the components they choose have long-term support. Having said that, I prefer to use an off-the-shelf System-on-Module (SOM) integrated onto a custom board, rather than developing a board around the same CPU (or FPGA) and having to bring up the most basic interfaces, such as memory or flash storage, myself. This reduces the complexity of board bring-up and makes it easier to handle hardware obsolescence, because the SOM vendor typically manages the low-level design, interface validation, and long-term component sourcing.
3: How do you reconcile the need for regular updates (e.g. for security patches or feature improvements) with the need to minimize disruption and regulatory overhead?
Alexander Kushnir:
Every change needs to be justified.
One of the projects I am most proud of was adding a firmware update capability to a device my team was developing.
However, the regulatory burden remains — any update that could affect safety or compliance still requires formal review and, if necessary, re-certification. In practice, we minimize disruption by:
Separating safety-critical functions into a stable, validated firmware baseline that is rarely touched.
Isolating updatable modules (non-critical logic, UI features, analytics, etc.) so they can evolve without impacting certified components.
Using risk-based change management to decide when an update is worth the cost of triggering the regulatory process — for example, prioritizing security patches and critical bug fixes, while bundling minor enhancements into larger, less frequent releases.
In this way, the need to keep embedded software up to date becomes operationally similar to maintaining conventional PC or cloud-based software, but with the extra discipline required for regulated environments.
4: What architectural patterns help maintain software flexibility in these conditions? For instance, have you used hardware abstraction layers, multi-process architectures, or IPC frameworks to decouple software from specific hardware so you can update or add features without a full redesign? How effective have these methods been in extending the usable life of older platforms in your experience?
Alexander Kushnir:
Abstract all you can. Whether you take the OOP approach (C++, my love) or a procedural one, abstraction and modularity must be applied. A Hardware Abstraction Layer (HAL) is an excellent example of abstraction, because the application logic is not aware of the hardware. Linux, for example, took this to the extreme with its “everything is a file” paradigm: whether it is a network connection, a hardware device, or a real file, the user simply reads from and writes to a file.
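As a small illustration of that “everything is a file” idea, the same open/read/close calls work for a regular file, a device node, or a descriptor obtained from a socket. The device path /dev/my_sensor below is a placeholder, not a real driver.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <array>
#include <iostream>

int main() {
    const char* devicePath = "/dev/my_sensor";   // placeholder device node
    int fd = open(devicePath, O_RDONLY);
    if (fd < 0) {
        std::cerr << "cannot open " << devicePath << '\n';
        return 1;
    }

    // Exactly the same call we would use for an ordinary file or a socket.
    std::array<char, 64> buffer{};
    ssize_t n = read(fd, buffer.data(), buffer.size());
    if (n > 0) {
        std::cout << "read " << n << " bytes from the device\n";
    }
    close(fd);
    return 0;
}
```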
A multi-process architecture makes sense when the software has many functions and a malfunction in one of them must not affect the others. For instance, I once worked on an infrastructure that included a terminal (CLI), a database engine, and several other features. If the DB engine crashed, the terminal would continue running unaffected, thanks to the isolation between processes.
Another, trickier, use of a multi-process architecture is when a programmer needs to use a GPL-licensed library in a proprietary environment and does not want to expose the proprietary code. In such a case they can create a separate process that links against the GPL-licensed library and communicates with the main software over a well-defined interface such as a pipe, a socket, or shared memory.
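A minimal sketch of that isolation pattern, assuming a trivial request/reply exchange over a Unix socket pair; in a real design the child process is the one that links against the GPL-licensed library, and the strings stand in for a well-defined protocol.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstring>
#include <iostream>

int main() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {                          // child: the GPL-side helper process
        close(fds[0]);
        char request[64] = {};
        ssize_t received = read(fds[1], request, sizeof(request) - 1);
        (void)received;
        // Here the helper would call into the GPL-licensed library
        // and send back only the computed result.
        const char* reply = "result";
        ssize_t sent = write(fds[1], reply, std::strlen(reply));
        (void)sent;
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);                           // parent: the proprietary application
    const char* request = "compute";
    ssize_t sent = write(fds[0], request, std::strlen(request));
    (void)sent;

    char reply[64] = {};
    ssize_t received = read(fds[0], reply, sizeof(reply) - 1);
    (void)received;
    std::cout << "helper replied: " << reply << '\n';

    close(fds[0]);
    waitpid(pid, nullptr, 0);
    return 0;
}
```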
I will repeat myself - abstract all you can. However, you must pay attention to the cost of these abstractions. For example, if you use runtime polymorphism, you’ll need to profile your virtual dispatches to verify that they create no bottleneck in your critical path.
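When profiling does show that virtual dispatch in a hot loop is a bottleneck, one well-known alternative (my illustration, not something specific to the systems discussed here) is static polymorphism via CRTP, which resolves the call at compile time. The filter names and arithmetic below are placeholders.

```cpp
#include <cstddef>

// CRTP: the base class knows the derived type at compile time, so process()
// is a normal, inlinable call rather than a virtual dispatch.
template <typename Derived>
class FilterBase {
public:
    float process(float sample) {
        return static_cast<Derived*>(this)->processImpl(sample);
    }
};

class FirFilter : public FilterBase<FirFilter> {
public:
    float processImpl(float sample) { return 0.5f * sample; }  // placeholder math
};

template <typename Filter>
float runHotLoop(FilterBase<Filter>& filter, const float* samples, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += filter.process(samples[i]);   // no vtable lookup in the loop
    }
    return acc;
}

int main() {
    FirFilter filter;
    float samples[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    return static_cast<int>(runHotLoop(filter, samples, 4));
}
```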
5: How do you decide what to keep backward compatible versus when to break from legacy constraints? Are there lessons from enduring platforms (for example, the VMEbus standard stayed relevant for 40+ years by emphasizing modularity and backward compatibility) that you apply to provide a clear migration path for long-term customers?
Alexander Kushnir:
Well, that’s a tough question. If the device interfaces with the outside world, changing that interface is always the last resort. However, if changes are inevitable, they can be mitigated. For example, if you think ahead when designing the protocol, you can add versioning so that new features or changes do not affect older generations of devices. In some cases, you can run multiple versions in parallel or provide adapters to bridge old and new systems, giving customers a clear migration path. This approach is similar to what made platforms like VMEbus last for decades: keep the external contracts stable, design for modularity, and plan for evolution without forcing everyone to upgrade at once.
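A sketch of what such versioning can look like on the wire, with a hypothetical header layout; the point is only that the version field is read first and drives how the rest of the payload is interpreted.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical wire format: every message starts with a fixed header carrying
// a protocol version, so newer devices can keep talking to older ones.
#pragma pack(push, 1)
struct MessageHeader {
    uint8_t  version;      // protocol version the message was encoded with
    uint8_t  messageType;  // command / response identifier
    uint16_t payloadSize;  // size of the payload that follows, in bytes
};
#pragma pack(pop)

void handleMessage(const std::vector<uint8_t>& raw) {
    if (raw.size() < sizeof(MessageHeader)) {
        throw std::runtime_error("truncated message");
    }
    MessageHeader header{};
    std::memcpy(&header, raw.data(), sizeof(header));

    // Dispatch on the version so that fields added later are simply ignored
    // when an older peer sends a shorter payload.
    switch (header.version) {
    case 1:
        // parse only the fields that existed in v1
        break;
    case 2:
        // parse the v1 fields plus the extensions introduced in v2
        break;
    default:
        // unknown (newer) version: reject, or fall back to the latest
        // version we understand if the protocol guarantees stable field order
        break;
    }
}
```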
6: In a system meant to last a decade or more, how do you design for maintainability to slow down software aging? Can you share practices you use to avoid “bit rot” that ensure the codebase remains clean and adaptable to new requirements over time?
Alexander Kushnir:
All principles mentioned in my answer to the first question apply here. You can’t avoid software aging, as the ecosystem moves quickly. However, if your system is modular enough, the changes can be rolled out gradually, for instance, refactoring module by module, after testing each one thoroughly.
Additionally, CI tests are a must. I would even say that every pull request should be gated, i.e. merged only if it passes all the tests. Many developers don’t like writing tests, but in fact the tests protect them and give them the confidence to make major changes without breaking things.
7: Have you observed issues like memory leaks, data corruption, or performance degradation creeping in over long uptimes in embedded systems? If so, what proactive fault-tolerance techniques do you recommend to address this?
Alexander Kushnir:
I don’t believe in regular restarts or “scheduled maintenance” where the only action is a reboot. If there’s a problem like a memory leak, it should be fixed—not hidden—especially on a resource-tight device.
Memory leaks are possible, of course, but they can be avoided. In modern C++, for example, using smart pointers eliminates most manual memory management errors. During development, I also recommend dynamic memory analysis tools such as Valgrind, which is still underrated in pre-release testing. Combined with thorough code reviews and targeted stress tests, these measures catch leaks and other resource issues before deployment, reducing the need for reactive “rejuvenation” in the field.
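A small illustration of that point, with a hypothetical acquisition buffer: the RAII version cannot leak on early returns or exceptions, and whatever manual resource management remains can be checked by running the unit tests under Valgrind.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical acquisition buffer, used only to illustrate ownership.
struct SampleBuffer {
    explicit SampleBuffer(std::size_t n) : samples(n) {}
    std::vector<int> samples;
};

// Manual management: the raw pointer leaks if processing throws or an early
// return skips the delete.
void processLeaky() {
    SampleBuffer* buf = new SampleBuffer(1024);
    // ... processing that may throw or return early ...
    delete buf;  // easily skipped on an error path
}

// Modern C++: std::unique_ptr releases the buffer on every exit path, so this
// class of leak simply cannot happen.
void processSafe() {
    auto buf = std::make_unique<SampleBuffer>(1024);
    // ... processing that may throw or return early ...
}   // buffer freed here automatically

// Running the unit tests of such code under Valgrind, e.g.
//   valgrind --leak-check=full ./unit_tests
// then catches the leaks that remain in code still managing resources by hand.
int main() {
    processLeaky();
    processSafe();
    return 0;
}
```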
8: What fault-tolerance strategies do you build in to ensure long-term reliability? Can you share how you determine the right level of redundancy or self-diagnostic capability for a design that needs to last a decade?
Alexander Kushnir:
All the systems I’ve built have interacted with a human at some point—whether an operator, a technician, or an end user. In such cases, the most practical solution is a periodic health check, or Built-In Test (BIT), that monitors critical components and manages system state when a fault is detected. Typically, this means indicating the issue to the user—via an LED, buzzer, or display—so corrective action can be taken.
The specifics depend on the criticality of the system. For non-safety-critical designs, the goal is early detection and clear reporting so the failure can be fixed before it escalates. For higher-reliability requirements, BIT can be combined with fault isolation, allowing unaffected subsystems to keep running, or with limited redundancy (e.g., a backup sensor or communication path) to maintain partial functionality. The “right” level of redundancy or self-diagnostics is always a trade-off between cost, power, size, and the consequences of downtime—but even in minimal designs, proactive monitoring and clear fault signaling are essential for long-term reliability.
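A minimal sketch of such a periodic BIT loop; the check names, the probe bodies, and the reaction to a failure are placeholders for whatever a concrete device would monitor and signal.

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// One monitored component: a name plus a probe that reports its health.
struct HealthCheck {
    std::string name;
    std::function<bool()> probe;  // returns true when the component is healthy
};

void runBitLoop(const std::vector<HealthCheck>& checks,
                std::chrono::seconds period) {
    while (true) {
        for (const auto& check : checks) {
            if (!check.probe()) {
                // In a real device this would drive an LED, a buzzer, or a
                // display, and move the system into a degraded or safe state.
                std::cerr << "BIT failure: " << check.name << '\n';
            }
        }
        std::this_thread::sleep_for(period);
    }
}

int main() {
    std::vector<HealthCheck> checks = {
        {"temperature sensor", [] { return true; }},   // placeholder probes
        {"communication link", [] { return true; }},
    };
    runBitLoop(checks, std::chrono::seconds(5));
}
```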
9: How do you ensure that devices you design today can be kept secure 10+ years down the line?
Alexander Kushnir:
Like I’ve mentioned before, one of the features I’m most proud of is the firmware update capability we built into one of the devices I worked on. I think this is a crucial capability—not just for delivering new functionality, but also for applying OS and security patches over the device’s entire lifetime.
To keep a system secure for 10+ years, the update mechanism itself must be secure: signed and verified updates, encrypted transport, and a rollback option in case an update fails. In regulated environments, it also needs to integrate with compliance workflows so updates can be deployed without breaking certification. In some cases, it’s wise to design for network segmentation or controlled update channels, so that only trusted endpoints can initiate the process. Without this foundation from day one, long-term patching becomes either risky or impossible.
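A sketch of that flow under an assumed A/B slot layout: the image is verified before it touches the inactive slot, and verifySignature() is a placeholder for a vetted crypto or secure-boot facility rather than a real API; the file paths are hypothetical as well.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Placeholder: a real device would delegate this to a crypto library or a
// secure-boot facility.
bool verifySignature(const std::vector<uint8_t>& image,
                     const std::vector<uint8_t>& signature) {
    return !image.empty() && !signature.empty();
}

bool applyUpdate(const std::filesystem::path& imagePath,
                 const std::filesystem::path& signaturePath,
                 const std::filesystem::path& inactiveSlot) {
    auto readAll = [](const std::filesystem::path& p) {
        std::ifstream in(p, std::ios::binary);
        return std::vector<uint8_t>(std::istreambuf_iterator<char>(in), {});
    };

    const auto image = readAll(imagePath);
    const auto signature = readAll(signaturePath);

    if (!verifySignature(image, signature)) {
        std::cerr << "rejected unsigned or tampered image\n";
        return false;  // the currently running firmware stays untouched
    }

    // Write only to the inactive slot; the bootloader switches slots after a
    // successful boot and falls back to the old slot otherwise (rollback).
    std::ofstream out(inactiveSlot, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(image.data()),
              static_cast<std::streamsize>(image.size()));
    return out.good();
}

int main() {
    const bool ok = applyUpdate("update.img", "update.sig", "slot_b.img");
    return ok ? 0 : 1;
}
```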
10: Are there insights or practices—whether from automotive, avionics, or industrial IoT—that you find relevant or transferable to your work? Are there philosophies or practices from other domains that you think MedTech could borrow—or should avoid?
Alexander Kushnir:
I think the processes in MedTech are good, but slow. Code review, documentation, testing—these all have a clear purpose, and they exist for good reasons. But no process should be treated as sacred. Code review isn’t done just because “that’s the rule”; it’s done to catch defects and improve design. The same goes for documentation and tests—they’re tools, not rituals.
That’s something I see in other industries as well. Automotive has learned to speed up iterations without skipping the essentials, especially with OTA updates. Avionics shows how you can lock down safety-critical code while still evolving peripheral systems. From these, I think MedTech can borrow the idea of tailoring process intensity to the context—keeping rigorous control where safety demands it, but streamlining where it doesn’t. The key is to always ask: how crucial is this step at the stage we’re in right now?