The Principles of Resiliency: Hardware Eventually Fails, Software Eventually Works

By: Michael Puldy


Last week, I visited the Department of Computing at Clemson University to discuss resiliency. It’s been almost a decade since I set foot on a college campus and had the pleasure of connecting with a new generation of computer scientists.

According to a recent article published in Quartz, computer science jobs are hotter than ever. In 2015, about 59,000 computer science majors graduated in the U.S., and with about 530,000 openings in the field nationwide, their careers are among the best paid and most in demand.

When I graduated with my computer science degree, I had no idea what it meant to incorporate resiliency into my code or technology designs. Today’s graduates are no exception. I took a quick poll of over 100 computer science juniors, and with a few exceptions, today’s students just aren’t aware of business continuity, disaster recovery or even resiliency.

Consequently, for those interested in dipping a toe into our world, I’ll relay what I told the Clemson University junior class. Here are the three key principles for resiliency in design and operations.

1. Always-On Availability

Whether you’re talking about a mobile phone, an app, a server or an infrastructure in the cloud, customers want and demand 100 percent availability. This means technology must be autonomic, in the sense that it can self-correct internal failures, reroute transactions and anticipate problems before they happen. On the surface, this is a simple goal.
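To make "reroute transactions" a little more concrete, here is a minimal sketch of failover in Python. The names (`send_with_failover`, `primary`, `standby`) are hypothetical and the backends are stand-in functions; a real system would add health checks, timeouts and retries with backoff.

```python
from typing import Callable, Sequence


class AllBackendsFailed(Exception):
    """Raised when no backend could process the transaction."""


def send_with_failover(transaction: str,
                       backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend(transaction)
        except Exception as exc:  # illustrative only: treat any error as a failure
            errors.append(exc)
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed")


# Hypothetical backends: the primary is down, the standby is healthy.
def primary(tx: str) -> str:
    raise ConnectionError("primary is down")


def standby(tx: str) -> str:
    return f"processed {tx} on standby"


result = send_with_failover("order-42", [primary, standby])
print(result)
```

The caller never sees the primary's failure; it only sees an error if every backend fails, which is the behavior customers expect from an "always-on" service.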

2. Risk Versus Budget

Always-on availability is expensive, though. It takes time and money to build perfection, and dealing with risk gone wild requires even greater investment. If a company had an unlimited financial budget and an infinite amount of time to construct a product, it could reduce risk to zero. But unfortunately, the time and finances required to eliminate risk are scarce resources, so business managers, architects and programmers are constantly building technology that focuses on minimizing risk. Even in school, projects have deadlines, and you can’t always make code perfect.
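A rough way to see the trade-off is to turn an availability target into hours of downtime per year, and to see what redundancy buys. The sketch below is illustrative arithmetic only: the function names are my own, and the redundancy formula assumes the copies fail independently, which real systems rarely guarantee.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming that
    failures are independent and that one healthy copy is enough."""
    return 1 - (1 - single) ** copies


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} available: {downtime_hours_per_year(target):.2f} hours down per year")

print(f"Two 99% components in parallel: {redundant_availability(0.99, 2):.4%}")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, and it usually costs more than the nine before it. That is the budget conversation in a nutshell.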

Businesses want to be quick to market, which requires sacrifices in code development, limited test cases and reduced time to test. It’s this trade-off between speed and thorough, well-tested code that ultimately reduces code quality, so problems surface when technology is introduced to the world. Hardware and software technologies are always released with bugs; they’re just not always seen. This brings me to one of my favorite quotes by Michael Hartung, IBM Fellow Emeritus: “Hardware eventually fails, and software eventually works.”

3. Structured Change and Problem Management

Not all problems are equal. When code is released to the world, hundreds if not thousands of defects are discovered. Some problems are operationally critical, affect a large community and require immediate resolution. Other issues may be functional or impact a small percentage of the client base.
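As a minimal sketch of that kind of triage, the Python below orders a defect backlog so that operationally critical problems come first, and within each group, the ones affecting the most clients. The `Defect` fields, the sample data and the ordering rule are assumptions for illustration, not a description of any particular problem management tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Defect:
    name: str
    operationally_critical: bool
    clients_affected_pct: float  # share of the client base affected, 0-100


def triage(defects: List[Defect]) -> List[Defect]:
    """Critical defects first; within each group, widest impact first."""
    return sorted(
        defects,
        key=lambda d: (not d.operationally_critical, -d.clients_affected_pct),
    )


# Hypothetical backlog after a release.
backlog = [
    Defect("typo in help text", False, 40.0),
    Defect("data loss on failover", True, 5.0),
    Defect("login outage", True, 80.0),
    Defect("report layout glitch", False, 2.0),
]

for defect in triage(backlog):
    print(defect.name)
```

Note that the help-text typo touches far more clients than the failover bug, yet it still sorts below it. Severity and reach are separate questions, and a structured process asks both.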

Problem management must be structured with a known root cause so critical problems can enter the development process quickly. Without a proper root cause, code fixes returned to the field may accomplish nothing other than creating client satisfaction issues. Sometimes, the root cause is never found. It’s rare, but it happens.

Code enhancements and problem fixes are returned to the field via a change management process. While a code change may fix a known problem, an improperly tested change could also create new issues.

Make Sense? Or Am I Speaking Greek?

Now, what I highlighted above is probably basic knowledge to my continuity colleagues, but to most people in our industry or looking to enter it, I might as well be speaking Greek.

Based on my post-lecture feedback, I think I reached around 5 percent of the 100 students I met. Nevertheless, I like to think that as more students advance in their careers, these key principles will become relevant in their code development and system design, and ultimately result in a better technology experience.

I’ve spent over half of my career in this industry, and the other half I’ve devoted to either understanding the importance of resiliency or dealing with the aftereffects associated with bad designs and poor planning.

What’s your destiny?



About The Author

Michael Puldy

Director of Global Business Continuity Management for Global Technology Services, IBM

Michael is responsible for long-term strategy, tactical guidance and governance for business continuity management and resiliency programs across the globe at IBM. For the majority of Michael's career, he has focused almost exclusively on business resiliency. From his personal experience in the financial industry through his services and product tenures at IBM, he has experienced the resiliency business through the entire spectrum. From restoring data centers and client services in times of crisis, to developing technical and business solutions, through the industry's evolution to cloud, cyber resiliency, and shared resiliency solutions, Michael is able to bring forward a real-time perspective on today's key issues in the resiliency space.
