Release It!

Metadata

  • Author: [[Michael T. Nygard]]

Highlights

Software design as taught today is terribly incomplete. It only talks about what systems should do. It doesn’t address the converse—what systems should not do. They should not crash, hang, lose data, violate privacy, lose money, destroy your company, or kill your customers. — location: 225 ^ref-42282


Product designers in manufacturing have long pursued “design for manufacturability”—the engineering approach of designing products such that they can be manufactured at low cost and high quality. — location: 249 ^ref-19437


I guarantee that an architect who doesn’t bother to listen to the coders on the team doesn’t bother listening to the users either. — location: 304 ^ref-64098


Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. Cynical software doesn’t even trust itself, so it puts up internal barriers to protect itself from failures. It refuses to get too intimate with other systems, because it could get hurt. — location: 610 ^ref-44747
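A minimal sketch of what a "cynical" integration point can look like in code: every call to a neighboring system carries a timeout, and failure is treated as a normal, expected outcome rather than a surprise. The host, port, protocol, and fallback value here are hypothetical, not from the book.

```python
import socket

def check_inventory(sku, host="inventory.example.com", port=8080, timeout=0.5):
    """Ask a (hypothetical) inventory service for stock on hand."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(f"CHECK {sku}\n".encode())
            reply = conn.recv(64).decode().strip()
            return int(reply)
    except (OSError, ValueError):
        # Internal barrier: the remote system's trouble stays its trouble.
        # Degrade to a safe default instead of hanging or crashing.
        return 0
```

The point is the shape, not the protocol: the timeout bounds how intimate this code gets with the other system, and the except clause is the internal barrier the quote describes.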


A robust system keeps processing transactions, even when transient impulses, persistent stresses, or component failures disrupt normal processing. This is what most people mean by “stability.” — location: 634 ^ref-12563


The terms impulse and stress come from mechanical engineering. An impulse is a rapid shock to the system. An impulse to the system is when something whacks it with a hammer. In contrast, stress to the system is a force applied to the system over an extended period. — location: 636 ^ref-14904


The major dangers to your system’s longevity are memory leaks and data growth. Both kinds of sludge will kill your system in production. Both are rarely caught during testing. — location: 650 ^ref-31750


Therefore, if you do not test for crashes right after midnight or out-of-memory errors in the application’s forty-ninth hour of uptime, those crashes will happen. If you do not test for memory leaks that show up only after seven days, you will have memory leaks after seven days. — location: 653 ^ref-6804
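A sketch of the classic slow leak the quote warns about, and a bounded alternative. An unbounded cache passes every short test run and only kills the process days into production; the LRU-style eviction here is one common fix, with sizes and the compute function invented for illustration.

```python
from collections import OrderedDict

def expensive_compute(key):
    # Stand-in for real work (hypothetical).
    return key * 2

# The leak: a cache keyed by request that is never evicted.
leaky_cache = {}  # grows forever under production traffic

def lookup_leaky(key):
    if key not in leaky_cache:
        leaky_cache[key] = expensive_compute(key)
    return leaky_cache[key]

# The bounded version survives long uptimes: oldest entries are evicted.
class BoundedCache:
    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self.data = OrderedDict()

    def lookup(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as recently used
            return self.data[key]
        value = expensive_compute(key)
        self.data[key] = value
        if len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict least-recently-used
        return value
```

Nothing distinguishes the two versions in a ten-minute test; only a multi-day soak test (or production) shows the first one growing without bound.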


No matter what, your system will have a variety of failure modes. Denying the inevitability of failures robs you of your power to control and contain them. — location: 674 ^ref-37786


Once you accept that failures will happen, you have the ability to design your system’s reaction to specific failures. Just as auto engineers create crumple zones—areas designed to protect passengers by failing first—you can create safe failure modes that contain the damage and protect the rest of the system. This sort of self-protection determines the whole system’s resilience. — location: 675 ^ref-27216


The more tightly coupled the architecture, the greater the chance this coding error can propagate. Conversely, the less-coupled architectures act as shock absorbers, diminishing the effects of this error instead of amplifying them. — location: 704 ^ref-9782


Triggering a fault opens the crack. Faults become errors, and errors provoke failures. That’s how the cracks propagate. — location: 721 ^ref-37836


One way to prepare for every possible failure is to look at every external call, every I/O, every use of resources, and every expected outcome and ask, “What are all the ways this can go wrong?” — location: 726 ^ref-24247
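The exercise in the quote, applied to a single external call: each except clause names one distinct answer to "what are all the ways this can go wrong?" The pricing URL and return values are hypothetical.

```python
import json
import urllib.error
import urllib.request

def fetch_price(sku, base="http://pricing.example.com"):
    """Fetch a price from a (hypothetical) pricing service."""
    url = f"{base}/price/{sku}"
    try:
        with urllib.request.urlopen(url, timeout=1.0) as resp:
            body = resp.read(4096)            # the reply may be huge
            return json.loads(body)["price"]  # the reply may be malformed
    except urllib.error.HTTPError:
        return None  # the remote answered, but with an error status
    except urllib.error.URLError:
        return None  # DNS failure, refused connection, or timeout
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # a reply arrived, but not the one we expected
```

Walking every call site this way is tedious, which is exactly why the book treats it as a deliberate design exercise rather than something that happens by default.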


Both camps agree on two things, though. Faults will happen; they can never be completely prevented. And we must keep faults from becoming errors. You have to decide for your system whether it’s better to risk failure or errors—even while you try to prevent failures and errors. — location: 741 ^ref-6132


Over time, however, patterns of failure do emerge. A certain brittleness along an axis, a tendency for this problem to amplify that way. These are the stability antipatterns. Chapter 4, Stability Antipatterns, deals with these patterns of failure. — location: 746 ^ref-21495


Chapter 5, Stability Patterns, deals with design and architecture patterns to defeat the antipatterns. These patterns cannot prevent cracks in the system. Nothing can. Some set of conditions will always trigger a crack. But these patterns stop cracks from propagating. They help contain damage and preserve partial functionality instead of allowing total failures. — location: 749 ^ref-53537