Today I Learned

July 2023

2023-07-28
Lecture Friday: How Complex Systems Fail

Notice it's systems, not just software systems, but all systems. These traits are common to all complex systems: systems too big for any one person to understand every single detail of. It's a great talk because it doesn't just cover the modes of failure, but also the modes of success. How these systems manage to keep working in the first place.

One of the biggest keys is the anti-fragility of heterogeneous systems. Nature has this figured out. It's our limited human brains that don't like it, because it leans into the complexity instead of our desire to overlay simple narratives on chaos. When you have a monoculture, a single type of something, a single vendor, a single dependency, a single person, the whole system is either working or failed. When you have a bunch of varied, different dependencies, the rate of incidents goes up, but the impact of each incident goes down.

If you have one hundred variants, a failure is isolated to one percent of the system. They will fail; everything eventually does. The key is containment and the ability to gracefully redistribute work within the system when they fail. There are often many ways to route around a one percent failure with minimal impact. When you have even a twenty percent failure, good luck getting the system to suddenly absorb that.
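The rate-versus-impact tradeoff above is easy to check numerically. Here's a minimal sketch (my own toy model, not from the talk) that simulates a system split into independent variants, each failing with the same small probability, and compares a monoculture against a hundred-variant system:

```python
import random

random.seed(0)

def simulate(num_variants, p_fail=0.05, trials=10_000):
    """Simulate a system split across independent variants.

    Each variant fails independently with probability p_fail per trial.
    Returns (incident_rate, mean_impact): how often *anything* failed,
    and what fraction of the system was down when something did.
    """
    incidents = 0
    impact_sum = 0.0
    for _ in range(trials):
        failed = sum(random.random() < p_fail for _ in range(num_variants))
        if failed:
            incidents += 1
            impact_sum += failed / num_variants
    return incidents / trials, (impact_sum / incidents if incidents else 0.0)

# Monoculture: incidents are rare, but every incident takes out everything.
rate1, impact1 = simulate(1)
# Heterogeneous: incidents happen almost constantly, but each is tiny.
rate100, impact100 = simulate(100)
print(f"monoculture:  rate={rate1:.2f}  impact={impact1:.2f}")
print(f"100 variants: rate={rate100:.2f} impact={impact100:.2f}")
```

The monoculture fails about five percent of the time with one hundred percent impact; the heterogeneous system has an incident in nearly every trial, but the average blast radius stays around five percent.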

Another key is focusing on the ability to tailor the system: the ability to modify the existing system relatively easily, in ways the designer can keep stable. He talks about lift points on heavy equipment. Someone is going to try to lift it; marking where that's safe helps significantly. The real world is not well behaved. You can't account for every possible use of a complex system. Allowing the world to adapt the system is critical for keeping the system working long term.

One key focus is on systems as imagined versus systems as found. I've found it all too tempting myself to think of systems as imagined: to assume I know why something's happening or how to fix it, instead of operating on the system as found, which means going and experimenting. Systems as found are managed through monitoring, responding, adapting, and learning. Data, analysis, decision, action. This is how dynamic systems work. It's control theory all over again: the need to close loops.
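That data/analysis/decision/action cycle is just a closed feedback loop. A minimal sketch (my own illustration, with a toy "tank level" standing in for any measurable system state) using a simple proportional correction:

```python
def control_loop(measure, actuate, setpoint, gain=0.5, steps=50):
    """Minimal closed loop: sense, compare, correct, repeat.

    measure() reads the system as found; actuate(delta) nudges it.
    The gain decides how hard each step pushes back on the error.
    """
    for _ in range(steps):
        error = setpoint - measure()   # data + analysis
        actuate(gain * error)          # decision + action

# A toy system we can only observe and nudge, never set directly.
state = {"level": 0.0}
control_loop(measure=lambda: state["level"],
             actuate=lambda d: state.__setitem__("level", state["level"] + d),
             setpoint=10.0)
print(round(state["level"], 2))  # → 10.0
```

The point isn't the controller itself, it's the shape: nothing here assumes you understand the system's internals, only that you can measure it and act on it, then go around the loop again.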

I also love the distinction between reliability and resilience. The key being that reliability means the system doesn't fail; resilience means that when it does (again, it always does), it smoothly adapts and recovers. Essentially, how smooth you can make that phase transition. A system that rarely goes down often has very abrupt boundaries between working and failure, because the control loops there are immature or even non-existent. Suddenly people who never deal with failure are having to deal with failure, and it leads to long, jarring recoveries.
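One way that smoothness shows up in software is degrading to a fallback instead of failing outright. A hedged sketch (the `flaky` dependency and stale-answer fallback are hypothetical, not from the talk):

```python
def resilient_call(primary, fallback, attempts=3):
    """Try the primary path a few times; on repeated failure,
    degrade gracefully to a fallback instead of failing abruptly."""
    for _ in range(attempts):
        try:
            return primary()
        except Exception:
            continue  # retry: absorb transient failure
    return fallback()  # degrade: stale or partial answer beats none

# Hypothetical dependency that is down, and a cached answer to fall back on.
def flaky():
    raise ConnectionError("dependency down")

print(resilient_call(flaky, fallback=lambda: "stale-but-usable"))
# → stale-but-usable
```

The failure boundary is no longer a cliff: callers see a degraded answer rather than an outage, which is exactly the smoother phase transition resilience is after.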

2023-07-21
Lecture Friday: The Future of Programming

2023-07-14
Lecture Friday: PID Loops and the Art of Keeping Systems Stable

2023-07-07
Lecture Friday: Real Software Engineering