How many systems make up a typical software application these days? I suspect it’s more than a few. With the advent of the internet, then web services and now the cloud, few applications are self-contained any more. A recent analysis of our own cloud service made me realise that software engineering has become more like systems engineering or, as the title of this blog states, a hybrid: software systems engineering. An excellent book I read this summer, Release It, reinforces this view.
What is systems engineering?
My first proper job after university was working in a fairly special avionic systems team. They were formed to rapidly upgrade the aging Jaguar attack aircraft’s avionics, for service in the Bosnia conflict. Our leader was inspiring and won an MBE for his work. I’ll never forget meeting him and how he introduced systems engineering as being the discipline that “makes the whole greater than the sum of its parts”.
Modern aircraft are a classic example of this kind of complex system, being made up of component parts such as engines, flight computers, GPS and radar. They also demonstrate the importance of resilience and graceful degradation: if one part fails, the whole does not fail, although the overall value it provides may be reduced.
Fast forward nearly twenty years and as I look at even the most basic software application I’ve written recently, the systems engineering definition and principles echo round my head again.
Applying systems engineering to software
To illustrate this, I’ll use one of our simplest internal web applications: our release note generator. I blogged about this app previously, but here’s a quick recap of what it does. Given a release identifier (a Subversion revision, in our case), the application summarises how this release differs from what’s currently in production. The summary is a list of the related work-items (stories or defects), giving each work-item’s id, title, state and type.
Even though it’s a very simple web app, as the picture shows, it still depends on three other software systems:
Production version provider
Source control system
Work-item repository (an external, cloud service)
This application is thus a software system, with component parts that can, and probably will at some point, fail. Our challenge as software systems engineers is to minimise the impact of any part failing.
Failure modes and how to handle them (gracefully)
The three systems that the application depends on are similar enough that they have the same general failure modes:
Timeout. The system responds, but exceeds the allowed time limit
Network or server error - aka “computer says no”. The system either doesn’t respond at all or responds with an error, for example, HTTP Status 500 - Server Error
Unexpected response. The system responds but in a manner that the application hasn’t catered for. For example, it expected a numerical response, such as 1.23, but instead received some text, such as 1.23A
(Note that these failure modes are very general - Release It details a much greater number of specific failure modes, but they are beyond the scope of this post.)
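To make the three general failure modes concrete, here is a minimal Python sketch (not the release note generator’s actual code) that maps whatever goes wrong during a call onto one of them. The names FailureMode, Outcome and call_dependency are made up for illustration, and the sketch assumes the dependency returns a number as text, as in the 1.23 example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class FailureMode(Enum):
    TIMEOUT = "timeout"
    NETWORK_OR_SERVER_ERROR = "network or server error"
    UNEXPECTED_RESPONSE = "unexpected response"


@dataclass
class Outcome:
    """Either a parsed value or the failure mode that prevented one."""
    value: Optional[float] = None
    failure: Optional[FailureMode] = None


def call_dependency(fetch: Callable[[], str]) -> Outcome:
    """Call a dependency (hypothetical `fetch`) and classify any failure."""
    try:
        raw = fetch()
    except TimeoutError:
        # Caught first because TimeoutError is a subclass of OSError.
        return Outcome(failure=FailureMode.TIMEOUT)
    except OSError:
        # Covers refused connections and urllib's URLError/HTTPError.
        return Outcome(failure=FailureMode.NETWORK_OR_SERVER_ERROR)
    try:
        return Outcome(value=float(raw))
    except ValueError:
        # The system responded, but not with the number we expected.
        return Outcome(failure=FailureMode.UNEXPECTED_RESPONSE)
```

Once every call comes back as an Outcome, the application can decide what each failure means for each dependency, rather than leaving that decision to its default error handling.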
If we fail to consider how to handle these general failure modes, then the application will handle them for us, with its default mechanism. At worst, this could be showing the user the ugly Yellow Screen of Death (YSOD); at best it will be a generic error message that leaves the user confused, frustrated and helpless.
When we consider these modes, it becomes apparent that although the application’s dependencies all have the same failure modes, their impact and mitigation very much depend on which system fails, as the table below shows. Even for an airplane, the failure of some systems (such as all its engines) will cause the airplane to eventually drop out of the sky!
The timeout failure mode for the work-item repository warrants further discussion. A release note typically contains the details of 10+ work-items. To retrieve these details, a separate HTTP request is sent per work-item. If each request encounters either a timeout (of, for example, 20 seconds) or a delay (of, for example, just under 20 seconds), then a poorly engineered implementation that serialises these requests would result in a considerable delay (for ten work-items, around 10 × 20 = 200 seconds) before the user sees any response at all! On the other hand, a well-engineered solution that performs these requests in parallel could both optimise performance and provide more graceful handling of this failure mode, because the total wait is then bounded by roughly one timeout rather than the sum of them all.
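The parallel approach might look something like the sketch below, written with Python’s asyncio. It is an illustration under assumptions, not our implementation: fake_fetch is a hypothetical stand-in for the real HTTP call to the work-item repository, the function names are invented, and the timings are scaled down from seconds to fractions of a second so that it runs quickly.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

Fetch = Callable[[int], Awaitable[str]]


async def fetch_with_timeout(fetch: Fetch, item_id: int,
                             timeout_s: float) -> Optional[str]:
    """Fetch one work-item; return None instead of raising on timeout."""
    try:
        return await asyncio.wait_for(fetch(item_id), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def fetch_all(fetch: Fetch, item_ids: List[int],
                    timeout_s: float) -> Dict[int, Optional[str]]:
    """Issue all requests concurrently, so the total wait is about one
    timeout rather than one timeout per work-item."""
    results = await asyncio.gather(
        *(fetch_with_timeout(fetch, i, timeout_s) for i in item_ids)
    )
    return dict(zip(item_ids, results))


async def fake_fetch(item_id: int) -> str:
    """Hypothetical stand-in for the HTTP call; item 3 is deliberately slow."""
    await asyncio.sleep(1.0 if item_id == 3 else 0.01)
    return f"Work-item {item_id}"
```

Calling asyncio.run(fetch_all(fake_fetch, list(range(1, 11)), 0.2)) returns a title for nine of the ten work-items and None for item 3, after roughly one timeout period. One way to degrade gracefully from there would be to show the slow work-item with just its id and a note that its details could not be retrieved, instead of failing the whole release note.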
Conclusion: know your application’s systems and plan for their failure
I’m hoping that this post has given you two things to consider and act on:
Most software applications today are complex systems. As software systems engineers, we should know and care about the systems our application depends on
Complex systems have component parts that can and will fail. The world will be a nicer place if we, as software systems engineers, plan for this, and build applications that gracefully degrade instead of metaphorically falling out of the sky