For years, automation has been heralded as the final solution to IT’s most troublesome puzzle: human error.
The benefits of automation are significant: it has been shown to save people time and free their energy for other work. But it also brings challenges. The assumption that a process is free of human error simply because it has been automated is a dangerous one.
After all, the automated system was built by a human, and it obeys the commands given to it by its human creators. Therefore, any instructions provided via the user interface, a configuration file, or script will be carried out by the system, even if they are erroneous. The solution is ensuring that all systems are monitored by building in pervasive observability from the start.
Human error at the source
Automation allows us to build and operate systems with enormous scale and complexity. It allows enterprises to respond to opportunities and threats in their business environment with speed and agility. These kinds of tasks would take a human a very long time to complete, but an automated system can run continuously and quickly in the background.
A simple example of an automated system is a Microsoft Excel spreadsheet. Faced with a vast table of figures, you could do calculations by hand, adding up all the rows and columns and multiplying or dividing them as needed. You could put these findings into a table or graph and even go further to perform “what if” analyses again and again. This effort would take a human a long time, but it would take the humble spreadsheet a matter of microseconds.
However, to get the correct results, you have to give the spreadsheet the proper instructions. A mistake in the request will result in an error that could be catastrophic. Indeed, in a now-infamous example, economists Carmen Reinhart and Kenneth Rogoff published research results in 2010 that would influence government policy across the world. Following the 2008 economic crisis, they had conducted research looking at the correlation between a country’s debt level and its potential for long-term growth. They concluded that countries with a debt to gross domestic product (GDP) ratio of over 90% would grow more slowly. These results influenced politicians around the world and thus began an era of austerity.
Unfortunately, they had made a mistake. Reinhart and Rogoff had used a Microsoft Excel spreadsheet and had made an error in their formula range – instead of averaging the data from 19 countries, the spreadsheet only included 14. As a result, their conclusions were much more pessimistic than they should have been. The mistake was corrected too late, and only after three other researchers pointed it out. We can only speculate whether austerity would have been as harsh had these results been corrected earlier.
This example shows that even in a form of automation as common as the spreadsheet, human error at the instruction level will propagate through the calculations and lead to incorrect and potentially devastating results.
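To make the failure mode concrete, here is a minimal Python sketch of an averaging range that stops too early, together with a simple check that would surface the problem. The growth figures, the function names and the row-count check are all invented for illustration; they are not Reinhart and Rogoff's data or method.

```python
# Hypothetical growth figures (percent) for 19 countries; illustrative only.
growth = [2.1, 1.8, 2.6, 3.0, 2.4, 1.2, 0.9, 1.5, 2.2, 1.1,
          0.7, 1.9, 1.4, 0.8, 2.9, 3.1, 2.7, 3.3, 2.8]


def mean_of_range(values, start, end):
    """Average values[start:end] and report how many rows were used."""
    selected = values[start:end]
    return sum(selected) / len(selected), len(selected)


def checked_mean(values, start, end, expected_rows):
    """Refuse to return an average computed over the wrong number of rows."""
    mean, used = mean_of_range(values, start, end)
    if used != expected_rows:
        raise ValueError(f"averaged {used} rows, expected {expected_rows}")
    return mean


full_mean, full_rows = mean_of_range(growth, 0, 19)
short_mean, short_rows = mean_of_range(growth, 0, 14)  # range ends too early
print(f"{full_rows} rows: {full_mean:.2f} | {short_rows} rows: {short_mean:.2f}")
```

The truncated range still returns a plausible-looking number, which is exactly why nobody notices. Exposing how many rows fed the result, and comparing that with what was expected, turns a silent error into a visible one.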
Managing human error with observability
Observability – which I like to think of as the degree to which you can understand the internal state of a system from its inputs and outputs – provides visibility. Combined with continuous monitoring, that visibility allows errors to be detected and remedied quickly. This applies both to the system’s initial programming – the instructions given to the automation – and to any problems with the makeup of the system itself. The traffic flowing on an IT network provides an excellent source of observability for the applications and services connected over that network.
Automated systems in an IT environment are built from instructions written by humans. It is estimated that for every 1,000 lines of code they write, programmers introduce on average 15 to 50 bugs. That is a lot of scope for human error. Most of these bugs will be detected and remedied through testing, but they won’t all be caught. The same applies to mistakes in the instructions that make up the automated system itself. Of course, none of these errors will be detected if there is no testing.
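As a rough worked example of what that range implies, the Python sketch below applies the 15 to 50 bugs per 1,000 lines estimate to a codebase. The 20,000-line size and the 95% testing detection rate are assumptions chosen purely for illustration, not measured figures.

```python
BUGS_PER_KLOC_LOW = 15   # lower bound of the estimate quoted above
BUGS_PER_KLOC_HIGH = 50  # upper bound of the estimate quoted above


def estimated_bugs(lines_of_code):
    """Return the (low, high) number of bugs implied by the quoted range."""
    kloc = lines_of_code / 1000
    return kloc * BUGS_PER_KLOC_LOW, kloc * BUGS_PER_KLOC_HIGH


def remaining_after_testing(bugs, detection_rate):
    """Bugs still present if testing catches the given fraction of them."""
    return bugs * (1 - detection_rate)


# Hypothetical 20,000-line automation codebase, assumed 95% detection rate.
low, high = estimated_bugs(20_000)
print(low, high)
print(remaining_after_testing(low, 0.95), remaining_after_testing(high, 0.95))
```

Under these assumptions the codebase starts with somewhere between 300 and 1,000 bugs, and roughly 15 to 50 survive testing. With no testing at all, every one of them reaches production.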
In the case of Reinhart and Rogoff, this is precisely what happened. They didn’t share their spreadsheet with anyone to review, and we all know how that turned out. Pervasive visibility derived from observability is the key to avoiding potential damage to businesses.
Even small errors can have an enormous impact. If you consider that complex IT environments usually share data, one small mistake in one system has the potential to spread and cause errors amongst larger supersystems and smaller subsystems. Observability, therefore, is vital at both the micro and macro levels.
Monitoring errors in subsystems
There is a correlation between the potential for error and the number of systems interacting and sharing data. Therefore, it is crucial to understand the interactions between these interconnected systems to avoid unexpected results.
If, for example, the owners of systems A and B decide that there is value to be gained from sharing their data, interaction between the two may be enabled. However, this does not account for system A already being connected to system C, or system B being connected to systems D and E. The result is a “supersystem”.
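The way a single new link enlarges the supersystem can be sketched in a few lines of Python. The code below treats the connections described above as two-way links and uses a breadth-first search to list every system reachable from A. The function names are hypothetical and the model is deliberately simplified.

```python
from collections import deque

# Direct connections described in the text, treated as two-way links.
links = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E")]


def build_graph(pairs):
    """Turn a list of links into a mapping of system -> neighbouring systems."""
    graph = {}
    for left, right in pairs:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def supersystem(graph, start):
    """Breadth-first search: every system reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


print(sorted(supersystem(build_graph(links), "A")))  # ['A', 'B', 'C', 'D', 'E']
```

Without the A to B link, system A can reach only itself and C. Adding that one link puts all five systems within reach of each other, whether or not anyone intended it.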
Indeed, even if the creators of this supersystem are aware that they have created one, systems C, D, and E could be further connected to other systems. The interactions between these systems can create a feedback loop – where a chain of connections is completed, and the end then loops back to rejoin the start.
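One way to catch such a loop before it causes trouble is to record which system sends data to which and search that map for cycles. The Python sketch below does this with a depth-first search. The data flows, including the one from E back to A, are hypothetical and exist only to show the idea.

```python
def find_feedback_loop(flows):
    """Return one cycle as a list of system names, or None if there is none.

    `flows` maps each system to the systems it sends data to.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for target in flows.get(node, ()):
            if state.get(target, WHITE) == GREY:
                # Target is already on the current path: the loop is closed.
                return path[path.index(target):] + [target]
            if state.get(target, WHITE) == WHITE:
                loop = visit(target)
                if loop:
                    return loop
        path.pop()
        state[node] = BLACK
        return None

    for system in list(flows):
        if state.get(system, WHITE) == WHITE:
            loop = visit(system)
            if loop:
                return loop
    return None


# Hypothetical data flows: E feeding back into A closes the loop.
flows = {"A": ["B", "C"], "B": ["D", "E"], "E": ["A"]}
print(find_feedback_loop(flows))  # ['A', 'B', 'E', 'A']
```

A check like this is only as good as the map of flows it is given, which is the practical argument for observing real traffic between systems instead of relying on what their owners believe is connected.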
The accidental creation of feedback loops can have unanticipated consequences, especially if any of these systems contain undetected errors. Observability must be built into systems at the start to understand the interactions between multiple systems and their potential impact on a supersystem. It is too dangerous to simply assume that an automated system will eliminate human error when the impact of errors can quickly become so far-reaching in such complex environments as IT systems.
Businesses and IT professionals would do well to take a sober look at the potential consequences of automation errors before deploying solutions and, most importantly, ensure automation is not granted unobserved access across their IT infrastructures.
Paul Barrett
Chief Technology Officer, Enterprise, NETSCOUT