Dramatically Reduce IT Errors With This Simple Tool

In the early 2000s, the medical staff at Johns Hopkins Hospital implemented a tool that reduced the share of patients enduring pain from 41% to 3%, pneumonia occurrences by 25%, and patient deaths by 21 from the previous year. The tool they implemented - a checklist.

To err is human...

But in medicine, aviation and many other professions, human error can be life altering or, in the worst case, life ending. For us in Information Technology, keeping systems running (maintenance, monitoring, troubleshooting) can sometimes parallel aviation, and fixing systems that are down or ailing can parallel medicine. In IT, human errors may not lead to results as dramatic as those in aviation and medicine, but they will most likely drain money and time. The philosophers Samuel Gorovitz and Alasdair MacIntyre published an essay on human fallibility in the early 1970s. Some things, they said, are beyond our capacity - we have physical and mental limits. However, in the areas where we do have control, there are two reasons for human fallibility:

1. Ignorance - we err due to not having the knowledge that would have prevented the error.

2. Ineptitude - we err due to the incorrect application of knowledge we have.

In the book The Checklist Manifesto, Atul Gawande, author and surgeon, writes about his discovery of the value of checklists and his attempts at using them to overcome the fallibility from ineptitude in the operating room. No one in the operating room is inept, probably far from it, but they are all at times susceptible to ineptitude. A good part of this post shares what I learned from the book.

A problem can be one of three types:

  • Simple - these are the ones where the problem is solved by following a recipe, where the same steps are followed in the same order each time.
  • Complicated - these are the ones like sending a rocket into space or surgery. They take many people/teams with many different skills and specialized expertise. They may be composed of simple problems that can be repeated, but each case may be different due to variables, and quite often the unexpected is encountered.
  • Complex - these are problems like raising a child where applying the same set of steps with a different child may not work at all or may even be disastrous.  Each case is unique.

Checklists are excellent tools for both simple and complicated problems, which cover many of the problems we face in our work environments.  Let's take a look at two examples, one in aviation and one in medicine, and then see if the lessons learned can be applied in Information Technology.


In the fall of 1935, the U.S. Army Air Corps held a competition to select the next long range bomber. The overwhelming favorite to win was Boeing's Model 299. With four engines instead of the typical two, the capacity to carry five times as many bombs, and the ability to fly faster and twice as far as any previous bomber, it seemed to leave the competition on the runway. On its competition flight it took off cleanly, climbed to 300 feet and crashed in a fiery explosion, killing two of the five crew, including the highly skilled pilot.

The investigation revealed no mechanical failure and found the cause to be pilot error. The significant advances in the plane made it more complicated to fly, with much more to manage - a newspaper article dubbed the plane too complicated to fly. The Army went with the smaller design from Douglas Aircraft, and as a result, Boeing almost went bankrupt. However, the Army did buy a few 299s for further testing.

As you may suspect, the story did not end there. The test pilots decided to try out a checklist to aid with flying the plane and remove the errors. Additional training was not really needed, so they isolated the key steps, and only the key checks, for takeoff, flight, landing and taxiing that would fit on an index card. These were all steps every pilot already knew - the dumb stuff. But as it turns out in so many cases, in every field, it's usually the dumb stuff that is skipped. This is true in flying planes, surgery, police work, firefighting, and in IT operations.

The pilots instituted the checklist and the discipline to use it, and went on to fly 1.8 million miles without an accident. The story ends well for Boeing, as the Army ordered 13,000 of the aircraft. The plane, renamed the B-17, went on to play a pivotal role for the Allied Powers in WWII. Even more importantly, the checklist became embedded in aviation and later the space industry, where accidents are extremely high profile but extremely rare. The checklist played a major role in making accidents rare.

WHO Surgical Checklist

In 2006, a woman from the World Health Organization (WHO) called Dr. Gawande with a simple request. The world was seeing a large increase in surgeries, and a significant portion of the care was so unsafe it was a public danger - could Dr. Gawande help develop a global program to reduce avoidable deaths and harm from surgery? Perhaps thinking - WHAT? - he initially said no but later decided to take on the problem.

Medicine, like almost all other professions, has advanced over the years by gaining a significant body of knowledge and through increased use of cutting edge technology. As a result, an operating room that may once have been staffed by a doctor and a nurse is now staffed by a large team of specialists and even subspecialists. Surgery combines both complicated and complex problems. The unexpected usually happens and there are unique situations with every patient. This means success requires not only making sure dumb steps are not missed but also teamwork and coordination to deal with the unexpected and the uniqueness of each case. In many cases several members of the team may not have worked with many of the others. Dr. Gawande, knowing about Johns Hopkins Hospital's and Boeing's experience with checklists, decided to develop one for the WHO.

Gawande and his team traveled to eight hospitals around the world to better understand the challenges each was facing. They chose the hospitals to represent a wide spectrum of circumstances - wealthy countries, poor countries, large and well staffed, large and understaffed, etc. Each hospital had its own view of what was critical, but the team narrowed it down to the three so-called global killers in surgery - infection, bleeding and unsafe anesthesia. They then added steps to help coordination and communication, which are important in anticipating complications. The resulting two-minute, 19-step checklist is shown below.


In the middle column the first check is to confirm all team members have introduced themselves by name and role. A previous controlled study found that surgical teams that introduced themselves had more positive outcomes than ones that did not. The exact reason for that is not known, but the correlation is enough to make the checklist - a win for teaming. The result shown was created after many uses in actual and simulated surgeries. The checklist was implemented in 2008, which involved the hospital leadership, training material, videos, etc., as the checklist on its own would not improve anything. Only through disciplined use did it have even a chance of working, if it would at all.

The other important aspect of this checklist is the Anticipated Critical Events section. In complicated and complex problems, overcoming the simple parts of the overall problem, like reviewing for known allergies, is not enough. Communication within the team becomes an important pause point. A pause point is where you stop and confirm items on the list, or in this case, where the team takes a communication step to review any critical items that need to be addressed or have come up before proceeding.

Gawande was nervous and was not expecting any major improvements. However, the results from the initial study of the 8 hospitals and 4000 surgeries were very impressive. Across the hospitals there was a 36% reduction in post-surgery complications and a 47% reduction in deaths. The results were similar across all the hospitals. When the staff of all the hospitals were asked whether, if they were a patient, they would want their operating team to use the checklist, 93% responded yes.



An IT Story

Once upon a time in a city far away, a great company, ABC Manufacturing, had a problem. Joan from Customer Service calls the help desk to let them know BestAPP just went down and she can't get back on. It's been a few months since it last went down, so Bob at the help desk asks Sanjay, his colleague, if he is supposed to email ACME or call them. Sanjay can only remember the email, so Bob uses that to contact them about the system being down. A half hour later, as Bob and Sanjay are processing the 30 other tickets they received on the system being down, the IT Director, Tran, calls to ask what's going on. Bob says he is still waiting for ACME to respond to his email. Tran says, "Well you are supposed to call them for emergencies". Bob asks if Tran has the number. "Of course I have it", says Tran. Bob asks for it. "Let me call you back", says Tran and hangs up.

Ten minutes later, Tran calls Bob back with the number. Bob reaches Sally at ACME, who says she will jump on the system and find out what's wrong. After 20 minutes Bob gets a call from Tran to find out what's going on. While they are on the phone, Tran gets an email from the VP Operations asking when the system will be up. Bob had sent Sally an email a few minutes earlier but did not get any reply. After Bob hangs up with Tran he calls Sally and asks why she didn't respond to his email. Sally says she was heads down working on the issue so didn't see the email. Then Bob starts working with Sally and ignores the 5 emails he gets asking about the system being down.

Within a half hour Sally and Bob figure out the problem and get the system back up. At the same time Joan from Customer Service calls Bob and he relays the good news. Tran comes into Bob's cubicle and gives him a high five after hearing the news. Tran walks back to his office to see the email from the VP Operations asking if the system is up. He replies with yes and wonders if this process could have been improved. He realizes he has a meeting to go to and leaves to grab some coffee before heading to the conference room. After the meeting, Tran wonders what to get for lunch and forgets to go back to his earlier pondering.

Yes, I did make up this story, but it's an amalgam of experiences with ineptitude that I have either unfortunately actively participated in or observed over the years when an IT process fails. It happens in critical emergencies but also in routine everyday processes. The technical aspects of troubleshooting and resolution may work just fine, as they did in the IT story, but the overall process of dealing with a system down had many flaws - who to call or email, important contact information, how to deal with communication during and after resolution. In the story, judgment was key to every step. In troubleshooting the problem, Sally's and Bob's expertise and judgment were the key to the solution. There's a place for judgment - that's why expertise and skills matter. But there is also a place for protocol, where judgment is not needed and, if called on, may make the wrong decision. That's where a good checklist would have been valuable in this IT story.

Why Checklists Work

To repeat - we are not inept but we are all bound to experience ineptitude. How often have you been interrupted when speaking, even for just a few seconds, and started back by asking "what was I talking about?" Why does every log-in also include a "forgot your password" link? Our memory and attention are both fallible, especially for the routine matters that compete with the more pressing events in our professional and personal lives. Secondly, it's easy to skip over steps even when we remember them, because those steps may not matter all the time and may not even matter the majority of the time. When they do matter, they may matter big.

Checklists help us overcome both of these fallibilities by instilling the discipline and routine to always step through the critical steps. In many cases using the checklist doesn't result in a quick check off of each item. You find something that needs to be addressed, and sometimes that discovery can have a huge impact on the eventual outcome.

When you think about it, a good checklist is a solution to a problem before it becomes a tool.  Someone went through the analysis of errors and determined which steps are the keys to include based on the severity of the outcome resulting from the error or the probability of occurrence.  
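To make that selection step concrete, here is a minimal Python sketch of ranking candidate steps. It is only an illustration: the function name, the sample steps and their scores are all made up, and multiplying severity by probability is one assumed way to combine the two factors, not something the book prescribes.

```python
def pick_killer_items(candidates, limit=9):
    """Rank candidate steps by severity * probability and keep the top few.

    Each candidate is a dict with hypothetical keys:
      step        - the wording of the check
      severity    - how bad it is if the step is missed (1 = minor, 5 = severe)
      probability - rough chance the step gets skipped (0.0 to 1.0)
    """
    ranked = sorted(
        candidates,
        key=lambda c: c["severity"] * c["probability"],
        reverse=True,
    )
    return [c["step"] for c in ranked[:limit]]


# Made-up numbers for a system-down scenario, for illustration only.
candidates = [
    {"step": "Check disk space", "severity": 2, "probability": 0.3},
    {"step": "Call vendor emergency line", "severity": 5, "probability": 0.6},
    {"step": "Rename ticket category", "severity": 1, "probability": 0.2},
    {"step": "Post status update to stakeholders", "severity": 3, "probability": 0.8},
]

top_two = pick_killer_items(candidates, limit=2)
```

In practice the scores would come from reviewing past incidents rather than guesses, and the cut-off is a judgment call.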

Best Practices for Checklists

In writing The Checklist Manifesto, Gawande went to the guys who wrote the book on checklists to help him design his checklist for surgery. He visited with Daniel Boorman, a veteran pilot working for Boeing. The best practices according to Boorman are:

  • Decide on the pause points in the process where the checklist will be used.
  • Decide whether it will be a DO-CONFIRM or READ-DO type of checklist. With DO-CONFIRM you do the work from what you know and then stop and confirm the key steps were completed correctly. READ-DO is more like a recipe, where you complete and then check off each key step.
  • Keep the list to between 5 and 9 items. There is no perfect number, but shorter is generally better than longer. A checklist is NOT a SOP. Keep it only to the "killer items" that are bad if missed. In aviation and medicine they actually mean that literally.
  • Wording should be exact and not vague.
  • Keep it to a page or a screen view whenever possible.  
  • For complicated problems add pause points for communication related steps like Anticipated Critical Events in the WHO example.
  • It's important to decide who is driving the checklist.  In surgery it's better for the Surgeon not to do it and the same goes for the Pilot where the Co-Pilot may be the one to drive the checklist.  In some situations like the WHO example it's important to go through the list verbally. Multiple people can improve discipline and effectiveness but even one person using a checklist in key situations is better than winging it.
  • Completing the checklist should not take long, ideally a few minutes at most. This does not include the duration of the actual tasks that are completed in the process.
  • Checklists may start on a whiteboard, but to be truly effective they have to be tested, used and adjusted based on results. (I paraphrased - Boorman said something more akin to - they must be tested or simulated in use, as sometimes they are made by desk jockeys with no awareness of the situations in which they are to be deployed.)
  • The existence of a checklist by itself is of no use. In aviation, the checklist works because it is a vital part of the training as well as used and updated on a consistent basis.
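For an IT team, several of these practices can be captured in a small data structure. The following is a minimal Python sketch, not anything from Boorman or the book: the class names, fields and sample items are hypothetical, the sample list borrows the fictional system-down story above, and the 5 to 9 item range is treated as an advisory warning only.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChecklistMode(Enum):
    DO_CONFIRM = "do-confirm"  # do the work from memory, then pause and confirm
    READ_DO = "read-do"        # read each step, do it, then check it off


@dataclass
class Checklist:
    name: str
    pause_point: str           # where in the process the list is run
    mode: ChecklistMode
    driver: str                # who reads the list out and checks items off
    items: list[str] = field(default_factory=list)

    def warnings(self) -> list[str]:
        """Return advisory notes when the list drifts from the guidelines."""
        notes = []
        if not 5 <= len(self.items) <= 9:
            notes.append(f"{len(self.items)} items; aim for 5 to 9 killer items")
        return notes

    def run(self, confirm) -> list[str]:
        """Call confirm(item) for each item; return the items not confirmed."""
        return [item for item in self.items if not confirm(item)]


# Hypothetical checklist for the system-down story; wording is illustrative.
outage = Checklist(
    name="System down",
    pause_point="as soon as the first outage report is confirmed",
    mode=ChecklistMode.READ_DO,
    driver="help desk analyst on duty",
    items=[
        "Confirm the outage with the reporting user",
        "Call the vendor emergency line (number on the contact sheet)",
        "Post a status update to stakeholders",
        "Assign one person to answer incoming tickets",
        "Send the all-clear once service is restored",
    ],
)

# Simulate a run in which the status update was skipped.
missed = outage.run(lambda item: "status update" not in item)
```

Here `missed` holds the one skipped item, which is the kind of gap the pause point is meant to surface. The structure itself does nothing without the discipline to run it, test it and adjust it.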

Concluding Remarks

We are pre-wired to believe that complex problems require complex solutions.  How can a simple checklist help in situations where many smart, highly trained, experienced people are working hard to solve complicated and complex problems?  But we could be wrong.  The operating room and the flight deck are staffed by highly trained experts yet checklists have played a role in making both safer by reducing the impact of human error.  Sometimes simple solutions do work for complicated problems.

The lowly checklist has been proven to prevent major accidents, save lives and even help in IT. Yet it really hasn't caught fire like one would think. Perhaps it's our bias against simple solutions for complicated problems. Perhaps we feel our expertise is such that a checklist is beneath us. I have another view. The checklist is our way to overcome two of our weaknesses - the fallibility of our memory and our natural inclination against discipline. And you need discipline to create a checklist.

In aviation, the incentive for using checklists is pretty high - the cost of not using one may be your own life. In medicine, using one may preserve the life of your patient. In IT, when we make errors, most of the time it doesn't involve human life and death as it does in aviation and medicine - we can bring our patient back to life and move on after our high fives.

Rarely do we stop and ask what went wrong: was it a case of ineptitude, is there a way to keep it from happening again - would a checklist help?