This is the first installment in a three-part series on the correlation between reliability and safety.
Why Explore this Potential Correlation?
I recently presented at a conference called the Human Performance, Root Cause & Trending (HPRCT) conference. I listened with great interest to a presentation on Human Performance Improvement (HPI) by Dr. Todd Conklin and Dr. Sidney Dekker, advocating a 'Learning Team' approach. I came away from the conference with the conclusion that these new learning teams were being viewed as the basis for Human Performance Investigations. The speakers were certainly positioning them as a replacement for traditional RCA as known in the Maintenance and Reliability fields. So I wanted to know more: what is HPI?
Human Performance Investigation: HPI strives to understand and explain what happened without judgment, in order to understand the story and to provide a just and honest conclusion in each case. This gives the organization information that is incredibly comprehensive and makes it easier to identify what to correct than with ‘old school’ methods (Conklin, 2014, p. 45). HPI constructs the event context, and looks not at the individual pieces but at the relationships between those pieces (Conklin, 2014, p. 68).
To me, this was essentially contrasting the basis of a Safety investigation with what we in Reliability would call a ‘Root Cause Analysis’ or RCA. Being in the RCA business, this naturally piqued my curiosity.
I was not very familiar with this HPRCT conference, but I quickly learned that it was predominantly attended by progressive Safety professionals in high-hazard industries (especially power generation/nuclear).
This was the first time I had heard Root Cause Analysis (RCA) referred to as ‘old school’ and ‘obsolete,’ and by leading researchers and academics at that. This got me thinking: given that I have been in the RCA business for decades, is what I do for a living… obsolete?
To be honest, up until this point I had always assumed there was a direct correlation between Safety and Reliability, but I now realized that not everyone outside of the Reliability field feels the same. So I set out to understand why the differences in perspective exist, and whether there is a valid correlation between the two.
An Ironic LinkedIn (LI) Post Caught My Attention
Shortly after this conference, I came across this graphic (See Figure #1) used in a LinkedIn post. It was quite a hot topic based on the responses it received.
Figure #1. The Application of the Heinrich Pyramid to the DIPF Curve
The paper cited in the post drew the following conclusions from this graphic:
- The probability of an injury is significantly increased with non-routine maintenance activity resulting from equipment failures.
- Connecting the importance of human safety to the importance of equipment reliability is critical in driving an injury-free culture.
While this appears to make logical sense on the surface, is it true? Does a direct correlation exist between Reliability and Safety as these conclusions suggest? I wanted to understand the reasoning as to why experts in the Safety world would not agree with this expression of such a correlation.
A very prevalent position in Safety is that the Heinrich Pyramid was debunked decades ago, so that is one reason Safety experts would likely not fully agree with overlaying this Safety pyramid on the curve.
In an article entitled, ‘Examining the Foundation: Were Heinrich’s Theories Valid and Do They Still Matter?’ James Howe (Safety Solutions in Medford, OR) is quoted as stating the following:
“The pyramid theory has really done a disservice to the safety profession because it has misled people running safety programs into thinking that if they work on minor incidents, major incidents will go away. And many, many companies are aware that that is not the case. In fact certain companies with award-winning low injury rates have suffered some of the worst catastrophic incidents during the past 10 years.”
So as you can tell, there is no love lost for Heinrich’s research among many in the Safety community. However, I am speaking in generalities to see if there is a valid correlation between injury rates and organizational Reliability, and am not seeking a debate on the validity of Heinrich’s pyramid.
Keep in mind as you read this paper that comparisons are being made between the perspectives of Safety researchers/academics and those of career Reliability practitioners in the field. I think those dynamics play a role in the world view of both groups.
The Safety Research Perspective
As part of my exploration, I read Dr. Nancy Leveson’s ‘Engineering for a Safer World: Systems Thinking Applied to Safety.’ Dr. Leveson is a highly respected researcher, and her text is considered the ‘Safety Bible’ by many. I will add that I thoroughly enjoyed the read and learned a great deal. I pulled the following relevant excerpts from this text:
"Assumption 1: Safety is increased by increasing system or component reliability. If components or systems do not fail, then accidents will not occur.
This assumption is one of the most pervasive in engineering and other fields. The problem is that it is not true. Safety is a system property, not a component property, and must be controlled at the system level, not the component level." (Leveson, 2011, p. 7)
Her proposed ‘New Assumption’ was stated as:
"New Assumption 1: High reliability is neither necessary nor sufficient for safety." (Leveson, 2011, p. 13)
This contradicts the common belief that there is a direct correlation between Safety and Reliability. Having been in the Reliability field for 30+ years, I have always believed there is a correlation between Reliability and Safety, but I would assert it is not a direct correlation. This is because we can have a reliable operation that is still unsafe, and we can also have a safe operation that is unreliable. As a word of caution, please note that correlation is not necessarily causation.
But I firmly believe (and have experienced) that a reliable operation is inherently a safer operation than an unreliable one. In a reliable operation, there are fewer stops and starts and fewer unexpected situations that deviate from the control systems in place (requiring a reactive response). It stands to reason, then, that under reliable conditions there is less need to quickly correct a deviation from a standard or norm.
However, Reliability is viewed by many in Safety as strictly a component property and as not having system properties (as Safety does). Many in Reliability would take issue with that assumption. But we have to concede that while we experience Safety incidents due to poor Reliability, we also experience Safety incidents that have nothing to do with operational (component) Reliability. Injuries occur all the time in areas unrelated to the operation of an industrial facility.
As the DIPF curve clearly expresses (Figure #1), there are many facets to an effective Reliability process. For the purposes of trying to draw this correlation, I wanted to focus on understanding what detracts from optimal Reliability. If we better understand the systemic reasons why we have unexpected outcomes, would closing that gap make our workplace safer?
I have come to learn that, based on the perspectives and definitions regarding Root Cause Analysis (RCA) in Safety, their approach and goals are different from those in Reliability. This is important because in the Reliability field, effective RCA is critical to optimizing Reliability. We have to ‘control the fix’ and not let the ‘fix control the operation’.
Based on definitions and descriptions I have read and heard at these Safety conferences, many seem to assume that all RCA is only as comprehensive as the traditional 5-Whys approach. They view ‘RCA’ as always following a linear path. Unfortunately, that is not how failure occurs in the real world. The only RCA approach I know of that is strictly linear is the traditional application of the 5-Whys. Any investigator worth their salt knows that most failure paths occur simultaneously and converge at some point to cause a bad outcome.
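To make the contrast between a linear path and converging paths concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the cause map, the event names, and the two helper functions (`linear_whys` and `all_root_causes`) are hypothetical and are not taken from any RCA tool or from the sources cited in this paper.

```python
from typing import Dict, List

# Hypothetical cause map: each event points to the causes that contributed to it.
# Every event name here is invented for illustration only.
CAUSES: Dict[str, List[str]] = {
    "pump failure": ["seal wear", "missed inspection"],
    "seal wear": ["wrong lubricant"],
    "missed inspection": ["understaffed shift"],
    "wrong lubricant": ["unclear procedure"],
    "understaffed shift": [],
    "unclear procedure": [],
}


def linear_whys(event: str, causes: Dict[str, List[str]], max_whys: int = 5) -> List[str]:
    """Follow only the first listed cause at each step, as a strictly linear 5-Whys would."""
    path = [event]
    for _ in range(max_whys):
        next_causes = causes.get(path[-1], [])
        if not next_causes:
            break
        path.append(next_causes[0])
    return path


def all_root_causes(event: str, causes: Dict[str, List[str]]) -> List[str]:
    """Walk every branch and collect each event that has no further cause."""
    roots: List[str] = []
    seen = set()
    stack = [event]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        next_causes = causes.get(current, [])
        if not next_causes:
            roots.append(current)
        stack.extend(next_causes)
    return sorted(roots)


if __name__ == "__main__":
    print(linear_whys("pump failure", CAUSES))
    print(all_root_causes("pump failure", CAUSES))
```

In this made-up example, the linear walk ends at a single cause and never visits the staffing branch, while the branching walk surfaces both contributing paths that converge on the same bad outcome.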
Safety also views ‘RCA’ as a tool where the deliverable is a single ‘root cause’, and that identified root cause is usually mechanical in nature (at the component level). Again, that perception of ‘RCA’ is simply inaccurate when compared to the realities of the proper field applications of RCA by seasoned investigators.
Just as Reliability is mischaracterized by Safety as not having holistic properties (viewing an organization as a system, not merely a series of mechanical components), it appears the same type of mischaracterization is taking place with their grossly limited view of ‘RCA’.
The traditional 5-Whys approach (See Figure #2) simply lacks the depth and comprehensiveness to effectively analyze the more serious and complex incidents. While the 5-Whys has its positive attributes when applied under appropriate conditions, its expression of linearity and a singular cause limits its capabilities when investigating complex incidents with simultaneous paths to failure and complex interdependencies.
Figure #2. The Traditional 5-Whys Approach
Safety also appears to have a different approach to, and purpose for, conducting an RCA. This perspective is based on personal observations: when there is a reportable injury and/or fatality, typically the wheels are quickly set in motion to first ensure that all appropriate policies and procedures are in place to meet regulatory requirements. The first priority is often to ensure the proper safety controls and infrastructure were in place, and therefore that the corporation is less likely to be liable for any present and future claims. Once that base is covered (knowing all the paperwork and ‘rules’ were in place at the time of the incident), the search moves towards ‘who’ violated the rules/controls. This is typically followed by blame and discipline. Again, I speak in generalities, because there are much more progressive organizations that do not subscribe to this particular approach to a Safety investigation. Certainly advocates of Human Performance Improvement and Just Culture do not support this blame mentality, but seek a systems understanding as well.
This bottom-up approach to understanding bad outcomes is the opposite of what I am used to in the Reliability world. As evidenced by the excerpt mentioned earlier from Dr. Leveson, many in Safety do not seem to believe that human performance and systems thinking are a critical part of an effective Reliability strategy, when in fact Reliability is not strictly component-based, but systems-based.
Figure #3. Differing RCA Approaches between Reliability & Safety