This is the second installment in a three-part series on the correlation between reliability and safety.
Let’s now explore the Reliability practitioner’s perspective.
The Reliability Practitioner’s Perspective
Since optimizing Reliability has a great deal to do with thoroughly understanding gaps (expected and unexpected) in performance, RCA plays a big role in this understanding. If ‘shallow cause analysis’ is practiced, as opposed to ‘root cause analysis’, then such gaps in performance may persist because failures will tend to repeat. So how we analyze deviations from an operational potential is critically important.
There is an emerging group of Safety professionals who advocate Dr. Dekker’s ‘Safety Differently: Human Factors for a Different Era’ approach. This group views all ‘RCA’ through a very narrow lens and therefore believes it is of limited value.
I believe their view and definition of RCA is not accurate at all, and does not represent the reality in the field for those who practice effective RCA approaches on a daily basis. Sure, there is an abundance of ‘shallow cause analysis’ approaches that masquerade as RCA, but that happens in every field (e.g., RCM). We don’t discount the value of an entire discipline because of the actions of a few bad actors within it. We move on and let our bottom-line results do the talking for us.
Here is an example of this narrow view of RCA from the Safety field. This is Dr. Conklin’s published definition of RCA:
Root Cause Analysis: RCA is widely viewed as a reactive tool that requires a high severity trigger in order to be applied. The trigger could be excessive costs/downtime, regulatory violation, injury and/or death. RCA is often associated with being a tool applied effectively, only on mechanical (physical) failures. In a classic RCA, it deconstructs the event down to its minutest part, analyzes those parts and fixes whatever is broken (Conklin, 2014, p. 68)
In my white paper, ‘Do Learning Teams Make RCA Obsolete?’, I go into great depth to describe the elements of an effective, holistic RCA approach. Taken at face value, Dr. Conklin’s definition of RCA concludes that:
- RCA is strictly a reactive tool triggered only by severe incidents, and
- RCA is applicable primarily to mechanical failures, and
- RCA stops at the component level and identifies only a single, physical root cause
As a career Reliability practitioner with a specialty in RCA, I do not find the above conclusions to be an accurate representation of the practical application in the field.
As shown in Figure #3, in the Reliability world, the investigation starts with the available evidence (facts) and then strives to reconstruct the failure. As the reconstruction drills backwards in time, we will normally come across causes attributable to the physical nature of the failure (i.e., erosion, corrosion, fatigue and overload). However, a true RCA will continue to drill down and understand how those physical conditions came to be.
Inevitably, we will come across a human element where there was a decision error. These are usually errors of omission (we didn’t do something we should have) or errors of commission (we did something we shouldn’t have). It is here that it would be easy to blame the decision-maker. However, the more sophisticated operations and investigators realize that it is at this point that the investigation is really just beginning.
This is because the goal of a true RCA is to understand the reasoning behind the decision errors (the ‘whys’ or ‘sensemaking’), and not necessarily who made the bad decision.
Figure #4. Reconstruction of an Event When Applying an Effective RCA Approach
The reality from an RCA practitioner’s perspective is:
- RCA can be applied proactively to analyze why unacceptable risks exist (as defined by risk assessment tools like FMEA), as well as near-misses (and even high-frequency, low-cost chronic failures)
- RCA can be applied to any gap in performance of any kind. Simple examples include unmet financial expectations, customer complaints, late deliveries, and poor sales performance
- An effective RCA does not stop at the component level. It explores why good people made bad decisions at the time, and does not seek to blame someone. It goes on to ask ‘why did the decision-maker think it was the right decision at the time?’ The answers to these questions will uncover organizational system flaws, restraining paradigms, cultural norms and sociotechnical factors that influence those decisions. This level of depth is where the gold is…but only if we choose to explore that level of depth!
Where the investigator stops the reconstruction effort will determine how effective the RCA will be. If we stop at the physics of the failure (e.g., replacing broken parts), or worse yet at blaming someone, then we will likely see a recurrence of the event. If we stop short of understanding human reasoning or system deficiencies, then I consider this a ‘shallow cause analysis’ approach. We may be compliant by regulatory standards, but that does not mean we are safer.
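To make the stopping-point idea concrete, here is a minimal sketch in Python. It is not part of any RCA tool or standard: the level names (`PHYSICAL`, `HUMAN`, `SYSTEMIC`), the `Cause` class, the helper functions and the bearing example are all assumptions invented for illustration. It simply labels a reconstruction by the deepest kind of cause it reached.

```python
from dataclasses import dataclass
from enum import IntEnum


class CauseLevel(IntEnum):
    """Depth reached while reconstructing an event (illustrative labels)."""
    PHYSICAL = 1   # e.g. erosion, corrosion, fatigue, overload
    HUMAN = 2      # a decision error of omission or commission
    SYSTEMIC = 3   # why the decision made sense: procedures, norms, paradigms


@dataclass(frozen=True)
class Cause:
    description: str
    level: CauseLevel


def deepest_level(chain):
    """Return the deepest level the reconstruction reached."""
    if not chain:
        raise ValueError("cause chain is empty")
    return max(cause.level for cause in chain)


def classify_analysis(chain):
    """Label an analysis by where the investigator stopped."""
    if deepest_level(chain) >= CauseLevel.SYSTEMIC:
        return "root cause analysis"
    # Stopping at the broken part, or at the person, is shallow.
    return "shallow cause analysis"


# Hypothetical event, invented for illustration only.
chain = [
    Cause("bearing failed from fatigue", CauseLevel.PHYSICAL),
    Cause("lubrication task was skipped", CauseLevel.HUMAN),
]
print(classify_analysis(chain))

# Drilling one level further changes the classification.
chain.append(
    Cause("obsolete procedure omitted the lubrication step", CauseLevel.SYSTEMIC)
)
print(classify_analysis(chain))
```

In this sketch, the first chain stops at the decision error and is labeled shallow. Only after the systemic cause is added does the same event qualify as a root cause analysis.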
Uncovering system deficiencies will affect much more than the event we are investigating. This is because systems are created for populations of people and not just a single person. So when there is a system flaw (e.g., an obsolete procedure that remains in place), the potential for an undesirable outcome is higher because other people using that system may make similar erroneous decisions.
Each of the intervals cited on the DIPF curve in Figure #1 could contribute to an overall lapse in Reliability if it is not functioning as intended. This simply means that an effective RCA could capture that deficiency and drill down to its systemic roots.
We can see that holding such a narrow view of RCA would certainly contribute to poor Reliability performance. If RCAs don’t delve into understanding human performance and human factors, then the risk of recurrence is greater, and hence so is the risk of harm to employees.
An excerpt from the BP U.S. Refinery Independent Safety Review Panel is relevant at this point: “Preventing process accidents requires vigilance. The passing of time without a process accident is not necessarily an indication that all is well and may contribute to a dangerous and growing sense of complacency. When people lose an appreciation of how their safety systems were intended to work, safety systems and controls can deteriorate, lessons can be forgotten, and hazards and deviations from safe operating procedures can be accepted. Workers and supervisors can increasingly rely on how things were done before, rather than rely on sound engineering principles and other controls. People can forget to be afraid.”
Dekker, Sidney. 2014. Safety Differently: Human Factors for a Different Era. Boca Raton: CRC Press.
Latino, Robert. Do Learning Teams Make RCA Obsolete? Accessed on 1.17.18 at https://reliability.com/pdf/rca-vs-hpi-2017-rci.pdf.
BP U.S. Refinery Independent Safety Review Panel. 2007. Page i.