I'm attempting my first asset criticality analysis, and I have a couple of questions:
1.) When we score each category, are we assuming that the worst plausible scenario occurs? For example, for safety: we have rotogravure presses with Class 1 Div 1 areas. The worst-case scenario would be an explosion that could result in death.
2.) I received an asset criticality analysis template from the Mobius Institute (iLearnReliability), and it multiplies a single likelihood by the sum of all the consequence scores (for example, safety, environmental, production, quality). I'm confused by this, since the likelihood of an explosion in a Class 1 Div 1 area (as mentioned above) should be low compared to the likelihood of a production loss of over $10k. I would think that we should assign a different likelihood to each category.
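A small sketch may make the difference between the two scoring approaches concrete. The 1-5 scales, category names, and scores below are hypothetical, and the template's actual formula may differ in details beyond what is described above.

```python
# Hypothetical 1-5 consequence scores for one asset (not from any real template).
consequence = {"safety": 5, "environmental": 2, "production": 3, "quality": 2}

def template_score(likelihood: int, consequence: dict) -> int:
    """Template-style score: one likelihood multiplied by the sum of all consequence scores."""
    return likelihood * sum(consequence.values())

def per_category_scores(likelihood_by_cat: dict, consequence: dict) -> dict:
    """Alternative: each category gets its own likelihood x consequence product."""
    return {cat: likelihood_by_cat[cat] * consequence[cat] for cat in consequence}

# Hypothetical likelihoods: an explosion is rare, a production loss is more common.
likelihood_by_cat = {"safety": 1, "environmental": 1, "production": 3, "quality": 2}

print(template_score(2, consequence))
print(per_category_scores(likelihood_by_cat, consequence))
```

With a single likelihood, the high safety consequence is multiplied by the same likelihood as every other category; with per-category likelihoods, the rare explosion scenario and the more frequent production loss are scored separately.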
Any help would be greatly appreciated!
Andrew, email me and I can share some info. email@example.com
Always consider a SINGLE failure at the worst time: high flow, high demand, bad weather, etc. Only consider a secondary failure for safety devices. For example, a fire extinguisher, fire alarm, or smoke detector has no consequence of failure unless there is a secondary failure: a fire. I believe the same is true of your Class 1 Div 1 scenario. The consequence only occurs if you have TWO failures: a bad atmosphere (high LEL) and an ignition source. I would assume that you have gas detectors to warn you of a high LEL.
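The two-failure point can be put into rough numbers. This sketch assumes the failures are statistically independent and uses made-up annual probabilities purely for illustration; real values would have to come from your own data or a PHA.

```python
import math

def prob_all_occur(*probs: float) -> float:
    """Chance that every listed failure occurs in the same period,
    assuming the failures are independent (an assumption, not a given)."""
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return math.prod(probs)

# Hypothetical annual probabilities, not data from any real press.
p_high_lel = 0.1        # atmosphere reaches a high LEL (e.g. ventilation failure)
p_ignition = 0.05       # an ignition source appears (e.g. a guard rubbing a cylinder)
p_detector_miss = 0.1   # gas detection fails to warn in time

print(prob_all_occur(p_high_lel, p_ignition))                   # two coincident failures
print(prob_all_occur(p_high_lel, p_ignition, p_detector_miss))  # plus a detector miss
```

Each additional independent failure multiplies the probability down, which is why a scenario that needs two or three coincident failures rates as far less likely than either failure alone. A common-cause event that both raises the LEL and disables detection would break the independence assumption.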
Thanks for your help, Jeffrey.
Just wanted to make sure that I understand. For example, we had a guard on our rotogravure press that fell and rubbed on a rotating cylinder, which would have resulted in a spark and an explosion if the operator hadn't caught it. Would the guard falling and causing a spark be considered the single failure?
How would you rate the likelihood of this failure since it was the first time that it happened in over 20 years?
I guess I don't understand the environment in the area. If the area ALWAYS has a high LEL, then you have bigger problems. If the area is ventilated and typically does not have a high LEL, then your example involves two failures: one, the guard failed; two, the environment reached a high LEL. I'm not sure what the failure would be for a high-LEL environment. Maybe failed ventilation, a broken pipe, a seal failure, etc.
As far as likelihood, I would say it is an unlikely failure mode, based on the 20-year time frame.
I have seen likelihood of failure typically scored with two or more categories. The most common is asset condition, the theory being that an asset in bad condition is more likely to fail. The other very common category is O&M protocols. This relates to how robust your PM program is, in terms of both detailed PMs and compliance.
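As an illustration of scoring likelihood from more than one category, here is a minimal sketch. Averaging the two scores and rounding up is an assumption, not a standard rule; some organizations weight the categories differently or simply take the worse of the two.

```python
import math

def likelihood_score(condition: int, om_protocols: int) -> int:
    """Combine two 1-5 category scores (5 = worst condition / weakest PM program)
    into one 1-5 likelihood score by averaging and rounding up (an assumption)."""
    for score in (condition, om_protocols):
        if not 1 <= score <= 5:
            raise ValueError("category scores must be between 1 and 5")
    return math.ceil((condition + om_protocols) / 2)

# Hypothetical asset: poor condition (4), but a solid, well-followed PM program (2).
print(likelihood_score(4, 2))
```

Rounding up keeps the combined score on the conservative side when the two categories disagree.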
I see some great guidance from the others who have replied. As it pertains to the failure of the guard, remember to use the "five whys" in your failure analysis. While your criticality analysis is based on your knowledge of the process, failure mechanisms that are considered "off-normal" might lead you to other practice improvements that eliminate such mechanisms (e.g., maintenance standards, post-maintenance inspections, check-offs, etc.).
All the best,
I have worked with a number of clients to establish an asset criticality assessment (ACA) process. It's important to remember that the ACA process is intended to help you optimize your resource investment in improving reliability. It is meant to identify the most "critical" systems and assets and the least critical ones. Knowing whether an asset is number 12 or number 10 on that list isn't the goal. What is important, therefore, is a fact-based ranking method that different teams can perform consistently and cost-effectively.
As you and others point out, you need to determine the likelihood of an event (i.e., an event caused by an asset failure) and the impact of that event. As you note, that is generally done for multiple categories, and one event may have different consequences in each category. This leads to a choice of how to treat the various "risk" results by system and by asset. The two methods we used were either to use the highest single risk (i.e., probability × consequence) or to use the sum of the top risks by category. The rank order of the risk results is very dependent on how you choose to quantify the consequences of an asset failure. How does a serious personnel injury event compare to a day of lost production? All of that said, if you have a well-documented definition and process, you can present and defend your results to your stakeholders with confidence. Debate can then focus on how you quantified the risks. If that changes, you can quickly update your rankings, and even then, the rank order will likely not change dramatically.
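The two aggregation methods can be sketched as follows. The asset names, the 1-5 likelihood and consequence scores, and the choice of summing the top two category risks are all hypothetical.

```python
# Hypothetical assets: category -> (likelihood 1-5, consequence 1-5).
assets = {
    "asset_A": {"safety": (1, 5), "production": (3, 3), "quality": (2, 2)},
    "asset_B": {"safety": (1, 2), "production": (4, 4), "quality": (3, 3)},
    "asset_C": {"safety": (2, 3), "production": (2, 2), "quality": (3, 4)},
}

def category_risks(asset: dict) -> list:
    """Risk per category = likelihood x consequence."""
    return [likelihood * consequence for likelihood, consequence in asset.values()]

def highest_single_risk(asset: dict) -> int:
    """Method 1: the single largest category risk."""
    return max(category_risks(asset))

def sum_of_top_risks(asset: dict, top_n: int = 2) -> int:
    """Method 2: sum of the top_n category risks (top_n is a hypothetical choice)."""
    return sum(sorted(category_risks(asset), reverse=True)[:top_n])

def rank(assets: dict, method) -> list:
    """Asset names ordered from highest to lowest score under the given method."""
    return sorted(assets, key=lambda name: method(assets[name]), reverse=True)

print(rank(assets, highest_single_risk))
print(rank(assets, sum_of_top_risks))
```

In this made-up example both methods give the same order; with other scores they can disagree, which is one reason to document the chosen method.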
I know it's a long answer to a short question.
Thanks for your help, I appreciate the "long answer". This is the risk matrix that we developed. We are only defining the consequence and scaling it based on what we felt was more important.
Forgive my ignorance, but I'm still confused about how to score the likelihood. For example, if safety = 4 (may result in death) and quality = 4, how should I score the likelihood of the asset failure? Is it based on how often the asset fails (MTBF; note that we didn't include MTBF in the matrix because we just got a new CMMS and do not have that data), or on the likelihood that the failure may result in death (due to the safety score) and/or a significant quality defect?
There are different ways to approach this as you have seen. I agree that as long as you follow typical conventions and have a well documented process you will get good results you can stand by.
1.) Single-point failure is important to define and consider in this analysis. If the asset in question could result in a hidden failure (the smoke alarm example), I mark it as such and treat these items separately. We perform both PHA and ACA analyses. The PHA addresses the worst-case scenario for process safety. For our ACA, which is more focused on reliability, I look at the most likely consequence of failure at the likely point of first detection. I typically would not focus on the worst case imaginable for my ACA. That said, I don't know your process, so in your Class 1 Div 1 example it may be reasonable to expect an explosion resulting in death.
2.) You should be considering the likelihood of the asset failure. The asset fails when the asset fails. The safety consequence is very high, and that will be reflected in the final results. The actions that reduce the likelihood of the asset not performing its intended function are different from the actions that reduce the consequence. There may be exceptions, but I believe that is typically how it is treated.
Thanks for your help, Jesse. I replied to a couple of the other messages and would love your feedback as well...
For example, we had a guard on our rotogravure press that fell and rubbed on a rotating cylinder, which would have resulted in a spark and an explosion if the operator hadn't caught it. Would the guard falling and causing a spark be considered the single failure? How would you rate the likelihood of this failure since it was the first time that it happened in over 20 years?
Forgive my ignorance, but I'm still confused about how to score the likelihood. For example, if safety = 4 (may result in death) and quality = 4, how should I score the likelihood of the asset failure? Is it based on how often the asset fails (MTBF; note that we didn't include MTBF in the matrix because we just got a new CMMS and do not have that data), or on the likelihood that the failure may result in death (due to the safety score) and/or a significant quality defect? (This is the risk matrix that we developed. We are only defining the consequence and scaling it based on what we felt was more important.)
As previously mentioned, it seems you're trying to create a Risk Matrix rather than a criticality matrix. It is highly advisable to keep the two concepts clearly distinct.
That said, to answer your questions specifically:
1- In the context of a risk matrix, your example: "we had a guard on our rotogravure press that fell and rubbed on a rotating cylinder, which would have resulted in a spark and an explosion if the operator hadn't caught it. Would the guard falling and causing a spark be considered the single failure?" Yes, this is a potential failure and needs to be assessed as such.
2- Your second question: "How would you rate the likelihood of this failure since it was the first time that it happened in over 20 years?" Typically, you would use historical data within the organization to assess the frequency criteria properly. If that isn't available, I suggest using a maintenance and reliability database for the industry (similar to OREDA for O&G; I've used it in the past for this type of exercise). If none of those are available to you, feedback from subject matter experts (SMEs) is also a valid alternative; SMEs can be internal or external to the organization. Last but not least, you need to consider as many past events of this type as possible to assess it well. For your specific example, you might put this event in the "very unlikely to happen" category because it has never happened in your organization, but what about similar companies using the same equipment? What about other companies with similar applications? What about the industry as a whole? The more information you have, the better your assessment will be.
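Turning event counts into a frequency, and the frequency into a likelihood score, might look like the sketch below. The frequency bands and the pooled industry numbers are hypothetical; your own matrix defines the real bands. With one event in 20 years the point estimate is very uncertain, which is one reason pooling data from similar equipment helps.

```python
def annual_frequency(events: int, unit_years: float) -> float:
    """Simple point estimate: events per unit per year of operating exposure."""
    if unit_years <= 0:
        raise ValueError("unit_years must be positive")
    return events / unit_years

# Hypothetical bands: (minimum events per unit-year, likelihood score), highest first.
FREQUENCY_BANDS = [(1.0, 5), (0.1, 4), (0.01, 3), (0.001, 2)]

def likelihood_from_frequency(freq: float) -> int:
    """Map a frequency onto a 1-5 likelihood score using the bands above."""
    for threshold, score in FREQUENCY_BANDS:
        if freq >= threshold:
            return score
    return 1

# Own site only: 1 event on one press in 20 years.
own = annual_frequency(1, 20)
# Hypothetical pooled data: 3 events across 40 presses observed for 15 years each.
pooled = annual_frequency(3, 40 * 15)

print(own, likelihood_from_frequency(own))
print(pooled, likelihood_from_frequency(pooled))
```

The wider exposure base changes both the estimate and, here, the resulting score.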
You should ultimately get a Risk Matrix similar to OREDA's (see the attached example).
Last two comments:
1- For asset criticality, I recommend checking both the article shared by Jesse and the spreadsheet shared by Jeremie. They both nail it in terms of concept and application.
2- From my experience, creating the Risk Matrix is a joint effort involving not only M&R but also EHS, Production, Finance, and Management. I mention this because the resulting risk matrix must be aligned with each department's metrics, goals, and criteria. I strongly recommend getting those departments involved in the exercise, so that the resulting document comes out of a richer discussion and is also easier to apply.
I hope it helps.
Rafael Cardenas, CMRP
Reliability Engineer
Stemcell Technologies
Vancouver, BC
Consider looking at TRIAD Relialytics
You've received some solid guidance from the other posters, but I wanted to point out an important aspect of your question that is a major factor in how you can use the results of your analysis.
The analysis tool you're using is asking for both the consequences and the probability of a failure event for each asset. (Consequence of failure) * (Probability of failure) is actually the formula for risk, not criticality. While you definitely want to base your asset decision-making around the risk of failure, it's important to keep in mind that the probability of failure will change whenever the condition of the asset changes (e.g. maintenance intervention, degradation from use, asset replacement). The consequence of failure will only change when the asset function changes (e.g. system re-design).
Therefore, I prefer to keep asset criticality and asset risk as separate ideas and values. Changes in asset criticality (consequence of failure) can be managed through your Management of Change process. Changes in asset risk (consequence * probability) require more dynamic management, with inputs from online monitoring, inspection results, and completed work orders.
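One way to keep the two values separate in an asset register is sketched below. The field names, the example asset, and the 1-5 scales are hypothetical; the idea is that risk is derived from criticality and the current probability, so a condition update changes risk without touching criticality.

```python
from dataclasses import dataclass

@dataclass
class AssetRecord:
    name: str
    consequence: int   # criticality (1-5); changed only through Management of Change
    probability: int   # likelihood of failure (1-5); refreshed from condition data

    @property
    def criticality(self) -> int:
        return self.consequence

    @property
    def risk(self) -> int:
        return self.consequence * self.probability

    def update_condition(self, new_probability: int) -> None:
        """Record a condition change (inspection finding, repair, replacement)."""
        if not 1 <= new_probability <= 5:
            raise ValueError("probability score must be between 1 and 5")
        self.probability = new_probability

# Hypothetical asset: high consequence, currently in good condition.
fan = AssetRecord("dryer_exhaust_fan", consequence=4, probability=2)
print(fan.criticality, fan.risk)
fan.update_condition(4)   # e.g. vibration monitoring shows bearing degradation
print(fan.criticality, fan.risk)
```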
Feel free to message me if you'd like to discuss further.
Great point, Brian. I recently came across this important concept in an article on the definitions of risk vs. criticality.
I was working with a client a while back that had a very robust criticality matrix that he graciously gave me permission to share with folks. The parameters can obviously be catered to your specific facility/needs, but it is a fantastic start in my opinion.