All Member Open Forum

  • 1.  Data Center Reliability

    Posted 21 days ago

    Looking for any members working on driving reliability programs into data centers. It seems to me that reliability (uptime) in data centers is driven by redundancy rather than by condition monitoring and predictive maintenance. There are pockets of thermography and some oil analysis, but rarely as part of a programmatic approach. Any views?



    ------------------------------
    Mike Doolan
    Global Technical and Reliability Director
    CBRE
    ------------------------------


  • 2.  RE: Data Center Reliability

    Posted 19 days ago

    Hi Mike. 

    It's great seeing you on here, and that's a great question.  I've always felt like (for ultrasound) there's a tremendous opportunity within data centers.  I've actually had a data center company reach out recently, and I have an initial online meeting with them next week to talk about the implementation of ultrasound and vibration analysis.  They are already using infrared thermography, but want to expand use of other condition monitoring technologies.  I'll pose your question to them and get some feedback for you.  

    Thanks Mike. 



    ------------------------------
    Adrian Messer, CMRP, CRL
    Business Development Manager, USA
    SDT Ultrasound Solutions
    adrian.messer@sdtultrasound.com
    (864)314-9898
    ------------------------------



  • 3.  RE: Data Center Reliability

    Posted 18 days ago

    Hello Mike,

    I've worked on several data center reliability initiatives where we focused on implementing asset lifecycle management and reliability principles. Some of the principles we introduced were criticality assessments, FMEAs, adjusting maintenance plans based on the findings, and identifying opportunities where predictive technologies could or should be introduced.

    From my perspective, data centers are like any other industry: you have to clearly demonstrate the value of doing the upfront reliability work and present a solid business case.

    As most people in this forum know, reliability isn't something you can simply buy and suddenly "have." It's a planned journey that usually requires a cultural shift - and that's often where things fall apart. I've been in facilities (not data centers) where ultrasound, thermography, and vibration tools were sitting on a shelf unused because the reliability champion left and the culture was never established. That cultural foundation is the hard part, which is why many organizations default to redundancy, excessive spares, or "duct tape and baling wire" approaches to failure. Keep promoting the wins you get from your reliability initiatives; there are folks in the DC world listening.





    ------------------------------
    Hank Kocevar, CMRP
    Retired Consultant
    ------------------------------



  • 4.  RE: Data Center Reliability

    Posted 17 days ago

    Hi Mike,

    Many sites already utilize thermography, but there is still significant room to expand into vibration analysis, ultrasound, oil analysis, coolant monitoring, contamination control, and generator fuel reliability as part of a broader predictive maintenance strategy. One challenge I have noticed is that many facilities already have massive amounts of sensor and operational data available through BMS, SCADA, DCIM, and other monitoring systems, yet converting that information into actionable maintenance decisions can still be difficult operationally.

    I also think there is still a cultural component within parts of the industry, particularly in some legacy facilities, where redundancy and reactive response models have historically driven reliability strategies more than fully integrated predictive maintenance programs. As these environments continue increasing in complexity, cooling demand, and power density, there is a tremendous opportunity for sensors and condition monitoring technologies to provide earlier visibility into developing issues before they become operational events.

    To me, the real value is not simply adding more sensors or collecting more data, but turning that information into actionable maintenance strategies that help prevent downtime, improve planning, and reduce operational risk.



    ------------------------------
    Linda Perry
    Senior Business Development Executive
    The Viswa Group
    American Canyon CA
    ------------------------------



  • 5.  RE: Data Center Reliability

    Posted 15 days ago

    💯 Right on target with your assessment, Linda.



    ------------------------------
    Hank Kocevar, CMRP
    Retired Consultant
    ------------------------------



  • 6.  RE: Data Center Reliability

    Posted 14 days ago

    Linda's point is exactly where I think the opportunity is. Data centers have often been designed for reliability through redundancy, but redundancy only protects uptime if the redundant assets are healthy, independent, and ready to perform.

    Condition monitoring should be used to validate the redundancy strategy, not replace it. That means turning existing data into maintenance decisions tied to asset criticality and failure modes.

    I would also add that water systems need to be part of that reliability conversation. Cooling towers, condenser water, closed loops, filtration, corrosion control, biological control, makeup water quality, and emerging liquid-cooling infrastructure can all create hidden degradation or common-cause risks across redundant cooling assets.

    Two chillers, pumps, towers, or heat exchangers may look redundant on paper, but if they share the same fouling mechanism, corrosion risk, biological issue, water quality instability, or control problem, the practical resilience may be weaker than assumed.
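
    To put rough numbers on that point, here is a minimal sketch using the beta-factor approximation, a common simplified way to model common-cause failures in a two-unit redundant pair. The function name and the inputs (a 1% per-unit failure probability on demand, and 10% of failures assumed to share a common cause) are illustrative assumptions, not data from any site.

```python
def pair_failure_probability(q: float, beta: float) -> float:
    """Approximate probability that BOTH units of a redundant pair fail on demand.

    q    -- assumed failure probability of a single unit on demand
    beta -- assumed fraction of unit failures that stem from a shared cause
            (fouling, corrosion, water quality, controls, ...)

    Beta-factor approximation: independent double failure + common-cause failure.
    """
    if not (0.0 <= q <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("q and beta must both be between 0 and 1")
    independent = ((1.0 - beta) * q) ** 2  # both units fail for unrelated reasons
    common_cause = beta * q                # one shared mechanism takes out both
    return independent + common_cause


# Hypothetical inputs for illustration only.
on_paper = pair_failure_probability(q=0.01, beta=0.0)     # fully independent units
shared_risk = pair_failure_probability(q=0.01, beta=0.1)  # 10% common-cause share

print(f"independent pair: {on_paper:.6f}")     # about 0.000100
print(f"with common cause: {shared_risk:.6f}")  # about 0.001081
```

    Under these assumed inputs, the pair with a shared degradation mechanism is roughly an order of magnitude more likely to fail together than the "on paper" redundancy suggests, because the common-cause term dominates the result.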

    So to me, the question is not whether predictive maintenance replaces redundancy. It is whether condition monitoring can continuously prove that the redundancy strategy will actually perform when called upon.



    ------------------------------
    Gregory Peter
    United States
    Aquanomix
    Apex NC
    ------------------------------



  • 7.  RE: Data Center Reliability

    Posted 16 days ago

    Very interesting approach, Mike. I believe that reliability in data centers must be managed holistically, combining condition monitoring, criticality analysis, and predictive strategies aligned with the business. More than isolated activities, the challenge lies in building a reliability culture oriented toward maximizing availability and reducing operational risk.



    ------------------------------
    Christian Vegas Mori
    Maintenance Scheduling Supervisor
    OIG Peru
    El Alto
    ------------------------------



  • 8.  RE: Data Center Reliability

    Posted 14 days ago

    Hi Mike. 

    In addition to the points above, I would emphasize that predictive maintenance must be connected to a clear execution model. Many organizations have already invested in monitoring platforms, sensors, and automation tools, but the reliability value is only realized when those insights are converted into prioritized work orders, planned maintenance activities, stocked critical parts, and verified corrective actions.

    A strong predictive maintenance program should start with known failure modes and asset criticality. For example, vibration analysis should be targeted toward rotating equipment such as pumps, fans, motors, compressors, and chillers. Ultrasound can support leak detection, electrical discharge detection, steam trap assessments, and compressed air system reliability. Oil analysis can provide early indicators of wear, contamination, lubricant breakdown, and internal component degradation. Coolant and fluid monitoring can help identify corrosion, biological growth, improper concentration, and heat-transfer performance issues. Generator fuel reliability is also important because degraded fuel, water intrusion, microbial growth, and contamination can compromise emergency power availability when it is needed most.

    Another important opportunity is to integrate predictive maintenance findings directly into the CMMS. Alerts should not remain isolated in BMS, SCADA, DCIM, or vendor dashboards. They should trigger a defined workflow that includes risk ranking, ownership assignment, job planning, parts verification, execution, and feedback capture. This helps move the organization from simply observing asset conditions to actively managing asset risk.
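
    As a rough illustration of that workflow, the sketch below models an alert-driven work order that must pass through each stage in order. The class names, stage names, and the example asset are hypothetical; a real CMMS will have its own data model and API, so treat this only as a way to picture the sequence.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Workflow stages in the order listed above (names are illustrative)."""
    RISK_RANKING = 1
    OWNERSHIP_ASSIGNMENT = 2
    JOB_PLANNING = 3
    PARTS_VERIFICATION = 4
    EXECUTION = 5
    FEEDBACK_CAPTURE = 6
    CLOSED = 7


@dataclass
class WorkOrder:
    source: str   # where the alert came from, e.g. "BMS", "SCADA", "DCIM"
    asset: str    # hypothetical asset tag
    finding: str  # what the condition data showed
    stage: Stage = Stage.RISK_RANKING
    history: list = field(default_factory=list)

    def advance(self, note: str) -> None:
        """Record what was done at the current stage, then move to the next one."""
        if self.stage is Stage.CLOSED:
            raise ValueError("work order is already closed")
        self.history.append((self.stage.name, note))
        self.stage = Stage(self.stage.value + 1)


# Hypothetical example: a vibration alert raised by the BMS.
wo = WorkOrder(source="BMS", asset="CRAH-3", finding="rising supply fan vibration")
wo.advance("ranked high: unit serves a critical hall")
wo.advance("assigned to the mechanical lead")
print(wo.stage.name)  # JOB_PLANNING
```

    The point of forcing a fixed order is that an alert cannot be closed without an owner, a plan, verified parts, and captured feedback, which is the difference between observing a condition and managing the risk.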

    There is also value in using predictive maintenance data to continuously refine FMECA and RPN scoring. As more condition data becomes available, teams can better understand which failure modes are increasing in occurrence, which failure modes are becoming harder to detect before failure, and which risks require a change in maintenance strategy. This creates a feedback loop between field observations, sensor data, incident history, asset criticality, and maintenance planning.
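
    For anyone less familiar with the scoring, here is a minimal sketch of that feedback loop, assuming the conventional RPN definition (severity x occurrence x detection, each rated on a 1 to 10 scale). The assets, failure modes, and ratings are made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureMode:
    asset: str
    mode: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always caught early) to 10 (effectively undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical register entries.
pump = FailureMode("Chilled water pump", "bearing wear", 7, 4, 3)        # RPN 84
genset = FailureMode("Standby generator", "fuel contamination", 9, 3, 6)  # RPN 162
tower = FailureMode("Cooling tower", "fan gearbox failure", 6, 5, 4)      # RPN 120

print([m.asset for m in rank_by_rpn([pump, genset, tower])])
# ['Standby generator', 'Cooling tower', 'Chilled water pump']

# Feedback: suppose vibration routes show bearing defects far more often than assumed,
# so the occurrence rating is revised from 4 to 8 (RPN becomes 7 * 8 * 3 = 168).
pump_revised = replace(pump, occurrence=8)
print(rank_by_rpn([pump_revised, genset, tower])[0].asset)  # Chilled water pump
```

    In this made-up case, one revised occurrence rating moves the pump from the bottom of the list to the top, which is the kind of re-prioritization the feedback loop is meant to surface.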

    Ultimately, the next level of reliability maturity is not just technology deployment. It is building the process discipline, governance, and culture needed to turn condition data into timely decisions. Predictive maintenance should help teams move from reactive response to planned intervention, from isolated alarms to risk-based prioritization, and from historical maintenance schedules to dynamic strategies based on actual asset health.



    ------------------------------
    Robert Gafeney
    Sr. Reliability Engineering Manager
    CBRE
    Kathleen GA
    ------------------------------