Stamping Out the Main Causes of Data Center Downtime
With the average cost of data center outages hovering at $740,000 (according to a Ponemon / Emerson study from 2016), operators must take action to avoid the most common causes of downtime. Let’s take a quick dive into the leading origins of unplanned downtime and how you can avoid them in your data center.
Human Error
Human error is the simplest cause and also one of the hardest to avoid. Simply put, people make mistakes. With 22% of outages stemming from human error, it’s not something you should overlook. Training and simple modifications to equipment and controls can help reduce it. Add shielding to emergency-off buttons, manual shut-off switches, and any other control that can trigger an immediate, unplanned outage. Make sure these emergency-off controls are clearly labeled and covered.
It sounds harsh, but your employees must be disciplined. No-food-and-drink policies are standard on the data center floor, so make sure they’re enforced. The last thing you need is a technician spilling Mountain Dew all over an essential component. Post signs at the data center entrances and reinforce the policy during initial and ongoing training.
While you’re revising your documentation to include regular training and beverage policies, add documented procedures for maintenance, upgrades, and other installations. By spelling things out step-by-step you can keep risk low.
Finally, make sure you know who is in your data center to avoid equipment damage or data breaches. All access logs must be up to date and every visitor should be escorted and logged. Install video surveillance where possible and keep each section of your data center under a layered security system.
UPS Failure
UPS failures remain the most common reason for outages. They can stem from battery or equipment failure, or from a power draw that exceeds the UPS capacity.
In a power outage, the UPS system draws from its battery backups, making the batteries essential to maintaining uptime. However, batteries do not last forever. Follow all maintenance schedules and best practices from the manufacturer to check battery health. At least quarterly, check batteries for correct installation, discharge, and charging. This includes visual inspection, capacity checks, and regular monitoring through software or the UPS unit itself.
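As a rough illustration of the quarterly check described above, the sketch below flags battery strings whose inspection is overdue or whose measured capacity has dropped too far. The `BatteryString` record, the 91-day interval, and the 80% capacity threshold are assumptions made for this example, not manufacturer guidance; substitute the values from your own maintenance schedule.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed values for illustration; use the manufacturer's figures instead.
CHECK_INTERVAL = timedelta(days=91)   # roughly one quarter
MIN_CAPACITY_PCT = 80.0               # hypothetical replacement threshold


@dataclass
class BatteryString:
    name: str
    last_checked: date
    capacity_pct: float  # measured capacity as a percentage of rated capacity


def needs_attention(battery, today):
    """Return a list of reasons this battery string needs attention."""
    reasons = []
    if today - battery.last_checked > CHECK_INTERVAL:
        reasons.append("inspection overdue")
    if battery.capacity_pct < MIN_CAPACITY_PCT:
        reasons.append("capacity below threshold")
    return reasons


# Hypothetical inventory, as it might be exported from monitoring software.
fleet = [
    BatteryString("UPS-A/string-1", date(2024, 1, 10), 96.0),
    BatteryString("UPS-A/string-2", date(2023, 9, 1), 91.0),
    BatteryString("UPS-B/string-1", date(2024, 2, 20), 74.5),
]
today = date(2024, 3, 15)
report = {b.name: needs_attention(b, today) for b in fleet}
for name, reasons in report.items():
    print(name, "->", ", ".join(reasons) or "ok")
```

A report like this does not replace visual inspection; it only helps make sure no string quietly falls off the schedule.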
While higher operating temperatures are now commonplace in white space, higher temps can shorten your battery life. A separate UPS room could help alleviate early wear and tear. Avoid frequent discharge and look for loose connections or worn terminals.
As data centers become more and more dense, they are drawing more power at each rack. Don’t allow your UPS capacity to fall below your IT load, and size it against peak draw rather than the average alone. A Data Center Infrastructure Management (DCIM) platform can help you evaluate power draw throughout a given period. Redundant UPS systems are also a necessity if the goal is 100% uptime. Be sure to design your facility with N+1 capacity, even through projected growth stages, as it will make the addition of future UPS units simpler.
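The N+1 sizing rule can be checked with a little arithmetic. The sketch below assumes identical UPS modules and ignores power factor, derating, and efficiency losses; the function names and the load figures are hypothetical.

```python
import math


def ups_units_required(peak_load_kw, unit_capacity_kw, redundant_units=1):
    """Units needed to carry the peak load, plus redundant spares (N+1 by default)."""
    n = math.ceil(peak_load_kw / unit_capacity_kw)
    return n + redundant_units


def survives_single_failure(peak_load_kw, unit_capacity_kw, installed_units):
    """True if the remaining units still cover the peak load after one unit fails."""
    return (installed_units - 1) * unit_capacity_kw >= peak_load_kw


# Hypothetical figures: 100 kW modules, peak load growing from 320 kW to 450 kW.
current = ups_units_required(320, 100)   # N = 4, so 5 units
future = ups_units_required(450, 100)    # N = 5, so 6 units
print(current, future)

# Five units are N+1 today but not at the projected load.
print(survives_single_failure(450, 100, installed_units=5))
```

Running the same check against projected load shows when a design that is N+1 today stops being N+1 as racks fill up.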
Cyber crime / DDoS
Cyberattacks have jumped the ranks as a top cause of unplanned downtime in the past few years, going from just 2% of outages in 2010 to 22% in 2016. Data center operators must take action to establish early detection and mitigation systems.
A large-scale DDoS attack can be difficult to defend against. Most ISPs provide some protection at Layer 3 and Layer 4 of the network, but your services need extra defense at Layer 7, which can be targeted specifically via HTTP GET floods or similar attacks. Firewalls, IPS/IDS, and DDoS mitigation services can be combined to filter or reroute malicious traffic.
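To make the Layer 7 point concrete, here is a minimal sketch of per-client request rate limiting, one of the basic ideas behind defending against HTTP GET floods. It is an illustration only: the `SlidingWindowLimiter` class and its limits are hypothetical, and a production deployment would normally rely on a reverse proxy, web application firewall, or mitigation service rather than hand-written code.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def allow(self, client, now):
        hits = self._hits[client]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject without recording the request
        hits.append(now)
        return True


# Hypothetical policy: 3 requests per second per client address.
limiter = SlidingWindowLimiter(limit=3, window=1.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(decisions)
```

Keying on client address alone is a simplification. A distributed attack spreads requests across many sources, which is why upstream mitigation services remain part of the picture.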
Heat Related Failures
Computer room air conditioner (CRAC) failures have also become increasingly common as densities rise. Many cooling systems were not designed for the heat density of a packed modern data center. Once again, projecting your data center floor out to 100% capacity can help you plan for future cooling loads. Be sure to implement containment in some form, whether hot aisle or cold aisle. This can help mitigate hot spots, which you can locate with heat modeling software and some DCIM systems.
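Where rack inlet temperatures are already being collected, a simple pass over the readings can point to likely hot spots before full heat modeling is available. The sketch below is one possible approach, not a description of any particular DCIM product. The 27 °C limit and the 5 °C margin above the room median are assumed thresholds, and the rack names are made up.

```python
from statistics import median

RECOMMENDED_MAX_C = 27.0     # assumed inlet limit; set this to your own policy
DELTA_ABOVE_MEDIAN_C = 5.0   # hypothetical margin for "unusually warm"


def find_hot_spots(inlet_temps_c):
    """Return rack ids whose inlet temperature is over the limit or well above the median."""
    mid = median(inlet_temps_c.values())
    return sorted(
        rack
        for rack, temp in inlet_temps_c.items()
        if temp > RECOMMENDED_MAX_C or temp - mid > DELTA_ABOVE_MEDIAN_C
    )


# Hypothetical inlet readings in degrees Celsius, keyed by rack id.
readings = {"A01": 22.5, "A02": 23.0, "A03": 29.4, "B01": 21.8, "B02": 27.9, "B03": 22.1}
hot = find_hot_spots(readings)
print(hot)
```

The median comparison catches racks that are still under the absolute limit but running much warmer than their neighbors, which is often an early sign of an airflow or containment problem.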
If your design implements rack-level cooling with an immersion element, chemical refrigerants are a better option than water-based systems. With water as the coolant, you run the risk of taking down entire racks or floors should the system leak or fail.
Every data center, from small operations through enterprise and service-provider scale facilities, must strive for 100% uptime to deliver reliable services to end users. By planning for the future, your data center can avoid some of the most common causes of downtime, but doing so will likely require additional investment of both money and time.