Despite Rising Automation, Human Error is a Top Cause of Downtime. Here’s How to Avoid It.

March 1, 2023

Another week, another story about a major data center outage. This time it’s British Airways under public scrutiny as the company scrambles to discover the source of data center downtime that grounded hundreds of flights.

While the cause of that outage hasn’t yet been released, that hasn’t stopped some experts from suggesting human error. They aren’t likely to be off base, either: human error remains a leading cause of IT infrastructure outages. Minimizing it should therefore be a primary focus of reliability efforts.

We all make mistakes, but when critical infrastructure is at stake, not to mention thousands of dollars in downtime-related costs, it’s worth some investment to reduce the potential negative effects of people on IT systems. Here are some tips to help you avoid downtime stemming from human error.

Throwing Tech at the Problem Only Goes So Far

Traditional methods for avoiding downtime tend to focus on redundancy in data center design and equipment and on geographically separate facilities with linked systems. More recently, automation via DCIM and software-defined data center technology has joined that list.

These are all valuable additions to a data center and can go a long way toward improving the reliability of the facility as a whole. Multiple fiber connections, diesel-powered generators, redundant network design, multiple UPS systems, and a disaster recovery plan are pretty much essential components of a modern enterprise data center. There should never be a single point of failure in your equipment that can take out the entire facility.
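As a rough illustration of the "no single point of failure" rule, the sketch below scans an equipment inventory and flags any category with fewer than two independent units. The inventory contents, the category names, and the `single_points_of_failure` helper are all hypothetical; a real audit would draw on your own DCIM or asset database and would also need to consider shared dependencies between units.

```python
# Hypothetical inventory for illustration: category -> independent units installed.
INVENTORY = {
    "fiber_uplink": ["carrier-a", "carrier-b"],
    "ups": ["ups-1", "ups-2"],
    "generator": ["gen-1"],
    "core_switch": ["sw-1", "sw-2"],
}


def single_points_of_failure(inventory, minimum=2):
    """Return the categories that have fewer than `minimum` distinct units."""
    return sorted(
        category
        for category, units in inventory.items()
        if len(set(units)) < minimum  # duplicate entries count only once
    )


if __name__ == "__main__":
    for category in single_points_of_failure(INVENTORY):
        print(f"Single point of failure: {category}")
```

In this made-up inventory only the generator would be flagged, since every other category has a second unit to fall back on.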

And yet, British Airways may have faced that very problem. Their data centers were almost certainly designed with reliability and redundancy in mind, but something happened that halted system failover to the second site. A UPS system at one of the sites was shut down, despite a combination of main power, batteries, and diesel backup. This may have been due to a surge or loss of voltage on the main power feed from the public utility. Why systems did not move over to the second site remains a mystery.

While automation is likely to be the future of data center management, it can’t replace humans just yet (and maybe never completely). As the industry embraces software-defined technology and robotics, we may see routine data center maintenance tasks handed off to robots with a much lower chance of failure. In the meantime, humans are still racking equipment and still performing software updates, and that’s unlikely to ever go away completely.

Finally, alarms and access control points are vital both to data center security and to creating an audit trail. They can help you pinpoint the cause of downtime, but they won’t necessarily prevent it outright. Still, configuring alerts for critical systems like cooling, power, humidity, and fire suppression can help you get out in front of mechanical failures.
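To make the alerting idea concrete, here is a minimal sketch of a threshold check over sensor readings. Everything in it is an assumption for illustration: the sensor names, the limits in `THRESHOLDS`, and the `check_readings` helper are hypothetical, and real limits should come from your equipment specifications and monitoring platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Acceptable range for one sensor; None means no bound on that side."""
    low: Optional[float]
    high: Optional[float]


# Hypothetical limits for illustration only, not recommended values.
THRESHOLDS = {
    "supply_air_temp_c": Threshold(low=18.0, high=27.0),
    "relative_humidity_pct": Threshold(low=20.0, high=80.0),
    "ups_battery_pct": Threshold(low=50.0, high=None),
    "fire_suppression_pressure_bar": Threshold(low=20.0, high=None),
}


def check_readings(readings):
    """Return one alert message per sensor that is missing or out of range."""
    alerts = []
    for name, limits in THRESHOLDS.items():
        if name not in readings:
            # A silent sensor is treated as a problem, not as "all clear".
            alerts.append(f"{name}: no reading received")
            continue
        value = readings[name]
        if limits.low is not None and value < limits.low:
            alerts.append(f"{name}: {value} is below {limits.low}")
        elif limits.high is not None and value > limits.high:
            alerts.append(f"{name}: {value} is above {limits.high}")
    return alerts


if __name__ == "__main__":
    sample = {
        "supply_air_temp_c": 31.5,
        "relative_humidity_pct": 45.0,
        "ups_battery_pct": 100.0,
    }
    for alert in check_readings(sample):
        print(alert)
```

Treating a missing reading as an alert is a deliberate choice in this sketch: a sensor that stops reporting is often the first sign that something has gone wrong.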