Instead of Dreading a System Crash, Schedule One and Learn to Avoid Them

The best defense against outages is to rehearse for the worst and accept real incidents as an opportunity to improve.

By Nisha Ahluwalia

Opinions expressed by Entrepreneur contributors are their own.

According to a survey by CA Technologies, companies in North America and Europe lost more than $26.5 billion in revenue due to downtime, and that's from 2010!

There are various ways to calculate the monetary cost of system outages, but the damage to a company's reputation is immeasurable. When Microsoft's Azure cloud-computing service experienced a major outage recently, experts speculated that it could be a major blow to the software giant's attempt to compete against rivals Google and Amazon.

Good CEOs and CIOs refuse to accept excuses for even small levels of downtime, but it's not easy to hit five nines of reliability. Nonetheless, no matter how complex a company's systems and business, there are always ways to engineer and deliver higher reliability and quality of service. Below are the actions that CEOs need to take to boost their company's reliability:

1. Stop waiting for an outage. Create one.

If you wait for a customer to do something that causes a failure, you're too late. For example, Netflix has tackled unexpected outages using their "Simian Army," a set of automated tools that test applications for failure resilience. However, for most companies, the best way to handle this is to keep it simple.

Encourage your ops and dev teams to schedule a recurring meeting and create outages manually. Injecting failure reveals implementation issues that reduce resiliency while proactively uncovering deficiencies that would otherwise be the root cause of an outage.
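A manual drill can start very small. The sketch below is one hypothetical way to inject failures into a single dependency call and count how often a simple retry policy still fails to recover. Every name in it (`InjectedFailure`, `with_fault_injection`, `fetch_profile`, `run_drill`) is an assumption made for illustration, not part of Netflix's Simian Army or any other real tool.

```python
import random


class InjectedFailure(Exception):
    """Raised on purpose during a scheduled failure drill."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("drill: forced failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Call func up to `attempts` times; return None if every attempt fails."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure:
            continue
    return None


def fetch_profile():
    # Hypothetical stand-in for a real call to a downstream service.
    return {"user": "demo"}


def run_drill(failure_rate, calls, attempts, seed=0):
    """Return how many of `calls` requests failed even after retries."""
    rng = random.Random(seed)  # seeded so a drill can be replayed
    flaky = with_fault_injection(fetch_profile, failure_rate, rng)
    return sum(
        1 for _ in range(calls) if call_with_retry(flaky, attempts) is None
    )
```

Running `run_drill(0.3, 1000, attempts=1)` and then again with `attempts=3` gives the team a concrete number to discuss in the drill meeting: how many requests a customer would have seen fail under each policy.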

Scheduled outages build a strong collaborative culture simply by bringing teams together on a regular basis. Working together to fix artificial failures will combat the idea that an actual failure can be ignored or justified with explanations.

2. Create (and protect) time for learning

No good engineer fixes the same problems without learning in the process. Make sure the teams responsible for resolving incidents have time to work through comprehensive postmortems.

Empower your teams to analyze what worked and what didn't, without forcing them to determine a root cause. All too often, human error is the focus of these conversations, but that just isn't healthy. Blameless retrospectives allow teams to uncover the real issues and make proactive adjustments.

Businesses want to move fast, but resist the temptation to move on to other issues when systems resume running or when everyone agrees on a "root cause." Invest the time needed to understand how your systems and teams work. See it as an opportunity for the contextual learning needed to make real-time decisions that will improve your company's mean-time-to-resolution.
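To see whether that learning is paying off, mean-time-to-resolution has to be measured consistently. A minimal sketch, assuming each incident is recorded as a pair of detection and resolution timestamps (the function name and the record shape are illustrative assumptions, not a standard):

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: one 30-minute and one 90-minute incident.
incident_log = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 30)),
]
mttr = mean_time_to_resolution(incident_log)
```

Tracking this figure per quarter, rather than per incident, makes it easier to tell a real trend from one unusually long outage.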

3. Treat your ops and dev teams like sales and marketing. They drive revenue.

If you didn't support your sales teams with tools, training and incentives to hit their goals, people would think you were nuts. Despite their critical role in ensuring your customers are getting value from your company, ops and dev teams often get less attention than their customer-facing counterparts.

Give these employees the infrastructure and tools to achieve peak performance. That includes the latest operations management tools, time and resources for training, and goals with incentives to meet them. If you don't provide them with necessary support and recognition, how can you expect them to deliver a high-value product with high availability?

4. Set a high bar for uptime

Even short periods of downtime have a material impact on your bottom line and market perception, but once you're committed to supporting your engineering teams, you're in a much better position to set a higher bar for uptime. Build, buy or partner to get the technology and skill sets you need.
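It helps to translate an uptime target into the downtime it actually allows. The arithmetic below is a rough sketch that assumes a 365-day year and counts every minute of the year as service time; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


three_nines = allowed_downtime_minutes(99.9)    # about 525.6 minutes a year
five_nines = allowed_downtime_minutes(99.999)   # about 5.26 minutes a year
```

Under these assumptions, five nines leaves a little over five minutes of downtime for the entire year, which is why the target is hard to hit without redundancy and rehearsed incident response.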

Unfortunately, many companies still use homegrown operations management systems without redundancy, and still use disparate tools and manual processes to meander through the incident lifecycle. A focus on reducing ops team costs instead of setting the right culture from the start simply doesn't make sense. The time spent on fixes alone will quickly become a greater cost for your company. Your product and services will suffer as a result.

CEOs who understand the importance of reliability in today's always-on world don't wait until there's an outage to improve operations. They don't ignore the rich learning that comes from resolving incidents. They don't treat operations and development teams like the "back office." The CEOs of highly reliable companies invest in their operations infrastructure, processes and people because they care about the growth of their business and the loyalty of their customers.

Nisha Ahluwalia

Vice President of Marketing at PagerDuty

Nisha is vice president of marketing, responsible for all things marketing, including generating demand, building the PagerDuty brand and our community activities. She comes to PagerDuty with strong software-as-a-service experience, having built and managed several marketing functions at RingCentral and Cisco WebEx. Before she got into marketing, Nisha earned her bachelor of science in Computer Science from San Jose State University.
