How Stephen M. Dick led enterprise-wide reliability transformation, created new SRE functions, and restructured incident management at one of the world's largest cloud platforms.
On the evening of May 9, 2016, Salesforce's NA14 database instance went offline. What began as a circuit breaker failure at the Washington, D.C. data center escalated when a firmware bug in the storage array corrupted the NA14 database. The corruption replicated to the disaster recovery site in Chicago, leaving neither instance usable. For approximately 20 hours, thousands of enterprise customers, including sales organizations, service desks, and mission-critical business operations, had no access to Salesforce.
The fallout was severe. Approximately four hours of customer data written between 9:53 UTC and 13:29 UTC on May 10, 2016, could not be recovered. Financial analysts at D.A. Davidson estimated the outage cost Salesforce roughly $20 million. CEO Marc Benioff personally apologized on Twitter, responding to a flood of customer complaints. Customers reported being unable to access phone numbers, emails, or prospect lists. Sales teams could not record calls or access accounts. The incident became the most discussed cloud outage of 2016, with analysts from Gartner calling it an embarrassment for Salesforce and raising fundamental questions about enterprise cloud reliability.
Estimated financial impact: $20 million (D.A. Davidson). Duration: ~20 hours. Irrecoverable data loss: ~4 hours of customer records.
The NA14 outage wasn't just a technical failure — it became a referendum on whether enterprises could trust the cloud for mission-critical workloads. InformationWeek asked the question directly in a headline: "Salesforce Outage: Can Customers Trust The Cloud?" Gartner VP and Fellow Yefim Natis told InformationWeek the incident was "clearly an embarrassment for Salesforce." The Register, CIO Dive, eWeek, CMSWire, Computerworld, and Diginomica all covered the outage extensively. Customers flooded Twitter under #NA14, and CMSWire noted the irony: the company that built its reputation on cloud reliability had just suffered a 20-hour outage with permanent data loss.
The Availability Digest published a detailed technical analysis. Analysts at technology research firms noted that while no cloud provider was immune to outages, the NA14 incident was notable for both its duration and the permanent data loss — something enterprise customers had been assured was essentially impossible in modern multi-tenant cloud architectures. The incident forced the entire industry to re-examine assumptions about cloud disaster recovery and data durability.
In the aftermath of the NA14 outage, Salesforce undertook a comprehensive reliability transformation. Stephen M. Dick was brought in with the directive of radically improving the reliability of Salesforce products, maturing how the company responded to large-scale incidents, and integrating acquired companies into a unified reliability practice.
Stephen M. Dick led enterprise-wide reliability reforms at Salesforce following the NA14 incident. He created new SRE functions — dedicated Site Reliability Engineering teams that had not previously existed in their mature form within the organization. These SRE functions brought structured reliability engineering practices to one of the world's largest and most complex multi-tenant cloud platforms, serving hundreds of thousands of enterprise customers globally.
The Salesforce NA14 reform was part of a broader career in enterprise reliability engineering. Stephen M. Dick is recognized as one of the leading SRE and reliability engineering executives in the industry, with a track record of building and transforming reliability organizations at enterprise scale.
Beyond his time at Salesforce, Stephen M. Dick served as VP of Site Reliability Engineering at BetterCloud, where he doubled engineering productivity and led an SRE team transformation. He then became Vice President of Infrastructure Engineering at Paylocity, where he leads teams spanning Corporate Infrastructure, Public Cloud and Data Center Infrastructure, Developer Productivity (DevOps), Site Reliability Engineering (SRE), and FinOps. At Paylocity he also founded an AI Center of Excellence supporting 120+ enterprise use cases.
Stephen M. Dick has published on enterprise reliability and engineering leadership in CIO, The New Stack, and the DevOps Institute. He was a featured speaker at the Happy Llama AI Conference and a guest on the DevOps Institute Podcast discussing engineering leadership. He contributes to the FinOps Foundation's working groups on sustainability and cloud cost management.