Outage debrief – September 19, 2019

What Happened

On Sept 19th at 12:22 ET we deployed code that let us send a silent notification to all iOS users about the upcoming iOS 13 release. With iOS 13 coming out in a matter of hours, Joe made the decision to push these changes as a ‘Hotfix’, a release path that favors expediency over thoroughness. The deployment caused an outage. Our internal monitoring caught the problem within minutes, we recognized it as an outage, and the fix was in production in under an hour.

What Went Wrong

We rushed the notification, and it bit us.

  • We had a fix for the iOS 13 problems in beta for weeks ahead of the release, but we failed to release it early enough.
  • We did not do enough regression testing around core functionality before releasing a hotfix.
  • We rolled to backup servers first, but did not wait long enough for our monitoring to catch the problem before we rolled to our production servers.

How We’re Fixing It

  • I’m going to review our release process to identify why the fix stayed in beta so long.
  • We are evaluating what qualifies for a hotfix and reviewing the steps we take in each hotfix deployment.
  • We’ll ensure we give our backup servers enough time to alert us to problems before we continue with a production release.
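
To make the last point concrete, here is a minimal sketch of what a soak period between the backup and production rollouts could look like. It is not our actual deployment tooling: the function name staged_rollout, the deploy and is_healthy callbacks, and the 15-minute and 30-second defaults are all hypothetical placeholders chosen for illustration.

```python
import time
from typing import Callable


def staged_rollout(
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    soak_seconds: float = 900.0,  # assumed soak window, not a value from this debrief
    poll_seconds: float = 30.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy to backup, watch it for the soak window, then deploy to production.

    Returns True if production was deployed, False if the rollout was halted.
    """
    deploy("backup")

    waited = 0.0
    while waited < soak_seconds:
        # Halt as soon as monitoring reports a problem on the backup servers.
        if not is_healthy("backup"):
            return False
        step = min(poll_seconds, soak_seconds - waited)
        sleep(step)
        waited += step

    # One last check at the end of the soak window before touching production.
    if not is_healthy("backup"):
        return False

    deploy("production")
    return True
```

The point of the sketch is that production is only reached after the backup servers have stayed healthy for the whole window, so a hotfix cannot skip the wait just because the first check passed.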