Outage Debrief – October 31, 2019

What Happened

On October 21 at 11:46 ET, our internal monitoring alerted us to a problem.  On investigation, we found that our real-time communication service, which powers response buttons, device position updates, and chat messages, was not responding.  We restarted the service, and the problem was resolved by 12:12 ET, about 26 minutes after the alert.

What Went Wrong

  • We were already investigating and monitoring a memory leak.  Our projections gave us a couple of weeks before we would need to take action, but an unanticipated spike in memory consumption moved that timeline up.
  • We had to make some manual adjustments to connect to the right server, which added a couple of minutes to the outage.
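
To illustrate why a projection like the one described above can collapse from weeks to hours, here is a minimal sketch of a linear time-to-limit estimate. It is not our monitoring code: the function name, the sample readings, and the 4000 MB limit are all hypothetical, and a real projection would need to account for more than a straight-line trend.

```python
def hours_until_limit(samples, limit_mb):
    """Project hours remaining before memory use reaches limit_mb.

    samples: list of (hour, used_mb) readings in time order (hypothetical data).
    Fits a least-squares line to estimate growth per hour, then divides the
    remaining headroom (measured from the latest reading) by that rate.
    Returns None if there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return None
    slope = num / den  # estimated MB of growth per hour
    if slope <= 0:
        return None
    _, last_m = samples[-1]
    return max(0.0, (limit_mb - last_m) / slope)


# Steady leak: 10 MB per hour against an assumed 4000 MB limit.
steady = [(0, 1000), (1, 1010), (2, 1020), (3, 1030)]
# Same history, but the latest reading is a sudden spike.
spiked = [(0, 1000), (1, 1010), (2, 1020), (3, 3000)]

steady_hours = hours_until_limit(steady, 4000)
spiked_hours = hours_until_limit(spiked, 4000)
```

With the steady readings the estimate is on the order of days to weeks; a single spike both shrinks the remaining headroom and steepens the fitted growth rate, so the same calculation can drop to a few hours. Alerting on remaining headroom as well as on the projected date is one way to catch that case earlier.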

How We’re Fixing It

  • We’re eliminating the manual adjustments that were needed to connect to the server.
  • Despite spending 40 hours trying to find the cause of the memory leak, we are no closer to a fix.  We’ll be setting up a meeting to determine how best to proceed to minimize, and eventually eliminate, the impact of this problem on the service.