On Oct 21st at 11:46 ET, our internal monitoring alerted us to a problem. After investigating, we found that our real-time communication service, which provides response buttons, device position updates, and chat messages, was not responding. We restarted the service, and the problem was resolved by 12:12 ET.
What went wrong
- We had been investigating and monitoring a memory leak. Our projections gave us a couple of weeks before we needed to take action, but an unanticipated spike in memory consumption moved that timeline up.
- We had to make some manual adjustments to connect to the right server, which added a couple of minutes to the outage.
How we’re fixing it
- We’re eliminating the manual adjustments we needed to make to connect to the server.
- Despite spending 40 hours looking for the cause of the memory leak, we are no closer to a fix. We’ll be setting up a meeting to determine how best to proceed to minimize, and eventually eliminate, the impact of this problem on the service.