On August 31 at 16:08 ET, our monitoring alerted us that one of the services that enable us to provide response buttons, device position updates, and chat messages was not responding. After multiple attempts to recover the service failed, we rerouted to a backup, restoring full functionality at 18:12 ET.
What went wrong
- Our warning-level monitoring did not detect the problem before the service stopped responding.
- We did not reroute to a backup as fast as we could have.
- We lacked the ability to reroute quickly for this particular service.
How we’re fixing it
- We’re updating our warnings so that we can catch and fix an emerging issue before it affects the service.
- We’re deliberately trying to break the service in our test environments to identify what caused it to become unstable.