Overview
On 26 May 2023, an outage on the Sydney cluster impacted all schools in the cluster from 8:40 AM to 9:00 AM.
- Initial analysis identified a spike in Redis memory usage from 8 GB to 14 GB
- Restarting Redis temporarily resolved the problem, and all instances were back up and running by 9:00 AM
Impact
All schools on the Sydney AWS cluster were affected
Timeline
- 26/05/2023 08:50 AM - Support received calls and tickets from several cloud schools reporting that users could not access the prod instance, and escalated the issue to Infra
- 26/05/2023 09:00 AM - Restarted Redis, which temporarily resolved the problem
- 26/05/2023 09:25 AM - Declared an incident and assigned action items to investigate
- 26/05/2023 12:15 PM - Resized the Redis server from r4.large to r4.xlarge, increasing memory from 15 GB to 30 GB
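To put the resize in context, the short sketch below works through the memory headroom using only the figures quoted in this report. It assumes the 8 and 14 figures are Redis usage in GB and the 15 and 30 figures are total server memory in GB; the constant and function names are illustrative only.

```python
# Figures taken from this report; all values are assumed to be in GB.
USAGE_BEFORE_SPIKE = 8
USAGE_AT_SPIKE = 14
MEMORY_R4_LARGE = 15
MEMORY_R4_XLARGE = 30


def utilisation(used_gb, total_gb):
    """Return memory utilisation as a percentage, rounded to one decimal place."""
    return round(100 * used_gb / total_gb, 1)


if __name__ == "__main__":
    print("r4.large before spike:", utilisation(USAGE_BEFORE_SPIKE, MEMORY_R4_LARGE), "%")
    print("r4.large at spike:    ", utilisation(USAGE_AT_SPIKE, MEMORY_R4_LARGE), "%")
    print("r4.xlarge at spike:   ", utilisation(USAGE_AT_SPIKE, MEMORY_R4_XLARGE), "%")
```

On these assumptions, usage at the spike was roughly 93% of the r4.large server's memory, and the same usage would sit at roughly 47% on the r4.xlarge.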
Root Cause
- Initial analysis points to an increase in user session data stored in Redis
Lessons learned
- The Redis memory allocation should have been higher than what was previously allocated
- The Datadog alert did not trigger
- Having a single Redis instance serve the entire cloud increased the impact of the incident
- Stagger cloud releases for each cluster
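As a complement to the alerting lesson above, the following is a minimal sketch of a memory threshold check that could run independently of Datadog. It is not the team's actual alert configuration: the 80% threshold, the function names, and the sample values are all assumptions. The sample text mimics the `key:value` format of Redis `INFO memory` output (the `used_memory` and `maxmemory` fields) and is built from the 14 GB and 15 GB figures in this report, not from real server output.

```python
def parse_info(info_text):
    """Parse Redis INFO-style text (key:value lines, '#' section headers) into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_alert(info_text, threshold=0.8):
    """Return True if used_memory / maxmemory is at or above the threshold.

    Returns None when maxmemory is 0 or missing, since no limit is configured
    and the ratio is undefined. The 0.8 default is an assumed value.
    """
    fields = parse_info(info_text)
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", 0))
    if limit == 0:
        return None
    return used / limit >= threshold


# Illustrative sample only: 14 GiB used against an assumed 15 GiB limit.
SAMPLE_INFO = "# Memory\r\nused_memory:15032385536\r\nmaxmemory:16106127360\r\n"

if __name__ == "__main__":
    print(memory_alert(SAMPLE_INFO))
```

In practice the text would come from running `INFO memory` against the server, and the threshold would be tuned so the alert fires well before memory is exhausted.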