Cloud Redis Outage

Incident Report for Schoolbox

Postmortem

Overview

On the 26th of May 2023, an outage affecting the Sydney cluster was identified. All schools in the cluster were impacted from 8:40 AM to 9:00 AM.

Initial analysis identified a spike in Redis memory usage from 8 GB to 14 GB.
Restarting Redis temporarily solved the problem, and all instances were back up and running by 9:00 AM.

Impact

All schools on the Sydney AWS cluster were affected.

Timeline

26/05/2023 08:50 AM - Support received calls and tickets from a number of cloud schools reporting that users could not access their production instances, and escalated the issue to Infra
26/05/2023 09:00 AM - Restarted Redis, which temporarily solved the problem
26/05/2023 09:25 AM - Declared an incident and assigned action items to investigate
26/05/2023 12:15 PM - Resized the Redis server from r4.large to r4.xlarge, increasing memory from 15 GB to 30 GB

Root Cause

Initial analysis points to an increase in the user session data stored in Redis.

Lessons learned

The Redis memory allocation should have been higher; a peak of 14 GB left little headroom on a 15 GB server
The Datadog alert did not trigger during the incident
Having a single Redis instance serve the entire cloud widened the impact of the incident
Cloud releases should be staggered across clusters
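
To illustrate the headroom point, the figures reported above (8 GB baseline, 14 GB peak, 15 GB and 30 GB servers) can be run through a simple utilization check. This is a minimal sketch only: the 70% and 85% thresholds, the MemoryCheck class, and the check_from_info helper are hypothetical and do not describe the actual Datadog monitor configuration. The only real-world detail assumed is that the Redis INFO command reports a used_memory field in bytes.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB; the report's "GB" figures are treated as GiB here


@dataclass(frozen=True)
class MemoryCheck:
    """Compares Redis memory usage against server capacity."""

    used_bytes: int
    capacity_bytes: int
    warn_ratio: float = 0.70      # hypothetical warning threshold
    critical_ratio: float = 0.85  # hypothetical critical threshold

    @property
    def utilization(self) -> float:
        """Fraction of capacity in use, e.g. 0.5 for half full."""
        return self.used_bytes / self.capacity_bytes

    def status(self) -> str:
        """Return 'ok', 'warn', or 'critical' for the current utilization."""
        if self.utilization >= self.critical_ratio:
            return "critical"
        if self.utilization >= self.warn_ratio:
            return "warn"
        return "ok"


def check_from_info(info: dict, capacity_bytes: int) -> MemoryCheck:
    """Build a check from a dict shaped like the Redis INFO memory section.

    With redis-py, such a dict could be obtained via
    redis.Redis(...).info("memory"); that call is not made here so the
    sketch runs without a server.
    """
    return MemoryCheck(used_bytes=int(info["used_memory"]),
                       capacity_bytes=capacity_bytes)


if __name__ == "__main__":
    scenarios = [
        ("baseline on 15 GB server", 8 * GIB, 15 * GIB),
        ("peak on 15 GB server", 14 * GIB, 15 * GIB),
        ("peak on 30 GB server", 14 * GIB, 30 * GIB),
    ]
    for label, used, capacity in scenarios:
        check = check_from_info({"used_memory": used}, capacity)
        print(f"{label}: {check.utilization:.0%} -> {check.status()}")
```

Under these assumed thresholds, the 14 GB peak is critical on the 15 GB server and back within limits on the 30 GB server. A warning threshold set below the critical one is intended to give time to act before memory is exhausted.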

Posted Jun 19, 2023 - 14:12 AEST

Resolved

On the 26th of May 2023, an outage affecting the Sydney cluster was identified. This resulted in all schools in the cluster being impacted from 8:40 AM - 9:00 AM.

Timeline

- 26/05/2023 08:50 AM - Support received calls and tickets from a number of cloud schools reporting that users could not access their production instances, and escalated the issue to Infra

- 26/05/2023 09:00 AM - Restarted Redis, which temporarily solved the problem

- 26/05/2023 09:25 AM - Declared an incident and assigned action items to investigate

- 26/05/2023 12:15 PM - Resized the Redis server from r4.large to r4.xlarge, increasing memory from 15 GB to 30 GB

Posted May 26, 2023 - 09:00 AEST