Cloud Redis Outage
Incident Report for Schoolbox
Postmortem

Overview

On the 26th of May 2023, an outage affecting the Sydney cluster was identified, impacting all schools in the cluster from 8:40 AM to 9:00 AM.

  • Initial analysis identified a sharp spike in Redis memory usage, from 8 GB to 14 GB (see the memory check sketched below)
  • Restarting Redis temporarily resolved the problem, and all instances were back up and running by 9:00 AM
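
For context, a minimal sketch of the kind of memory check used to confirm such a spike, written against the redis-py client; the host name is a placeholder, not a production value:

    import redis

    # Connect to the cluster's Redis instance (host is a placeholder).
    r = redis.Redis(host="redis.sydney.internal", port=6379)

    info = r.info("memory")
    used_gb = info["used_memory"] / 1024 ** 3
    peak_gb = info["used_memory_peak"] / 1024 ** 3
    print(f"used: {used_gb:.1f} GB, peak: {peak_gb:.1f} GB")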

Impact

Affected all schools on the Sydney AWS cluster

Timeline

  • 26/05/2023 08:50 AM - Support received calls and tickets from several cloud schools reporting that users could not access their production instances, and escalated to Infra
  • 26/05/2023 09:00 AM - Restarted Redis, which temporarily resolved the problem
  • 26/05/2023 09:25 AM - Declared an incident and assigned action items for investigation
  • 26/05/2023 12:15 PM - Resized the Redis server from r4.large to r4.xlarge, increasing memory from 15 GB to 30 GB (a sketch of the resize follows below)
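
A sketch of what the resize involves, assuming Redis runs on a self-managed EC2 instance (the r4.large/r4.xlarge naming suggests EC2 rather than ElastiCache); the instance ID is a placeholder:

    import boto3

    ec2 = boto3.client("ec2", region_name="ap-southeast-2")  # Sydney
    instance_id = "i-0123456789abcdef0"  # placeholder, not the real instance

    # An EC2 instance type can only be changed while the instance is stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": "r4.xlarge"},  # up from r4.large (15 GB -> 30 GB)
    )

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])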

Root Cause

  • Initial analysis points to an increase in user session data stored in Redis (a sketch of how this could be quantified follows below)
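
A sketch of how the session-data contribution could be quantified; the key prefix and host are assumptions about naming, not confirmed values:

    import redis

    r = redis.Redis(host="redis.sydney.internal", port=6379)  # placeholder host

    # Sum MEMORY USAGE across session keys. "PHPREDIS_SESSION:*" is an assumed
    # prefix; the actual session key naming may differ.
    total_bytes = 0
    count = 0
    for key in r.scan_iter(match="PHPREDIS_SESSION:*", count=1000):
        total_bytes += r.memory_usage(key) or 0
        count += 1

    print(f"{count} session keys using {total_bytes / 1024 ** 3:.2f} GB")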

Lessons Learned

  • Redis memory allocation should have been higher than what was previously provisioned
  • The Datadog alert did not trigger (a sketch of the missing monitor follows below)
  • Having a single Redis instance serving the entire cloud amplified the impact of the incident
  • Stagger cloud releases for each cluster
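
As a starting point for closing the alerting gap, a sketch of the Redis memory monitor that should have fired, using the datadog Python library; the query, tag, and threshold are illustrative assumptions, not the actual monitor definition:

    from datadog import initialize, api

    initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

    # redis.mem.used is reported in bytes by the Datadog Redis integration.
    # Alert when Sydney-cluster usage exceeds ~12 GB of the 15 GB available;
    # the cluster tag and threshold are assumptions.
    api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:redis.mem.used{cluster:sydney} > 12884901888",
        name="Redis memory usage high (Sydney cluster)",
        message="Redis memory above 12 GB on the Sydney cluster. @infra-team",
        options={"notify_no_data": True, "thresholds": {"critical": 12884901888}},
    )
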
Posted Jun 19, 2023 - 04:12 UTC

Resolved
On the 26th of May 2023, an outage affecting the Sydney cluster was identified, impacting all schools in the cluster from 8:40 AM to 9:00 AM.

Timeline

- 26/05/2023 08:50 AM - Support received calls and tickets from several cloud schools reporting that users could not access their production instances, and escalated to Infra

- 26/05/2023 09:00 AM - Restarted Redis, which temporarily resolved the problem

- 26/05/2023 09:25 AM - Declared an incident and assigned action items for investigation

- 26/05/2023 12:15 PM - Resized the Redis server from r4.large to r4.xlarge, increasing memory from 15 GB to 30 GB
Posted May 25, 2023 - 23:00 UTC