2023-09-11 Cloud Outage

Incident Report for Schoolbox

Postmortem

Date & Time of Incident:
September 11, 2023, 10:44 AM - 11:15 AM

Overview:
At approximately 10:44 AM, our internal support team identified an issue within our Cloud hosting environment. Subsequently, we received reports from some Schools utilising our Cloud services, confirming downtime on their instances.

Impact:
All cloud clusters experienced downtime.
Total Downtime Duration: 31 minutes (10:44 AM - 11:15 AM)

Timeline:
10:44 AM: Internal support team identifies the issue.
10:54 AM: Incident declaration; Slack incident room created.
10:56 AM: Incident published via status page.
11:04 AM: Cloud DNS servers restarted with increased CPU capacity.
11:15 AM: Service resumed after server restart.
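
As a rough cross-check of the timeline and the stated total, the elapsed minutes at each step can be computed from the timestamps. This is only an illustrative sketch: the event labels and function names are made up for the example, and it assumes all timestamps fall on the same day in the same time zone.

```python
from datetime import datetime

# Timestamps copied from the timeline; the keys are illustrative labels.
TIMELINE = {
    "identified": "10:44",
    "declared": "10:54",
    "published": "10:56",
    "dns_restarted": "11:04",
    "resumed": "11:15",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_detection(timeline: dict) -> dict:
    """Minutes elapsed at each event, measured from the 'identified' entry."""
    first = timeline["identified"]
    return {event: minutes_between(first, t) for event, t in timeline.items()}


if __name__ == "__main__":
    for event, minutes in elapsed_since_detection(TIMELINE).items():
        print(f"{event}: +{minutes} min")
```

On these figures, the incident was declared 10 minutes after detection, published at 12 minutes, the DNS servers were restarted at 20 minutes, and service resumed at 31 minutes, which matches the total downtime stated under Impact.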

Root Cause:
The root cause of this incident was the installation of ClamAV on the Gateway Servers, which exhausted the CPU credits on these servers.
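
To illustrate how added load can drain a CPU credit balance (the resolution update describes the DNS service as having run out of CPU credit), here is a minimal sketch. It assumes an AWS-style burstable credit model in which one credit equals one vCPU at 100% for one minute. The report does not state the instance type, so the function name and every number below are hypothetical.

```python
from typing import Optional


def minutes_until_exhaustion(balance: float, earn_per_hour: float,
                             vcpus: int, utilization: float) -> Optional[float]:
    """Estimate minutes until a CPU credit balance reaches zero.

    Assumes one credit = one vCPU at 100% utilization for one minute.
    Returns None when credits are earned at least as fast as they are spent.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    spend_per_minute = vcpus * utilization   # credits spent per minute
    earn_per_minute = earn_per_hour / 60.0   # credits earned per minute
    net_burn = spend_per_minute - earn_per_minute
    if net_burn <= 0:
        return None                          # balance is not shrinking
    return balance / net_burn


if __name__ == "__main__":
    # Hypothetical figures: 2 vCPUs, 30 credits earned per hour, 90 banked.
    print(minutes_until_exhaustion(90, 30, 2, 0.20))  # light load
    print(minutes_until_exhaustion(90, 30, 2, 0.75))  # after extra load
```

With these made-up figures, the light-load case never exhausts the balance, while the heavier case drains it in 90 minutes. Under this model, a server that is stable for a long time can still fail some time after a new background workload is added.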

Lessons Learned:
This incident underscored the critical importance of high availability for DNS services.
Because we lack facilities to thoroughly test core services in the Staging environment, we were unable to test the installation process or its effect on server resources on the Production Gateway Servers.

Because of this, we have uninstalled ClamAV from these servers and will be transitioning from self-managed DNS to AWS VPC DNS, so that resourcing will be dynamic by default.

Posted Sep 18, 2023 - 12:12 AEST

Resolved

We are now closing this incident. We have diagnosed that the fault was related to an internal DNS service that ran out of CPU credit due to an update being applied and increased load from services. We have increased the credits on the server and are also looking to remove this internal DNS service entirely and replace it with AWS DNS services to avoid future instability.

Posted Sep 11, 2023 - 14:54 AEST

Monitoring

We have identified the issue as a failure of our primary internal DNS service. We have restarted the DNS services and things are returning to normal. We are currently monitoring the situation and investigating the root cause.

Posted Sep 11, 2023 - 11:21 AEST

Investigating

We have an issue resolving DNS internally in the cloud hosting environment, which is causing a major outage on all clusters. We are currently investigating.

Posted Sep 11, 2023 - 11:00 AEST

This incident affected: Sydney Cluster 1 (Website), Sydney Cluster 2 (Website), and Sydney Cluster 3 (Website).