2024-03-06 Cloud Email Outage
Incident Report for Schoolbox
Postmortem

Post Incident Report

On Wednesday 6th March it was identified that the server responsible for sending mail for all Cloud Instances was offline. Any email sent from Schoolbox Cloud Instances while the server was offline would have been lost. Users sending email would not have received any indication that their emails would not be delivered. This outage also affected automatically sent email, such as New Digests and other notification email.

Root Cause

The root cause was the server losing its IP address after failing to renew its DHCP Lease. The renewal failed because the lease expiry coincided with an automated server configuration update and verification process that increased memory usage, leaving insufficient memory for the DHCP Lease Renewal.

Timeline

  • Email delivery failure for cloud instances began at 4:45pm on Tuesday 5th March
  • The issue was identified around 9:07am on Wednesday 6th March
  • The issue was rectified around 9:30am by rebooting the server
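
The durations implied by this timeline can be worked out with a short calculation. The sketch below assumes all three times are in AEDT (UTC+11), as stated in the Resolved notice; the variable names are illustrative only and the times are the approximate ones listed above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timeline times are AEDT (UTC+11), per the Resolved notice.
AEDT = timezone(timedelta(hours=11), "AEDT")

started = datetime(2024, 3, 5, 16, 45, tzinfo=AEDT)    # delivery failures began
identified = datetime(2024, 3, 6, 9, 7, tzinfo=AEDT)   # issue identified
restored = datetime(2024, 3, 6, 9, 30, tzinfo=AEDT)    # server rebooted

time_to_detect = identified - started    # how long the outage went unnoticed
time_to_restore = restored - identified  # how long the fix took once noticed
total_outage = restored - started        # full window in which email was lost

print("Time to detect: ", time_to_detect)
print("Time to restore:", time_to_restore)
print("Total outage:   ", total_outage)
print("Restored (UTC): ", restored.astimezone(timezone.utc).isoformat())
```

On these figures, roughly 16 hours 22 minutes of the 16 hour 45 minute outage elapsed before the issue was identified, and the restoration itself took about 23 minutes.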

Short Term Remediation

  • The server was rebooted

Long Term Remediation

  • A review of the Cloud Email Delivery architecture is underway, with the aim of increasing the availability and scalability of Cloud Email Delivery

Learnings/Further Actions

  • Additional monitoring has been implemented so that Priority 1 outage alerts reach the relevant Schoolbox staff promptly
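
This report does not describe how the additional monitoring works. As a general illustration only, the sketch below shows one common pattern for detecting this kind of outage: periodically probing the mail server's SMTP port and raising an alert after several consecutive failures. The function names, the threshold of three, and the host `mail.example.invalid` are all hypothetical and are not taken from Schoolbox's systems.

```python
import socket
from typing import Iterable


def smtp_port_reachable(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def should_alert(results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True if the probe history contains `threshold` consecutive failures."""
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical host; a real check would run on a schedule and keep history.
    history = [smtp_port_reachable("mail.example.invalid", timeout=2.0) for _ in range(3)]
    if should_alert(history):
        print("ALERT: mail server unreachable on 3 consecutive probes")
```

Requiring consecutive failures rather than a single one is a design choice that reduces false alarms from brief network blips, at the cost of a slightly later alert.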

I want to take this opportunity to apologise on behalf of Schoolbox for any inconvenience that was caused. Multiple teams were part of this PIR process; we are all aware of the learnings and will continue to make ongoing improvements across our systems.

Posted Mar 13, 2024 - 23:40 UTC

Resolved
The cloud email server experienced a major outage on 2024-03-05 at around 16:45 AEDT. It remained offline until 09:30 AEDT on 2024-03-06. All emails sent during this period will have been lost.
Posted Mar 05, 2024 - 22:30 UTC