Root Cause
The root cause was a change to the search functionality that cleaned up old searches. The clean-up was paused temporarily to allow the CPU to recover and normal service to resume. The change has since been rectified.
Timeline
- A service disruption specific to cloud began at 10 AM
- The issue persisted until approximately 11:27 AM
- At 11:27 AM, a hotfix was applied to a running search function that was detrimental to the database and was affecting performance.
Short-term Remediation
- A hotfix was deployed to cloud servers to disable the build index script at 11:30 AM on Feb 23.
- A 23.1.7 point release was deployed on Feb 26 that removed the forum delete function from the 5-minute search index run. As a result, deleted forums will still appear in search results for up to 24 hours, until the nightly full index rebuild occurs.
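The effect of the point release on search results can be illustrated with a minimal sketch. It is not the Schoolbox implementation: the function names `incremental_update` and `nightly_rebuild`, and the use of plain dictionaries for the index and the forum data, are assumptions made purely for illustration.

```python
def incremental_update(index: dict, changed: dict) -> None:
    """Frequent run (hypothetical): add or refresh entries only.

    Deletions are intentionally not processed here.
    """
    index.update(changed)


def nightly_rebuild(live_forums: dict) -> dict:
    """Full rebuild (hypothetical): copy the current forums, dropping deleted ones."""
    return dict(live_forums)


live = {1: "Maths", 2: "Science"}
index = nightly_rebuild(live)

del live[2]                                # forum deleted during the day
live[3] = "History"                        # forum created during the day
incremental_update(index, {3: "History"})  # forum 2 is still in the index

index = nightly_rebuild(live)              # forum 2 is gone after the rebuild
```

The trade-off in this sketch is that the frequent run stays cheap, at the cost of stale entries for deleted forums until the next full rebuild.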
Long-term Remediation
- Ensure that only the workloads required by specific events, such as creating, deleting and modifying an object, are committed to the search index.
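One way to picture this event-driven approach is the sketch below. It is an illustration under stated assumptions, not the planned design: the `EventType` enum, the `SearchIndex` class and its `handle` method are hypothetical names, and the index is modelled as an in-memory dictionary.

```python
from dataclasses import dataclass, field
from enum import Enum


class EventType(Enum):
    CREATED = "created"
    MODIFIED = "modified"
    DELETED = "deleted"
    VIEWED = "viewed"  # hypothetical example of an event that should not touch the index


# Only these events change what search should return.
INDEXED_EVENTS = {EventType.CREATED, EventType.MODIFIED, EventType.DELETED}


@dataclass
class SearchIndex:
    documents: dict = field(default_factory=dict)

    def handle(self, event: EventType, object_id: int, text: str = "") -> bool:
        """Apply one event; return True if the event was committed to the index."""
        if event not in INDEXED_EVENTS:
            return False  # ignore events that do not affect search results
        if event is EventType.DELETED:
            self.documents.pop(object_id, None)
        else:
            self.documents[object_id] = text
        return True
```

In this sketch each create, modify or delete event results in a single small index update, and any other event is ignored, so no periodic clean-up pass over the index is needed.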
Learnings/Further Actions
Following this incident, we have formulated these actions to prevent similar events from occurring in the future:
- Consolidate error reporting for infrastructure emails
- Surface cluster errors on load to internal dashboard
I want to take this opportunity to apologise on behalf of Schoolbox for any inconvenience caused. Multiple teams were part of this PIR process, and we are all aware of the learnings and will continue to make ongoing improvements across our systems.