In May 2024, we faced a systems outage that impacted our infrastructure, applications, and data.
It was an unprecedented, one-of-a-kind incident that required an extensive recovery and restoration effort. Importantly, this was not the result of a malicious act or cyber-attack. Enabled by our multi-cloud strategy and robust data recovery practices, we were able to navigate the complexities of the situation, yet it was still a challenging process that affected our members and employers.
An independent audit confirmed that the root cause was an error by a third-party cloud provider. While this was beyond our direct control, we take our accountability seriously.
Watch UniSuper's Chief Technology Officer, Steve McGregor, share an overview of the lessons we've learned.
Lessons from the outage
The audit produced a range of findings: both things we did and are doing well and, importantly, areas of opportunity following the outage.
“At all times, our approach was to ensure that no member was worse off because of what had happened, and always protecting members’ best interests,” UniSuper Chief Technology Officer, Steve McGregor, said.
“Importantly, no member data was lost, exposed or at risk.
“The audit reinforced that while there are always areas for us to improve, we largely performed well during the black swan event.”
Here are some of the opportunity areas we’re already working on:
Vendor management
We take our responsibility as a trustee seriously and will always act in members’ best financial interests, including when third parties are involved. Like many organisations, we use third-party services in various aspects of our day-to-day operations. The outage reminded us how important it is to enhance our vendor management, risk control and oversight practices.
Two new vendor initiatives have been introduced:
- Vendor Management Manual – this introduces specific governance protocols for technology vendors.
- Vendor Risk Module – this aims to improve and standardise vendor risk processes and related controls.
Communication and transparency
When the outage occurred, keeping our members, employers, and stakeholders informed was a top priority. Through timely updates and clear communication, we ensured they had the information they needed while restoration efforts were underway. We’re sincerely grateful for their patience and understanding during that time.
Strong collaboration between teams and leaders enabled prompt information sharing and reporting, allowing us to deliver clear, co-ordinated messaging throughout the incident.
While our communication approach was a key strength, we’re always seeking to improve. As part of our review, we identified opportunities to enhance application health monitoring and reporting, ensuring even greater visibility across the business during future incidents.
Operational resilience through simplification
Resilience, leadership and good governance practices formed the foundations of our business-wide response to the outage, helping us to fully restore systems within 14 days. Crisis management structures with pre-established responsibilities, such as our crisis communications team, were already in place before the event. These structures included our experienced executives, who have navigated past adverse events.
With this strong base already in place, we can further bolster our crisis management structures and our operational efficiency. We continuously review our internal policies and procedures to ensure they remain effective. Following the outage, we've placed an even greater focus on identifying opportunities to simplify and enhance efficiency.
Among these opportunities are our back-up and recovery processes. We’re exploring how we can responsibly use AI to simplify them, and we aim to extend that approach to the way we service members.
Where are we now?
We continue to focus on the learnings from this experience and strengthen our processes, so that if we ever face a similar – or more extreme – challenge again, we’re even better prepared and positioned to deliver the services our members expect and deserve.
Read the transcript
Hi everyone. I'm Steve McGregor, Chief Technology Officer at UniSuper.
Following the outage we experienced in May 2024, we committed to sharing with our members some of the lessons we learnt from our response. At all times, our approach was to ensure that no member was worse off because of what had happened and that we were always protecting our members’ best interests.
Importantly, no member data was lost, exposed, or at risk. We've just completed a detailed audit with a third party, and pleasingly, they have reinforced that while there are always areas for us to improve, we largely performed well during the ‘black swan event’. Specifically, the report revealed the following key points. Our multi-cloud backup and recovery setup was crucial in restoring backup images and transaction data as soon as possible.
Our strong governance, leadership, and focus on our members and employers meant that we were able to work well together and respond to the outage. Regular and timely communications to members ensured that you were updated on our response promptly. The contributions from major suppliers were essential in overcoming technical challenges and restoring all systems, data, applications and services to normal operations within 14 days. A remarkable effort.
We've posted an article on our website with more information about the lessons we've learned from our response, and we encourage you to read it. And as always, if you have any questions, do get in touch.