Swift Incident Response: Lessons from the Sui Network Outage

On November 21, 2024, the Sui Network experienced an outage of about two and a half hours caused by a bug in congestion control. The incident, which occurred between 1:15 and 3:45 am PT, involved a crash loop affecting all validators, preventing any transaction processing. The Sui Foundation quickly identified the problem and worked with the validator community to restore operations within 15 minutes of releasing the fix.

The incident highlights the importance of effective incident response, a concern the insurance industry shares. As insurance executives, you understand the critical role of risk management and mitigation. The Sui Network outage is a reminder that even the most advanced systems can fail, and that it is crucial to have a plan in place to address failures swiftly.

The issue stemmed from a bug in the congestion control code, specifically an assert! statement, which triggered a crash when the estimated execution cost was zero. This problem was linked to the TotalGasBudgetWithCap mode, briefly enabled in protocol version 63 and reintroduced in version 68. The bug manifested when the network received a transaction with a mutable shared object input and zero MoveCall commands, causing all validators to crash.
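The failure mode described above can be illustrated with a minimal Rust sketch. This is not the Sui codebase: the Tx struct, the COST_CAP_PER_MOVE_CALL and MIN_COST constants, and all function names are hypothetical, and the idea that the estimate is the gas budget capped in proportion to the number of MoveCall commands is an assumption made only for illustration. The tolerant variant shows one possible way to avoid the crash; the source does not describe how the actual fix works.

```rust
/// Hypothetical, simplified view of a transaction that writes to a shared object.
pub struct Tx {
    pub gas_budget: u64,
    pub move_calls: u64,
}

/// Assumed per-MoveCall cap, chosen only for illustration.
pub const COST_CAP_PER_MOVE_CALL: u64 = 1_000;

/// Assumed floor used by the tolerant variant below.
pub const MIN_COST: u64 = 1;

/// Estimate the cost as the gas budget capped by the number of MoveCall commands.
/// With zero MoveCall commands the cap, and therefore the estimate, is zero.
pub fn estimated_cost(tx: &Tx) -> u64 {
    tx.gas_budget
        .min(COST_CAP_PER_MOVE_CALL.saturating_mul(tx.move_calls))
}

/// Strict variant: panics when the estimate is zero, mirroring the reported crash.
pub fn charge_strict(tx: &Tx) -> u64 {
    let cost = estimated_cost(tx);
    assert!(cost > 0, "estimated execution cost must be non-zero");
    cost
}

/// Tolerant variant: floors the estimate at MIN_COST instead of panicking.
pub fn charge_tolerant(tx: &Tx) -> u64 {
    estimated_cost(tx).max(MIN_COST)
}
```

In this sketch, a transaction with move_calls set to 0 makes charge_strict panic, while charge_tolerant returns MIN_COST. Because every validator processes the same transaction, a panic of this kind would repeat on each restart, which is consistent with the crash loop described in the incident.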

Congestion control in the Sui network is crucial for managing transaction rates to shared objects, ensuring the network does not become overloaded. This system was recently upgraded to enhance shared object utilization by accurately estimating transaction complexity. However, the upgrade inadvertently introduced the bug causing the outage.
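As a rough illustration of rate limiting per shared object, the Rust sketch below admits a transaction only if its estimated cost fits within a budget for every shared object it writes to, and otherwise signals that it should be deferred. The SharedObjectBudget type, the try_admit method, and the use of string object IDs are hypothetical; the source does not describe Sui's scheduling logic at this level of detail.

```rust
use std::collections::HashMap;

/// Hypothetical scheduler state: estimated cost already admitted per shared object.
pub struct SharedObjectBudget {
    limit_per_object: u64,
    used: HashMap<String, u64>,
}

impl SharedObjectBudget {
    pub fn new(limit_per_object: u64) -> Self {
        Self {
            limit_per_object,
            used: HashMap::new(),
        }
    }

    /// Returns true and records the cost if every touched object stays within
    /// its limit; returns false (the caller should defer) otherwise.
    /// Assumes each object ID appears at most once in `objects`.
    pub fn try_admit(&mut self, objects: &[&str], cost: u64) -> bool {
        let fits = objects.iter().all(|id| {
            let used = self.used.get(*id).copied().unwrap_or(0);
            used.saturating_add(cost) <= self.limit_per_object
        });
        if fits {
            for id in objects {
                *self.used.entry((*id).to_string()).or_insert(0) += cost;
            }
        }
        fits
    }
}
```

A scheduler of this shape depends on the cost estimate being meaningful: an estimate that is too low admits too much work against one object, while an estimate that is too high defers transactions unnecessarily.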

Upon identifying the problem, Sui engineers promptly devised a fix. The corrective code was deployed to both the Mainnet and Testnet in versions v1.37.4 and v1.38.1, respectively. The rapid deployment was facilitated by an outstanding response from the validator community, enabling the network to resume operations within 15 minutes of releasing the fix.

This incident underscores the effectiveness of Sui’s incident detection and response mechanisms. Automated alerts promptly notified engineers, who collaborated with the validator community to address the issue swiftly. Moving forward, Sui plans to enhance its testing systems to prevent similar bugs and streamline its build workflows to reduce incident response times.

With Riskwolf, you can turn real-time data into insurance. Using unique real-time data and dynamic risk modelling, we enable insurers to build and operate parametric insurance at scale. Simple. Reliable. Fast.

For more detailed information, please visit the Sui Foundation.