Skip to main content

Network Disruption Postmortem: 2022-10-26

· 6 min read

 

On Wednesday, 2022-10-26, at approximately 1:57 AM UTC, Validators stopped producing blocks at block 1,587,313, and a chain halt occurred. Consequently, two days later, at 3:57 PM UTC, a second chain halt occurred at block 1,589,496. Normal chain operations resumed approximately 32 hours later at 2022-10-30 1:32 AM UTC.

Throughout the disruption, Hotspots were not able to fully participate in Proof-of-Coverage. All funds were safe, but pending transactions were delayed until the halt was resolved. Pending Transactions include adding a Hotspot, changing a Hotspot Location, transferring a Hotspot, payments, and token burns. During the disruption, data transfer was unaffected by the chain halt. Over the next day, the performance of the chain was not as expected.

Cause

The core team identified a root cause of the chain halt. The chain halt was triggered as a result of activating the latest chain variables, which resulted in ledger drift caused by the double application of election penalties on the Consensus Group that the rest of the chain did not agree with.

The core developers have identified two main issues that contributed to this chain halt:

  • Consensus penalties got applied to the lagging ledger differently than they did on the leading ledger.
  • Some Consensus Group members had their lagging ledgers fall behind an additional block relative to other members.

The first issue caused snapshots to stop working (snapshots were taken via the lagging ledger, which had become corrupted). This corruption prevented nodes that saw an invalid block and tried to recalculate the ledger to continue syncing because they started the recalculation based on the corrupted lagging ledger.

The second issue caused “split rounds”: some Consensus Group members had 51 blocks (historical ledgers) and other members (correctly) only had 50.

Certain transactions (including poc_receipts) define transaction validity partly in terms of whether a historical ledger is available at a certain height. This meant every time a poc_receipt transaction from the 51st previous block was submitted, the Consensus Group members would disagree. Some members would try to include that transaction in the proposed block while others would not, causing the halt.

This event correctly activated the “round skip timer” which allowed the Consensus Group to drop transactions in the proposed block and proceed until it encountered another transaction from prior to the previous 50 blocks.

The Validator release forced no more than 50 blocks to be retrievable, even if more were available locally so that particular issue was no longer possible.

Recovery

The chain variables were rolled back, but the Consensus Group began to drift away from the main chain.

To prevent overwhelming Hotspot connections, Validator operators were asked to disable gRPC ports (default 8080).

To recover the team created a rescue block at (7:08 PM), and assembled a new Consensus Group to restart the chain.

Resolution

The core team released candidate Validator v1.15.8 to resolve the chain halt by forcing no more than 50 blocks to be retrievable.

Looking forward, the core team has released Validator beta build, v1.15.10. Among the key fixes, this release includes an update to change snapshot generation to avoid using the possibly corrupt lagging ledger in favor of a known-good ledger checkpoint.

Power of Community

Thank you again to our Validator community for working with us and acting swiftly to help resolve this issue. During the chain halt and resolution, Validators and the entire community are asked to slop transactions or refrain from submitting duplicate transactions. Your continued dedication and support have such a huge positive impact on the entire Helium community that it does not go unnoticed.

Timeline of Key Announcements

Throughout the disruption, the team worked to keep users up to date on the status and required actions. All below times in UTC.

2022-10-26

1:57 AM The core team started to investigate the first chain halt at block height 1,587,313 with a plan to update the community as soon as they learned more.

2:15 AM Root cause required a rescue block to solve. Validator operators were instructed to close port 8080. The chain remained halted.

3:08 AM Core developers assembled a new Consensus Group. After the group is elected, the core team would issue a rescue block to get past the halt. The chain remained halted.

4:16 AM The chain resumed. A new Consensus Group was assembled and began creating new blocks. Other services were being monitored, including the ETL for the API that serves data to the explorer and other apps.

5:08 AM The remainder of Validators were catching up to the new block height. The core developers continued to monitor the chain.

6:27 AM Some validators were stuck at block height 1,587,339 or prior to 1,587,313. They issued commands for Validators stuck at the latter.

2022-10-27

5:21 PM The chain is operating, and transactions are clearing, but many Validators remain offline. For hotspot owners, this means Proof-of-Coverage activity was degraded. The core developers were still looking for a root cause. Validators were instructed to continue to force re-syncs of blocks, and an emergency release was expected.

10:14 PM PoC activity had somewhat recovered, and the chain was operational. The core developers notified Validator operators that they would release a snapshot to get all Validators back to operating normally. This is a temporary solution as they continue to solve the root cause and should help the network recover. Another chain halt was still possible, but the team and operators continued to monitor the chain.

2022-10-28

3:45 AM The core developers released a blessed snapshot. Many participants (Validators, Router, ETL, Node) remained out of sync. All participants were instructed to download the snapshot or pull the latest-snap from snapshots.

3:57 PM The core developers noticed another chain halt at block 1,589,496.

4:59 PM The core team continued to monitor the Consensus Group and instructed validators to close port 8080.

6:03 PM The Core team continued to monitor the Consensus Group. Which, at this time, had agreed on a block and began moving into block 1,589,497. Validators remained instructed to close port 8080.

7:06 AM The core team continued to monitor the Consensus Group. Which at this time was stuck again on block 1,589,531.

9:06 PM The chain had progressed a bit further, but it was yet to form a new Consensus Group, leaving rewards delayed and PoC degraded. The community was asked to be patient at this time, and the resolution would likely go through the weekend.

2022-10-29

12:51 AM The chain halt resolved itself. Regular blocks, Consensus Group elections, and rewards were back up and working.

11:36 PM Thanks to the logging introduced in the last few days, the core developers were able to identify the root cause of the various halts.

2022-10-30

1:32 AM The chain resumed producing blocks, and a new Validator version 1.15.8 was released to address the latent bug.