
Network Disruption Postmortem: Week of 2021-11-15


Last week the Helium Network experienced several disruptions. Throughout the disruptions, there was no risk to HNT holders, and while some devices were unable to use the Network for data transfer, the majority of devices were unaffected.

The Helium Blockchain has been running without serious interruption for over two years and more than 1 million total blocks.

The goal of this postmortem is to provide a transparent and technical overview of the events that took place, why they occurred, and what the Helium community and core developers did to not only address the issues, but improve the network so it can continue to scale. The performance improvements developed during the week and deployed by Hotspot owners and Validator operators will benefit the network well beyond resolution of these disruptions as it continues its unprecedented growth.

Huge Thanks

We’d like to take this opportunity to give a HUGE THANKS to our community for your support and patience, especially our Validator operators who quickly updated their nodes with the latest releases to help resume normal block operations.

Summary of Disruptions

Starting on 2021-11-15 at approximately 2000 UTC, the Helium Network was offline for approximately 30 hours, returning to normal operations on 2021-11-17 at 0200 UTC. All blockchain activity was halted during this time.

On 2021-11-20 at approximately 1805 UTC, the core developers were alerted to another outage. The Helium Network was offline for approximately 12 hours, during which all blockchain activity was halted.

On 2021-11-23 at approximately 0045 UTC, a chain variable was activated that caused some miners to stop syncing at block 1,107,995. The root cause was found to be a nonce disagreement when miners consumed a snapshot that included the rescue block from the second outage. Some Hotspots and Validators were unable to sync the blockchain for approximately 4 hours. The blockchain continued moving during this time, and there was no impact on Proof-of-Coverage or packet transfer for Hotspots that remained in sync.

Normal operations of the network were restored with the help of the Helium community, which continues to be a strength of the network, and specifically the Validator operators who worked closely with the core developers to promptly update their Validators with releases containing the performance improvements needed to proceed past the bottleneck caused by an abnormally large block.

Timeline of Events

Note: Approximate times are used for some of these events as the goal is to provide the community with an overall sequence of events.

2021-11-15 Outage

  • 2021-11-15 20:00 UTC core developers identified a community-hosted Router causing a large number of state channel dispute transactions which ultimately caused an extremely large block (approx. 140MB) and created a bottleneck halting the blockchain
  • 2021-11-16 01:30 UTC core developers worked with the community member to apply a fix and prevent recurrence of the issue and later also updated other Router instances
  • 2021-11-16 03:00 UTC placed API into maintenance mode preventing the build-up of pending transactions which would cause additional issues once block production resumes
  • 2021-11-16 07:34 UTC the team deployed Validator release v1.5.0 with performance updates to allow Validators in the Consensus Group to break the bottleneck by consuming the large block and return to normal block production
  • 2021-11-16 09:24 UTC the team continued to monitor Consensus Group Validators to pick up the new release and also tagged updated versions of blockchain-node, blockchain-etl, and router
  • 2021-11-16 20:58 UTC the core developers released Validator v1.5.5 with further performance improvements identified after observing v1.5.0 still struggling with the large block.
  • 2021-11-16 23:50 UTC the core developers were advised by a Validator provider that their nodes might not reenter consensus due to consensus data lost during the upgrade process.
  • 2021-11-17 02:00 UTC block production resumed.
  • 2021-11-17 08:11 UTC the core developers tagged the 2021.11.17.2 emergency miner firmware release, which included an updated blockchain-core and a snapshot to move beyond the large block, so Hotspots could resume syncing, POC participation, and data transfer activity.
  • 2021-11-18 02:42 UTC Validator v1.5.7 was released, which included fixes to the transaction manager and improved block synchronization across the network and transaction handling times.
  • 2021-11-18 16:18 UTC a potential deadlock issue was discovered and Validator v1.5.8 was released to fix it.
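
As a quick check on the roughly 30-hour figure quoted for this outage, the halt and block-resumption timestamps in the timeline above can be subtracted directly. The sketch below is illustrative only; the helper name `outage_hours` is made up for this example and is not part of any Helium tooling.

```python
from datetime import datetime, timezone


def outage_hours(start: datetime, end: datetime) -> float:
    """Return the elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


# Timestamps taken from the timeline above (both UTC).
halt_started = datetime(2021, 11, 15, 20, 0, tzinfo=timezone.utc)
blocks_resumed = datetime(2021, 11, 17, 2, 0, tzinfo=timezone.utc)

first_outage = outage_hours(halt_started, blocks_resumed)
print(f"First outage lasted about {first_outage:.0f} hours")
```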

2021-11-20 Outage

  • 2021-11-20 18:05 UTC the core team was alerted to a long block forming as a result of a bug in state channels from the Discovery Mode router. Earlier mitigations by the core team prevented this block from becoming too large.
  • 2021-11-20 18:30 UTC the core team identified some Hotspots that were unable to process that block.
  • 2021-11-20 18:51 UTC the app team decided to turn off the Discovery Mode feature to prevent further state channels from opening and closing, creating large blocks in the process. At this time, the blockchain was still moving.
  • 2021-11-20 19:11 UTC the core devs identified another state channel closing and a potential large block. Attempts were made to generate a snapshot for the network to consume to move past the block.
  • 2021-11-20 20:50 UTC unable to get a snapshot, the core devs notified the community of a blockchain halt.
  • 2021-11-21 01:11 UTC the core team released Validator v1.5.12 to the consensus group to get past the block.
  • 2021-11-21 06:52 UTC unable to get the network past the block, the core team and members of the Decentralized Wireless Alliance proposed a Rescue Block, which would reelect the Consensus Group and apply a chain variable change to close state channels used by Discovery Mode.
  • 2021-11-21 07:52 UTC The Rescue Block was issued, and block production resumed shortly after.
  • 2021-11-21 13:22 UTC a Hotspot firmware release was made available to all manufacturers to resume blockchain sync, Proof-of-Coverage activity, and data transfer
  • 2021-11-21 20:48 UTC a second firmware release was made available to all manufacturers with lower-memory Hotspots to make it easier to follow the blockchain

2021-11-23 Incident

  • 2021-11-23 00:45 UTC a proposed chain variable to adjust validator penalties was activated
  • 2021-11-23 00:46 UTC the activated chain variable immediately caused syncing issues at block 1,107,995.
  • 2021-11-23 01:04 UTC blocks were still being produced by the Consensus Group and some Hotspots were still syncing and participating in mining. The team continued investigating the root cause.
  • 2021-11-23 01:26 UTC the core developers identified that the issue was related to a snapshot that included the rescue block and the recently deployed chain variable. Only Hotspots and Validators that consumed the snapshot were affected.
  • 2021-11-23 02:32 UTC the core team concluded a snapshot would resolve this issue and began finding a suitable candidate.
  • 2021-11-23 02:58 UTC unable to find a suitable snapshot, the core team looked to issue a Hotspot firmware release to realign nonce values between affected and unaffected miners.
  • 2021-11-23 04:10 UTC the team had a firmware candidate, 2021.11.22.0, and started beta testing on known stuck Hotspots.
  • 2021-11-23 05:25 UTC beta testing results were positive, and the core team released 2021.11.22.0 for GA and advised all makers to update to this release.
  • 2021-11-23 06:28 UTC another build, 2021.11.22.1, was made available for makers whose Hotspot fleets experienced out-of-memory issues.
  • 2021-11-23 UTC Hotspot syncing resumed for those stuck on block 1,107,994 and Proof-of-Coverage activity resumed.

Event Details

2021-11-15 Outage

The core developers, together with a community member, identified an issue impacting the blockchain that started with abnormally long block times and then evolved into a full halt at 2021-11-15 20:00 UTC. A bug in Router code that triggered a community-hosted Router instance to propagate many state channel dispute transactions was exacerbated by the loading of a recently released chain variable. This combination resulted in an extremely large block (around 140 MB), orders of magnitude larger than typical blocks.

Validators in the Consensus Group were unable to consume this large block causing a bottleneck that ultimately resulted in a halt impacting all blockchain activity including PoCs, challenges, and more.

On 2021-11-16, a few hours later (approximately 01:30 UTC), the core developers worked with the community member to apply a fix to the community-hosted Router intended to prevent a recurrence. This patch was released to other Router instances shortly after.

The team continued to work on addressing the large block that caused the bottleneck, including fixing the boot looping caused by the large block, and pushing an update to increase the speed to close state channels.

Since blockchain activity remained halted, at approximately 03:00 UTC the core developers made the decision to put the API into maintenance mode. This meant all newly submitted transactions would fail, including those from the wallet (for example, payment, add Hotspot, assert location, and transfer Hotspot). This decision prevented the buildup of a huge queue of pending transactions that would have overwhelmed the servers once the blockchain returned to normal operations.

The core developers worked on a number of performance fixes to allow Validators in the Consensus Group to consume and move past the large block and resume normal block production.

At 07:34 UTC the core developers released a Validator update and asked Validators in the Consensus Group to apply it urgently. Two-thirds plus one of the members were required to update to the release in order to continue producing blocks and return the network to normal operations.
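
The "two-thirds plus one" requirement can be expressed as a small calculation. The sketch below assumes one common reading, namely the smallest member count strictly greater than two-thirds of the group; the function names and the group size of 16 are illustrative values chosen for this example and are not taken from this postmortem.

```python
def updates_needed(group_size: int) -> int:
    """Smallest member count strictly greater than two-thirds of the group.

    Assumption: "two-thirds plus one" is read as floor(2n / 3) + 1.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Integer arithmetic avoids floating-point rounding on 2/3.
    return (2 * group_size) // 3 + 1


def can_resume(updated_members: int, group_size: int) -> bool:
    """True once enough Consensus Group members run the new release."""
    return updated_members >= updates_needed(group_size)


# Hypothetical group size, for illustration only.
print(updates_needed(16))
print(can_resume(10, 16), can_resume(11, 16))
```

Under this reading, block production stays stalled until the threshold is met, which is why the release had to be picked up by individual operators rather than by the core developers alone.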

During this time, work also began on an emergency Hotspot firmware release, to be shipped once a snapshot had been agreed upon by the Consensus Group.

By 09:24 UTC the team had tagged updated versions of blockchain-node, blockchain-etl, and router and asked the community to update to these latest versions to ensure smooth operations once block production resumed.

On 2021-11-16 at 20:58 UTC the core developers released Validator v1.5.5. This release included a number of performance improvements and was designed to allow Validators to consume and manage the large block causing the bottleneck and resume normal block production.

However, creating a snapshot to update Hotspots requires a minimum number of blocks to be added to the chain so that the snapshot does not include the large block itself.
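
The waiting requirement can be illustrated with simple height arithmetic. This sketch assumes, purely for illustration, that a snapshot taken at a given height carries a fixed trailing window of recent blocks; the window size, the heights, and both function names are hypothetical and do not describe the actual blockchain-core snapshot format.

```python
def snapshot_includes(height: int, snapshot_height: int, window: int) -> bool:
    """True if `height` falls inside the trailing window carried by the snapshot.

    Assumption: a snapshot at `snapshot_height` carries blocks
    snapshot_height - window + 1 through snapshot_height.
    """
    first_carried = snapshot_height - window + 1
    return first_carried <= height <= snapshot_height


def earliest_safe_snapshot_height(large_block_height: int, window: int) -> int:
    """Lowest snapshot height whose trailing window excludes the large block."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return large_block_height + window


# Hypothetical numbers: large block at height 1000, 50 blocks per snapshot.
safe = earliest_safe_snapshot_height(1000, 50)
print(safe, snapshot_includes(1000, safe, 50))
```

In this model the chain has to advance by a full window past the large block before any usable snapshot exists, which matches the delay between block production resuming and the firmware release.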

With block production returning to normal at approximately 0200 UTC on 2021-11-17, the core developers needed to wait until a sufficient number of blocks had formed before they could create a miner release containing an updated blockchain-core and a blessed snapshot, which would allow Hotspots to resume syncing, POC, and data transfer activity.

On 2021-11-17 at 08:11 UTC the core developers pushed the miner firmware release with an updated blockchain-core and a snapshot to move beyond the large block, so older Hotspots could resume syncing, POC participation, and data transfer activity.

Core developers released Validator v1.5.7 on 2021-11-18 at 02:42 UTC which included fixes to the transaction manager, which was struggling to detect it had synced with the chain, and provided performance improvements including improved blockchain synchronization across the network and faster transaction handling times. Hours later at 16:18 UTC the team released Validator v1.5.8 to fix a discovered deadlock issue.

2021-11-20 Outage

On 2021-11-20 18:05 UTC, the core Helium developers received a notification that it was taking longer than normal to form a new block (typically blocks are formed every minute).

Upon further investigation, it was identified that a state channel processing data transfer packets for Discovery Mode was having issues closing due to a bug. When this happens, a large block can start to form. The 2021-11-15 chain halt also happened when a large, 140 MB block formed, which Hotspots and Validators were unable to ingest. Fortunately, that halt also led the core devs to implement a variety of fail-safe measures to prevent blocks of that size from forming again, as outlined in the Engineering Blog. However, a second large block still formed on 2021-11-20, which was determined to be 25 MB in size.

Around 2021-11-20 18:30 UTC, the team identified a subset of Hotspots with 1 GB of RAM that would not be able to ingest the block. At this point, the Discovery Mode router had several more open state channels that could potentially become degraded and cause long blocks.

To avoid future degradation of the Network, the Helium App team decided to disable the Discovery Mode feature until further notice. An app notification, status incident, and announcement on Discord regarding the large block and disabling of Discovery Mode were sent. During the day, additional progress updates were also posted on https://status.helium.com and Discord as the team continued to monitor block formation.

Once the team identified the issue (long blocks from state channels) and root cause (Discovery Mode router), the team issued a status update and made a Discord announcement.

The chain was still moving at this point.

At 2021-11-20 19:11 UTC, another state channel from the Discovery Mode router was attempting to close, creating a long block. The team decided to create a new snapshot past the long block in an effort to get the Network past it.

Unable to get a snapshot, and with more state channels closing with the same problem, the core developers declared a formal blockchain outage at 20:50 UTC.

The core developers then began actively working with Validators to push out a release to get the Consensus Group past the large blocks. At 2021-11-21 01:11 UTC, Validator build v1.5.12 was announced, with Validators recommended to update as soon as possible.

At 2021-11-21 06:52 UTC, an announcement went out from the Decentralized Wireless Alliance, in conjunction with several core developers, on a proposed Rescue Block, which would reelect the Consensus Group and close state channels used by Discovery Mode. The Rescue Block was issued during the following hour, and block production resumed shortly after.

At 2021-11-21 13:22 UTC, a new required Hotspot firmware release was made available to all Hotspot manufacturers that would enable their fleet of Hotspots to sync and begin to participate in Proof of Coverage and data transfer once again. A second release was announced at 20:48 UTC, allowing lower-memory Hotspots to follow the chain more easily. Read more about both releases on the Engineering Blog.

2021-11-23 Incident

The core team, with DeWi, decided to activate HIP 47 to increase DKG penalties for Validators. A DKG penalty can occur if an election fails and the Validator was part of the reason for the failure. This can occur throughout the DKG process or during the signature phase. Failed elections are significant to the network because they prevent new groups from being elected and rewards from being issued to all network participants, including Hotspots.

At 2021-11-23 00:45 UTC the proposed chain variable to adjust DKG penalties was activated per HIP 47.

Immediately afterward, the activated chain variable caused syncing issues at block 1,107,995. At this point, blocks were still being produced by the Consensus Group, and Hotspots that remained in sync were still participating in Proof-of-Coverage. Packet transfer was not interrupted. The team continued investigating the root cause.

At 2021-11-23 01:26 UTC the core developers identified that the issue was related to a snapshot that included the rescue block issued at block 1,105,523 and the recently deployed chain variable. Only Hotspots and Validators that consumed the snapshot were affected.

At 2021-11-23 02:32 UTC the core team concluded that a snapshot would resolve the issue and began looking for a suitable candidate: a snapshot small enough for Hotspots to consume to get them past the block.

Unable to find a suitable snapshot, the core team decided to issue a Hotspot firmware release to realign nonce values between affected and unaffected miners. The proposed change was submitted and reviewed. The team looked for Hotspots stuck syncing at block 1,107,994 as beta testing candidates, and several were identified.

At 2021-11-23 04:10 UTC the team had a firmware candidate, 2021.11.22.0, and started beta testing on known stuck Hotspots. The testing successfully unstuck affected Hotspots, and the team let the update soak for an hour.

At 2021-11-23 05:25 UTC the core team released 2021.11.22.0 for GA and alerted all makers that this was a required update.

At 2021-11-23 06:28 UTC another build, 2021.11.22.1, was made available for makers whose Hotspot fleets experienced out-of-memory issues when loading a snapshot. This was not a required update.

The team then announced an all clear. Hotspot syncing resumed for those stuck on block 1,107,994, and those Hotspots could again participate in Proof-of-Coverage.

Key Fixes

This is a shortlist of key fixes to prevent the recurrence of the events that caused the initial outage.

A complete list and specific details to address these events can be found on the engineering blog.

  • A few hours after the cause was identified, the team provided a fix to prevent the issue caused by the Router from recurring.
  • Significant performance enhancements were implemented (for example, testing saw Validators go from 600% CPU utilization to 6%) that will continue to benefit the network during normal operations.
  • A cap on the size of non-reward blocks was added (initial default 50 MB). This limit is tied to a chain variable so it can be readjusted in the future. Blocks that exceed the threshold will be trimmed to fit.
  • A cap of 1 MB on individual consensus member transaction proposals was implemented, as large proposals were identified as a significant performance bottleneck.
  • Snapshot loading, a crucial tool for skipping low-resource Hotspots over large blocks, has been improved to require dramatically less memory.
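
To make the size caps above concrete, here is a minimal sketch of trimming a list of transactions to fit under a byte limit. The postmortem does not describe how trimming is performed, so the keep-in-order strategy, the function name `trim_to_fit`, and the interpretation of 50 MB as 50 × 1024 × 1024 bytes are all assumptions made for illustration.

```python
from typing import List, Tuple

MB = 1024 * 1024  # assumption: binary megabytes
BLOCK_CAP_BYTES = 50 * MB  # initial default cap for non-reward blocks
PROPOSAL_CAP_BYTES = 1 * MB  # cap on one member's transaction proposal


def trim_to_fit(
    txns: List[bytes], cap_bytes: int = BLOCK_CAP_BYTES
) -> Tuple[List[bytes], List[bytes]]:
    """Split transactions into (kept, deferred) so kept fits under the cap.

    Assumption: transactions are kept in their original order and trimming
    stops at the first one that would push the total past the cap.
    """
    kept: List[bytes] = []
    used = 0
    for index, txn in enumerate(txns):
        if used + len(txn) > cap_bytes:
            # Everything from this transaction onward waits for a later block.
            return kept, txns[index:]
        kept.append(txn)
        used += len(txn)
    return kept, []


# Tiny illustrative payloads with a 10-byte cap instead of the real limits.
sample = [b"a" * 4, b"b" * 4, b"c" * 4]
kept, deferred = trim_to_fit(sample, cap_bytes=10)
print(len(kept), len(deferred))
```

The same helper could be called with `PROPOSAL_CAP_BYTES` to bound an individual proposal. Stopping at the first overflow, rather than skipping ahead to smaller transactions, keeps the original ordering intact, which seems the safer choice when transactions may depend on one another.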

Although Helium has rapidly become the world’s largest contiguous wireless network, it is still early days and issues will happen along the way. However, we believe that with the help of our community, over time, the network will continue to grow and increase its resiliency.