Block Weather

While yesterdays fix for block times (see here) went well, a seemingly unrelated change has caused quite a bit of trouble today. This PR simplified our P2P relay system quite a lot and should allow us to get better performance for less memory. However, the random routine in our peerbook (which the PR uses heavily) does not filter for stale peer entries (which contain the actual IP addresses to open a network connection to). So, after 24h (the global staleness deadline) a large number of peers become un-dialable via the address which random returns. For various reasons, dials are done via a string representing the peer, rather than the peer entry itself. When random would return a stale peer, the dial would not be able to find that peer because its entry was stale and the downstream code was correctly filtering for stale peers. 24 hours after yesterday’s release, which caused all Hotspots to reboot and update their peer entries, this was affecting around 12% of all peers which returned from random.

This issue by itself was not a horrible problem. Mostly things would retry in this situation. But our gossip server does not appear to handle this situation correctly, and when a gossip worker dies due to the above misdial, it never seems to notice, so eventually the node cannot establish any gossip connections and the gossip network collapses like we have seen today. You can read this post for more context.

This release adds filtering both for the relay and for the peerbook random function, so now we favor newly updated random peers and never return a stale peer from the random function. We’ve also found and fixed the tracking of gossip workers, so if they crash their replacement is correctly tracked and used for sending gossip messages. In addition, it bumps the assumed valid block. The speed-ups for hbbft mentioned here will land in a release early next week.

Contents

  • Favor Recently Updated Peers for Relay: PR: libp2p/268
  • Filter Stale Peers Out of random/1,2,3,4 Returns: PR: libp2p/269
  • Track group workers by ref, not pid, so restarts can be handled: PR: libp2p/270

Plan

This release will be in beta starting around 3pm Thursday April 23rd, for a short time, and then will be released to GA.