Recently we’ve seen some gossip issues, where the consensus group and some small number of Hotspots manage to forge ahead on schedule, but the bulk of the Hotspots and their ancillary services fall, in lockstep, behind. Since Proof of Coverage (PoC) receipts are only valid for a short period of time, most of them begin to fail and more and more nodes fall out of step with the consensus group. At some point this morning, we experienced a prolonged and severe drop in PoC receipt rates (challenge rates are still mostly fine, due to recent fixes). Manually syncing the full nodes that we maintain in the cloud did not manage to fix the issue, so we’re doing a single PR release that adjusts sync to be more aggressive, in hopes that this lessens the issue in the future. There will no doubt be a few more minor gossip adjustments in the coming days.
The PR (details below) changes which time Hotspots pay attention to when they try a new sync, which should allow them to sync more often. Without a long delay (originally introduced to fix an issue with clock skew, but too broadly applied) we should be able to keep up with gossip more consistently.
- Sync Strategy Adjustment: PRs: core/409
On the morning of Tuesday, March 31, PoC receipts went to close to 0 for a quite a long time. In addition to manual intervention, this PR was rushed into an image which was briefly betaed and then deployed to GA around 1:30PM PDT.