App Outage Post Mortem
This post-mortem describes the events leading up to the app outage on October 17, what happened, and what the team is doing moving forward.
On Friday October 15, the team deployed a new tileserver database to bring back the data-packet map layer on Explorer.
What is Tileserver? Tileserver is the database that generates hexagons for both App and Explorer. Whenever a Hotspot changes location, the tileserver gets updated and generates a new hexagon to display to users. This happens on a schedule.
This new tileserver had a feature called parallelization which generated unique hexagon IDs twice for efficiency and to reduce API calls. This parallelization feature did not work as expected on the mobile app and was crashing its native SDKs. The issue was not immediately apparent and the problem grew overtime, with the symptom identified as an app crash.
As it crashed for more users, the queries that ran on app launch, namely
hotspot rewards that calculates the 24h rewards for all user’s Hotspots, continued to retry, even when the app crashed. StakeJoy API and api.helium.io (Helium’s API) saw this as “spam” as it peaked at 900 requests per second on Sunday afternoon. Each app launch (and crash) effectively created a spammer that kept retrying the rewards query.
On Sunday Oct 17 around 4pm PST, Helium and StakeJoy rightly blocked the IP addresses that were spamming, that unbeknownst to them, was from wallet-api. Later that evening, a message was returned to the blocked IP addresses to contact the API provider to discuss next steps. This message, when received by wallet-api, could not be parsed. The messages piled up and crashed the servers.
What is wallet-api? Wallet-api is a service the app uses to query the API for data, to generate notifications, and act as an intermediary for discovery mode and transfer Hotspot.
Events on October 17, 2021
On Oct 17 at 8:30pm PST, the team issued a status update notifying the community of the app outage.
At 9:40pm PST, the team narrowed down the root cause (crashing native SDKs due to duplicate IDs) and recognized the wallet-api as the IP address making the hundreds of requests to both API endpoints. The team requested the IP-ban be lifted.
At 10:20pm PST, the team resolved the issue by pointing the app to the old tileserver (where parallelization did not exist, thus no duplicate IDs), saw the app launch without crashing, and requests to both API endpoints eased off.
- The bug fix to allow parallelization exists but is owned by a different team/company. The core devs at Helium will attempt to fork that fix and merge it into our own repositories, but doing so requires additional work. In the meantime, the team will continue to use the old tileserver to generate hexagons for the next few weeks. Note that tileserver updates every few hours to draw hexagons for newly added and asserted Hotspots - this updater will be running this evening. You may find that Hotspots are missing hexagons until this is done.
- Removed the rewards query retry logic so if it fails (API not available etc) it won’t retry. Users can manually force a retry by tabbing away from the Home tab, or closing/reopening the app.
Smarter Hotspot Rewards Query
- In addition to removing the retry logic, a future app release will introduce “Lazy Load Hotspot Rewards” - this means that the app will query Hotspot rewards as the Hotspot moves into focus on screen instead of loading it all at once on app launch. This should not have user-impact.
Add User Agent
- Helps API providers identify the service requesting data by adding a user agent to avoid accidentally being banned. The team considered a static IP for wallet-api, but doing so would be cost-prohibitive.
None of the above next steps require an app release aside from lazy loading Hotspot rewards. All other changes are being implemented and/or deployed.