
[epic] services/horizon: reduce history archive egress cost caused by horizon

Open mollykarcher opened this issue 1 year ago • 1 comments

What is the problem?

Egress costs for history archives are currently the highest/most burdensome cost for validators. After some investigation into which clients were downloading these files, we discovered that Horizon accounts for an unreasonably large portion of this egress. As a comparison, the captive core running on the same machine uses ~15MB of bandwidth to download the same files that Horizon uses ~3.7TB of bandwidth to download.

See the following 2 threads for additional context:

  • https://stellarfoundation.slack.com/archives/C02B04RMK/p1705104924671569
  • https://stellarfoundation.slack.com/archives/G01FUHTUQ8Z/p1705106425831119

What would you like to see?

Overall, a root cause analysis. We should definitely determine whether just Horizon (or also RPC / Hubble / etc.) is affected.

  • [x] Add explicit prometheus metrics for history archive access on the go/Horizon side
    • [x] Horizon: https://github.com/stellar/go/issues/5161
    • [x] RPC: https://github.com/stellar/soroban-rpc/issues/8
    • [x] Hubble: (jira) @chowbao
  • [x] Include visualizations for the above metrics in the Horizon dashboard [puppet-v4/3656] (and RPC dashboard, if it is also affected)
  • [x] Depending on root cause/culprit (analysis here)
    • [x] Remove unnecessary HA downloads
    • [x] Locally cache/save bucket list files temporarily on-disk to reduce bandwidth costs: #5165
  • [ ] Add test coverage
  • [x] Update our production deployment to use history_archive_urls for all nodes in our quorum set for Horizon, RPC, and Hubble (currently, we just use 1). horizon deployments done puppet/3651 and puppet/3642
    • [x] horizon: https://github.com/stellar/go/issues/5164
    • [x] RPC testnet and pubnet
  • [x] Validate ArchivePool random access behavior is working given we've never exercised it in our deployment due to the above (encapsulated in https://github.com/stellar/go/issues/5164)
  • [ ] Assess/update any external documentation or example configs where we are explicitly setting history_archive_urls to be a single url and/or just SDF's nodes
  • [ ] Lower state verification frequency to 1x/hour

What alternatives are there?

Given that the majority of this is incurred on files that are part of the current/active bucketlist, we believe this may be related to state verification. There are configuration values which allow us to run state verification less frequently. However, this should be considered a stop-gap/interim solution, as we will never have control over the configuration values used by external Horizon operators.

mollykarcher Jan 16 '24 16:01

Add test coverage

I think this requirement is already met in the go repo by the archive pool unit tests, which have increased coverage of archive pool user agent usage and archive pool caching.

sreuland Feb 13 '24 17:02

Noticed this lingering in progress with two sub-tasks remaining. @Shaptic, it looks like you had initiated some work on "Lower state verification frequency to 1x/hour" with https://github.com/stellar/puppet-v4/pull/3729

I can take the Assess/update any external documentation or example configs... sub-task

sreuland Mar 26 '24 17:03

@sreuland nice - I bumped the thread to get that merged, thanks!

Shaptic Mar 26 '24 19:03

Assess/update any external documentation or example configs where we are explicitly setting history_archive_urls to be a single url and/or just SDF's nodes

I'm actually going to close this epic with this still "open". I took a cursory pass through the public developer docs and we're good there. We do have some things in internal/testing configs that people could be looking at. But more importantly, we're going to come back to this with more weight behind it. This didn't get selected for Q2, but there is going to be a broader initiative/OKR under governance (likely in Q3/Q4) that specifically focuses on upping the quality of other validators' archives and pushing people off of just using ours. So we'll revisit all of this then.

mollykarcher Apr 09 '24 15:04