
[Access] Resource contention from local data lookups

Open peterargue opened this issue 1 year ago • 7 comments

Problem Definition

Recently the public nodes encountered high response times. This started after rolling out support for using local data for transaction results. At its worst, the average response time for impacted nodes was around 1 second, roughly a 20x increase over normal behavior.

We also observed that writes to the db were slow. Some nodes fell several hundred blocks behind indexing collections. Collections are relatively large data structures and require multiple writes per transaction.

Here's a CPU profile from one of the nodes at the time, zooming into the API traces: [image]

Most notable is that all of the endpoints spend a majority of their CPU time in runtime.asyncPreempt within badgerDB seeks. runtime.asyncPreempt indicates that a goroutine was forcibly preempted during execution, which typically happens in long-running loops that never yield to the scheduler. In our case, the loops were converting blobs of data read from the db from bytes to uint32 during seeks.
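
For illustration, here's a minimal sketch of the kind of tight badger iterator loop that produces this pattern; the key prefix and uint32 decoding are made up for the example, not the actual flow-go code:

```go
// Illustrative only: a tight Badger seek/iterate loop that decodes uint32
// values without yielding, similar in shape to the hot path seen in the profile.
package main

import (
	"encoding/binary"

	badger "github.com/dgraph-io/badger/v2"
)

func sumUint32Values(db *badger.DB, prefix []byte) (uint64, error) {
	var total uint64
	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.Prefix = prefix
		it := txn.NewIterator(opts)
		defer it.Close()

		// Long-running loops like this never block or yield, so the runtime
		// eventually preempts them asynchronously (runtime.asyncPreempt).
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			err := it.Item().Value(func(val []byte) error {
				total += uint64(binary.BigEndian.Uint32(val))
				return nil
			})
			if err != nil {
				return err
			}
		}
		return nil
	})
	return total, err
}
```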

Some observations

  • Large numbers of concurrent reads/writes to the db are causing generally slow responses for all queries
  • Most of the slowness is from database reads (seeks)
  • Events for a block are looked up for both event and tx result queries, so caching could make a significant difference (see the sketch after this list)
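
As a sketch of the caching idea from the last item above: a shared per-block events cache that both the events and tx results endpoints could read through. The hashicorp LRU and the types/names here are assumptions for illustration, not existing flow-go code.

```go
// Sketch only: a shared LRU of events keyed by block ID, so that event and
// transaction-result queries for the same block hit the cache after the first read.
package events

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

// Event and the lookup function are placeholders for the real storage types.
type Event struct {
	TxID  string
	Index uint32
	Data  []byte
}

type CachedEvents struct {
	cache  *lru.Cache[string, []Event]
	lookup func(blockID string) ([]Event, error) // falls back to the database
}

func NewCachedEvents(size int, lookup func(string) ([]Event, error)) (*CachedEvents, error) {
	c, err := lru.New[string, []Event](size)
	if err != nil {
		return nil, err
	}
	return &CachedEvents{cache: c, lookup: lookup}, nil
}

func (c *CachedEvents) ByBlockID(blockID string) ([]Event, error) {
	if events, ok := c.cache.Get(blockID); ok {
		return events, nil
	}
	events, err := c.lookup(blockID)
	if err != nil {
		return nil, err
	}
	c.cache.Add(blockID, events)
	return events, nil
}
```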

Needs further investigation:

  • Is GC causing this, or making it worse?
  • What do the mutex/block traces look like? (see the profiling sketch after this list)
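
On the mutex/block question, this is roughly the setup that makes those profiles available; the sampling rates below are arbitrary.

```go
// Sketch: enable mutex and block profiling so /debug/pprof/mutex and
// /debug/pprof/block return useful data. Sampling rates below are arbitrary.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"runtime"
)

func main() {
	// Sample roughly 1 in 5 mutex contention events.
	runtime.SetMutexProfileFraction(5)
	// Sample roughly one blocking event per 100µs of blocked time (value is in nanoseconds).
	runtime.SetBlockProfileRate(100_000)

	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```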

Takeaways

The new local data features make resource contention more likely, especially on high-throughput nodes like the public endpoints.

Experiment with the following:

  • [x] Spread the request volume across a larger number of nodes. The reduced requests per node should reduce the contention
  • [x] #5754
  • [ ] Investigate caching in the tx results endpoint
  • [ ] Investigate go GC tuning for ANs (see the GC tuning sketch after this list)
  • [ ] Look into disk configurations and options to use lower latency drives
  • [ ] Increase the number of CPU cores per VM
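
On the GC tuning item, a sketch of the knobs that would be in play; the values below are placeholders to experiment with, not recommendations.

```go
// Sketch: the two main GC tuning knobs for a memory-heavy access node.
// Values are placeholders for experimentation, not recommendations.
package main

import (
	"runtime/debug"
)

func tuneGC() {
	// Equivalent to GOGC=200: let the heap grow 200% over the live set
	// before the next cycle, trading memory for fewer GC cycles.
	debug.SetGCPercent(200)

	// Equivalent to GOMEMLIMIT (Go 1.19+): a soft cap so the less frequent
	// GC cycles cannot push the node into swap/OOM territory.
	debug.SetMemoryLimit(48 << 30) // 48 GiB
}
```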

peterargue · Apr 19 '24 20:04

I remember commenting on this (or creating an issue) but somehow couldn't find it now. (edit: found https://github.com/onflow/flow-go/issues/961)

Here is what I remember, maybe it helps. There is a cache on the storage layer, it is an LRU cache, and it currently holds 1000 blocks. What happens is that when someone requests an old height range (e.g. someone indexing old blocks requests the max number of blocks, 250 iirc), they evict recent blocks from the cache. This may be causing requests for recent data to go to badger instead of the cache.

bluesign · Apr 20 '24 16:04

> I remember commenting on this (or creating an issue) but somehow couldn't find it now. (edit: found #961)
>
> Here is what I remember, maybe it helps. There is a cache on the storage layer, it is an LRU cache, and it currently holds 1000 blocks. What happens is that when someone requests an old height range (e.g. someone indexing old blocks requests the max number of blocks, 250 iirc), they evict recent blocks from the cache. This may be causing requests for recent data to go to badger instead of the cache.

yea, that's very likely causing issues here. I think a solution like withNoCacheOnRetrieve could be helpful. I'll do some experimentation.
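
Roughly the shape such an option could take; this is a hypothetical sketch, not the actual flow-go storage cache:

```go
// Hypothetical sketch of a "no cache on retrieve" read path; names and types
// are illustrative, not the real flow-go storage layer.
package storage

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

type Cache[K comparable, V any] struct {
	lru      *lru.Cache[K, V]
	retrieve func(K) (V, error) // backing database read
}

// Get reads through the LRU and caches the result (the existing behavior).
func (c *Cache[K, V]) Get(key K) (V, error) {
	if v, ok := c.lru.Get(key); ok {
		return v, nil
	}
	v, err := c.retrieve(key)
	if err == nil {
		c.lru.Add(key, v)
	}
	return v, err
}

// GetNoCache serves historical/backfill reads: it still checks the LRU, but a
// miss does not populate it, so large range scans cannot evict recent entries.
func (c *Cache[K, V]) GetNoCache(key K) (V, error) {
	if v, ok := c.lru.Get(key); ok {
		return v, nil
	}
	return c.retrieve(key)
}
```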

peterargue · Apr 22 '24 18:04

Another interesting section from a CPU profile: [image]

This shows bitswap using 18% of the node's CPU to calculate the size of blobs. It's possible to add a cache for bitswap blobs, which should reduce the db lookups needed for this operation.
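
Roughly what a size cache in front of the blobstore could look like; the Blobstore interface below is a trimmed stand-in for the real one, not existing flow-go code.

```go
// Sketch: wrap the blobstore so bitswap's repeated GetSize calls for recent
// blobs are answered from a small LRU instead of Badger seeks.
// The interface below is a trimmed stand-in, not the real blobstore API.
package blobcache

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

type Blobstore interface {
	GetSize(key string) (int, error)
}

type sizeCachingBlobstore struct {
	inner Blobstore
	sizes *lru.Cache[string, int]
}

func NewSizeCachingBlobstore(inner Blobstore, cacheSize int) (Blobstore, error) {
	c, err := lru.New[string, int](cacheSize)
	if err != nil {
		return nil, err
	}
	return &sizeCachingBlobstore{inner: inner, sizes: c}, nil
}

func (b *sizeCachingBlobstore) GetSize(key string) (int, error) {
	if size, ok := b.sizes.Get(key); ok {
		return size, nil
	}
	size, err := b.inner.GetSize(key)
	if err == nil {
		b.sizes.Add(key, size)
	}
	return size, err
}
```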

peterargue · Apr 28 '24 16:04

Profiles from one of the ANs experiencing slow response times:

CPU: [image]

Note:

  • 17% of CPU is used by bitswap looking up the size of recent data blobs
  • 10% is used for DHT related activities
  • 14% is used by go garbage collection (this varies over time)
  • 30% is used by the APIs (including profiling)

Alloc Space: [image]

Note:

  • 52% of allocations are from syncing execution data (this is typical)
  • 14% is from DHT operations
  • 18% is from API requests (including profiling)

Inuse Space: [image]

Note:

  • 50% of memory is used by badger for the L0 cache
  • 30% is used by badger for compaction
  • 6% is used by the DHT to store the provider map
  • Less than 5% is used by the APIs (including profiling)

Observations

The 17% of CPU used by bitswap looking up the size of recent data blobs can be eliminated by enabling a cache on the blobstore.

The 10% of CPU spent on DHT operations, plus some of the garbage collection needed to free DHT allocations, could be eliminated by disabling the DHT.
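
If we go the route of disabling the DHT, the gating could look roughly like this; the enableDHT flag and wiring are hypothetical, not an existing flow-go option.

```go
// Hypothetical sketch: only construct and bootstrap the kad-DHT when enabled,
// so nodes that don't need content routing skip its CPU and allocation cost.
// The enableDHT flag and wiring are illustrative, not an existing flow-go option.
package main

import (
	"context"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/routing"
)

func buildHost(ctx context.Context, enableDHT bool) (host.Host, routing.Routing, error) {
	h, err := libp2p.New()
	if err != nil {
		return nil, nil, err
	}

	if !enableDHT {
		// No DHT: no provider records to maintain, no routing table refreshes.
		return h, nil, nil
	}

	kdht, err := dht.New(ctx, h, dht.Mode(dht.ModeClient))
	if err != nil {
		return nil, nil, err
	}
	return h, kdht, nil
}
```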

peterargue · Apr 28 '24 16:04

Here's an execution trace from a node experiencing slow response times: [image]

Note: while there are large blocks of time where a majority of the available cores are used by long running tasks, there are always several available cores, and many that are relatively underutilized.

peterargue · Apr 28 '24 17:04

I attempted to reproduce this on testnet pushing one AN to 87% CPU utilization with:

  • 1000 requests per second across calls to get blocks, events, tx results, and accounts.
  • 200 execution data streams

This node was configured to use local data for scripts, events and tx results.
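
For reference, a minimal sketch of the rate-limited load generator shape used for a test like this; sendRequest is a placeholder for the actual gRPC/REST calls.

```go
// Sketch: drive a fixed request rate against one access node.
// sendRequest is a placeholder for the real gRPC/REST calls
// (get block, events, tx results, accounts).
package main

import (
	"context"
	"log"

	"golang.org/x/time/rate"
)

func runLoad(ctx context.Context, targetRPS int, workers int, sendRequest func(context.Context) error) {
	// Burst of 1 keeps the request stream smooth at the target rate.
	limiter := rate.NewLimiter(rate.Limit(targetRPS), 1)

	for i := 0; i < workers; i++ {
		go func() {
			for {
				if err := limiter.Wait(ctx); err != nil {
					return // context cancelled
				}
				if err := sendRequest(ctx); err != nil {
					log.Printf("request failed: %v", err)
				}
			}
		}()
	}

	// Block until the test is cancelled.
	<-ctx.Done()
}
```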

During this test, the p90 gRPC response time increased from 6ms to 30ms and the REST response time from 5ms to 25ms. Response times did not run away to hundreds of milliseconds.

I also monitored profiles and didn't see runtime.asyncPreempt show up in any of the API CPU samples.

Update:

I also attempted to reproduce this on mainnet.

  • Peak rps: 800
  • Peak p90 response time: 350ms
  • Peak CPU: 97%

Peak latency occurred when CPU utilization was high. Profiles from that time show that most time was spent handling data conversions in the API requests rather than db seeks. None of the markers of the issue on QN seemed to occur.

peterargue · Apr 28 '24 17:04

Next steps:

  • Experiment with disabling DHT on mainnet nodes (lower baseline CPU)
  • Experiment with directing public traffic to a Flow Foundation access node to test if the issue is specific to node infra/config.

peterargue · Apr 28 '24 17:04

We disabled the DHT on several nodes, but were never able to reproduce this issue on the Flow Foundation ANs. After migrating to new infra, profiles from the QN nodes also stopped showing the contention. I'm going to close this out for now, but may reopen if it comes up again.

peterargue · May 29 '24 23:05