[colossus] Identify problems in the Storage Node
1. Add the storage node to the probe and assign it to the test channels only.
2. Enable tracing on the node.
3. Define and document the various storage events, especially the process for GET and POST events.
4. Identify the bottleneck for GET and POST events.
5. Propose a fix.
Issue: when retrieving a file, the storage node intermittently exceeds 5s. Data: https://grafana.joystream.yyagi.cloud/d/VbiCFzWMz/blackbox?orgId=1&refresh=5m&from=now-24h&to=now
Storage issue
Intro
As stated in the subject of this ticket, the reported issue is that the current blackbox exporter configuration probes a specific image file every 5 minutes in order to construct the data displayed on the SWG Grafana board. Occasionally, and not deterministically, the response time exceeds (or gets close to) 5s, which the prober classifies as a timeout. This issue appears consistently on almost all active Colossus nodes. The goals of this report are:
- explain why the issue is not predictable
- attempt to pinpoint the bottleneck in the codebase
- detail the combination of factors explaining why this is happening
Analysis
Load testing
The analysis uses Colossus version 3.8.1, and the tests have been executed using Postman with the following HTTP request:
GET https://23.88.65.164.nip.io/storage/api/v1/files/1343
Server specs: the server is running a CPU with 16 cores and 32 GB of RAM with 1 TB bandwidth, on Ubuntu 22.04 (Jammy Jellyfish).
I have used a 5-minute load test with 20 virtual users where:
- the first minute is used to ramp up the number of virtual users from 1 to 20
- for the remaining 4 minutes, all 20 virtual users keep sending the above request

Max response time has been measured, together with average response time. I have recorded a loom video covering most of the last 4 minutes and displaying:
- incoming and outgoing network traffic (bandwidth) using `nload` (bottom left)
- CPU and RAM usage using `htop` (top left)
- Colossus docker logs (center)
- Postman load test max response time graph (right)
Loom 📹 https://www.loom.com/share/ac1fdcca629343fc9472be48acf7c050?sid=6a9e8078-6fc5-4a42-87bb-da3a99638fe7
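For reference, a roughly equivalent scripted load test (a hypothetical TypeScript sketch, not the actual Postman collection used) could look like the following; the URL and timings mirror the setup above:

```typescript
// Hypothetical load-test sketch approximating the Postman run: ramp from 1 to 20
// virtual users over the first minute, then hold 20 users for 4 more minutes,
// recording max and average response times for the probed file.
const TARGET = 'https://23.88.65.164.nip.io/storage/api/v1/files/1343'
const RAMP_MS = 60_000
const HOLD_MS = 4 * 60_000
const MAX_VUS = 20

let maxMs = 0
let totalMs = 0
let requests = 0

async function virtualUser(stopAt: number): Promise<void> {
  while (Date.now() < stopAt) {
    const start = Date.now()
    // Consume the body so the measurement covers the full transfer.
    await fetch(TARGET).then((r) => r.arrayBuffer()).catch(() => undefined)
    const elapsed = Date.now() - start
    maxMs = Math.max(maxMs, elapsed)
    totalMs += elapsed
    requests += 1
  }
}

async function main(): Promise<void> {
  const stopAt = Date.now() + RAMP_MS + HOLD_MS
  const users: Promise<void>[] = []
  for (let i = 0; i < MAX_VUS; i++) {
    users.push(virtualUser(stopAt))
    // Stagger user start times so the ramp-up takes roughly one minute.
    await new Promise<void>((resolve) => setTimeout(resolve, RAMP_MS / MAX_VUS))
  }
  await Promise.all(users)
  console.log(`requests=${requests} avg=${(totalMs / requests).toFixed(0)}ms max=${maxMs}ms`)
}

void main()
```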
Key observations
- when the incoming traffic is high, the response time is low; this makes sense since every request is satisfied in a timely manner
- when the syncing process is running, the incoming traffic is low and, consequently, the response time for the test request is 10-20x what is observed when no sync is in progress
Colossus stack trace explanation
Main execution
Colossus is executed as a single-threaded worker Node.js process. This means the Node.js runtime uses only one of the 16 available cores at any given time (which core is decided by the OS scheduler, which is why in the loom video you see the load moving across cores). In particular, the two main competing functions started in the server entrypoint are:
- `app.listen`: the Express app that accepts HTTP requests (this is responsible for the incoming network traffic that you see in the nload interface)
- `runSyncWithInterval`: periodically (every 1 minute, the current default) executes a sequence of sync tasks
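Conceptually, the entrypoint looks something like the sketch below (simplified; the port number and the stubbed `runSyncWithInterval` body are assumptions, not the actual Colossus source):

```typescript
// Conceptual sketch of the Colossus entrypoint. Both functions share one
// single-threaded event loop, so CPU-heavy work in the sync path delays
// responses to incoming GET requests.
import express from 'express'

async function runSyncWithInterval(sleepIntervalMs: number): Promise<void> {
  // Periodic sync loop; its internals are sketched in the next section.
}

const app = express()

// Serves file requests; this drives the incoming traffic visible in nload.
app.get('/storage/api/v1/files/:id', (req, res) => {
  // ...stream the requested data object from disk...
  res.sendStatus(200)
})

app.listen(3333)

// Fire-and-forget: the sync loop runs on the same event loop as the API handlers.
void runSyncWithInterval(60_000)
```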
Syncing process
Without going into too many details, here is how `runSyncWithInterval` works. The `sleep` followed by `await performSync` ensures that `sleepInterval = 1 minute` elapses between `performSync` executions. This implies that the `performSync` start timestamp is `previous_performSync_start_timestamp + performSync_execution_time + 1min`, and since `performSync_execution_time` is variable, the prober will experience non-deterministic timeouts across time.
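Under these assumptions, the scheduling can be illustrated with a small sketch (the shape of the loop is inferred from the behaviour described above, not copied from the codebase):

```typescript
// Sketch of the scheduling behaviour described above. Because the 1-minute
// sleep sits between performSync runs, the next start time is
// previous_start + execution_time + 1min, so the window during which the
// event loop is busy drifts relative to the 5-minute probe.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function performSync(): Promise<void> {
  // Variable-duration reconciliation work (placeholder).
}

async function runSyncWithInterval(sleepIntervalMinutes: number): Promise<void> {
  const sleepIntervalMs = sleepIntervalMinutes * 60 * 1000
  while (true) {
    await sleep(sleepIntervalMs) // fixed 1-minute pause...
    await performSync()          // ...plus a sync run whose duration varies
  }
}

void runSyncWithInterval(1)
```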
The `performSync` function compares the local bucket state against the on-chain bucket state (coming from the QN) and prepares a list of object IDs to add and delete. Then, for each new asset to be added, it calls the `getPrepareDownloadTasks` factory method, which sets up the URLs of other storage nodes from which the assets are fetched (this ensures that assets are propagated between nodes hosting the same bucket). `getPrepareDownloadTasks` eventually causes the `PrepareDownloadTask.execute` async method to run, which internally uses `node-cache: 5.1.1` to store valid and reliable storage operator URLs (and also to mark unreliable ones); the asset file is then downloaded via `DownloadTask.execute`, which is what makes the outgoing network traffic rise.
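A rough sketch of that flow, with hypothetical helper parameters standing in for the real QN queries and download tasks, might look like this:

```typescript
// Rough sketch of the sync flow described above. The helper parameters are
// hypothetical; the real performSync / PrepareDownloadTask code differs in detail.
import NodeCache from 'node-cache'

// Cache of storage-operator URLs considered reliable for fetching assets
// (the real code uses node-cache 5.1.1 for a similar purpose).
const operatorUrlCache = new NodeCache({ stdTTL: 60 })

async function performSync(
  getLocalObjectIds: () => Promise<string[]>,
  getOnChainObjectIds: () => Promise<string[]>,
  resolveOperatorUrl: (objectId: string) => Promise<string>,
  downloadObject: (objectId: string, operatorUrl: string) => Promise<void>
): Promise<void> {
  const localIds = new Set(await getLocalObjectIds())   // objects already on disk
  const onChainIds = await getOnChainObjectIds()        // bucket state from the QN

  // Objects the bucket should contain but this node does not have yet.
  const toAdd = onChainIds.filter((id) => !localIds.has(id))

  for (const id of toAdd) {
    // Reuse a cached operator URL if one was already found to be reliable,
    // otherwise select one and remember it (this lookup/selection is where the
    // CPU bottleneck discussed below appears to sit).
    let url = operatorUrlCache.get<string>(id)
    if (!url) {
      url = await resolveOperatorUrl(id)
      operatorUrlCache.set(id, url)
    }
    await downloadObject(id, url) // drives the outgoing traffic seen in nload
  }
}
```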
CPU bottleneck
There are segments of the loom video where the outgoing and incoming network traffic are both high, suggesting that `DownloadTask.execute` does not interfere with the total network bandwidth. What emerges is that, during the syncing process, one core of the 16 available is under high load (possibly above 80%), and this is observable even when no GET requests are being issued. This suggests that `app.listen` and `runSyncWithInterval` are essentially competing for the same core, and specifically this bottleneck seems to be caused by `PrepareDownloadTask.execute`. Furthermore, inspecting the logs, the high load appears in conjunction with the cache lookup for valid operator URLs in `nodeCache` (inside `getRemoteDataObjects`), or possibly with the storage operator URL selection as a whole. In any case, the bottleneck appears to be in the code inside `PrepareDownloadTask.execute`.
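One way to confirm this kind of event-loop starvation independently of the logs (a diagnostic sketch, not something currently present in Colossus) is Node's built-in event-loop delay monitor:

```typescript
// Diagnostic sketch: measure how long the single event loop is blocked while
// the sync routine runs. A high p99/max delay here means GET handlers cannot
// be serviced promptly, matching the observed probe timeouts.
import { monitorEventLoopDelay } from 'perf_hooks'

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  // Values are reported in nanoseconds; convert to milliseconds for readability.
  const toMs = (ns: number) => (ns / 1e6).toFixed(1)
  console.log(
    `event-loop delay p50=${toMs(histogram.percentile(50))}ms ` +
      `p99=${toMs(histogram.percentile(99))}ms max=${toMs(histogram.max)}ms`
  )
  histogram.reset()
}, 60_000)
```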
Recommendation
I think at this point that the `app.listen` part (i.e. the "REST API process") could be executed on its own thread; this would eliminate the competition with the syncing routine for the same core. I have checked with @zeeshanakram3 and he assured me that CPU resource usage is a known problem in Colossus and that a fix is in the pipeline.
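A minimal illustration of that separation using Node's built-in worker_threads (an illustration of the idea only; file names, paths and the port are assumptions, and this is not necessarily the fix that is in the pipeline):

```typescript
// ---- index.ts (sketch) ----
// Keep the Express API on the main thread and move the sync routine to a
// worker thread so they no longer compete for the same core.
import { Worker } from 'worker_threads'
import express from 'express'

const app = express()
app.get('/storage/api/v1/files/:id', (req, res) => {
  // ...serve the data object exactly as before...
  res.sendStatus(200)
})
app.listen(3333)

// The sync loop now runs with its own event loop on a separate thread.
const syncWorker = new Worker('./sync-worker.js')
syncWorker.on('error', (err) => console.error('sync worker failed:', err))

// ---- sync-worker.ts (sketch) ----
// import { runSyncWithInterval } from './sync' // hypothetical module path
// void runSyncWithInterval(60_000)
```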
Nice work taking these measurements. Which version of colossus did you test against?
Could you repeat the same test for the latest version, v3.10.0?
I will try @mnaamani, with the blessing of @yasiryagi
If you haven't performed the benchmark yet, there is a v3.10.1 release docker image already published: `joystream/storage-node:3.10.1`
I have tried it with the latest Colossus and the problem seems to be gone; however, there is still some issue when the asset file hash is mismatched: https://www.loom.com/share/0e2ae18ec8fb410da80b3d37e065731b?sid=fcb2089e-3ab1-4eb1-acdf-48437f85e3af
Nice work. Just to clarify: how many gigahertz per core were used in the test? Were you able to reproduce the 5s delays? What is the consequent minimal server requirement for SPs in terms of CPU? How do you suggest improving the caching function for future huge numbers of objects?
> I have tried it with latest colossus and the problem seems to be gone, however there's still some issue when the asset file hash is mismatched
Can you share a bit more about the problem you're describing? It's not immediately apparent to me from the loom video