[colossus] Identify problems in the Storage Node
1. Add the storage node to the probe and assign it to the test channels only.
2. Enable tracing on the node.
3. Define and document the various storage events, especially the process for GET and POST events.
4. Identify the bottleneck for GET and POST events.
5. Propose a fix.
Issue: when retrieving a file, the storage node intermittently exceeds 5s. Data: https://grafana.joystream.yyagi.cloud/d/VbiCFzWMz/blackbox?orgId=1&refresh=5m&from=now-24h&to=now
Storage issue
Intro
As stated in the subject of this ticket, the reported issue is that the current blackbox exporter configuration probes a specific image file every 5 minutes in order to construct the data displayed on the SWG Grafana board. Occasionally, and not deterministically, the response time exceeds (or gets close to) 5s, which the prober classifies as a timeout. This issue appears consistently on almost all active Colossus nodes. The goals of this report are:
- explain why the issue is not predictable
- attempt to pinpoint the bottleneck in the codebase
- detail the combination of factors explaining why this is happening
Analysis
Load testing
The analysis uses Colossus version 3.8.1, and the tests have been executed using Postman with the following HTTP request:
GET https://23.88.65.164.nip.io/storage/api/v1/files/1343
Server specs: the server is running a CPU with 16 cores and 32 GB of RAM with 1 TB bandwidth, on Ubuntu 22.04 (Jammy Jellyfish).
I have used a 5-minute load test with 20 virtual users where:
- the first minute is used to ramp up the number of virtual users from 1 to 20
- for the remaining 4 minutes, all 20 virtual users keep sending the above request

Max response time has been measured, together with average response time. I have recorded a loom video covering most of the last 4 minutes and displaying:
- incoming and outgoing network traffic (bandwidth) using `nload` (bottom left)
- CPU and RAM usage using `htop` (top left)
- Colossus docker logs (center)
- Postman load test max response time graph (right)
Loom 📹 https://www.loom.com/share/ac1fdcca629343fc9472be48acf7c050?sid=6a9e8078-6fc5-4a42-87bb-da3a99638fe7
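For reference, a roughly equivalent scripted load test (a hypothetical TypeScript sketch, not the actual Postman collection used) could look like the following; the URL and timings mirror the setup above:

```typescript
// Hypothetical load-test sketch approximating the Postman run: ramp from 1 to 20
// virtual users over the first minute, then hold 20 users for 4 more minutes,
// recording max and average response times for the probed file.
const TARGET = 'https://23.88.65.164.nip.io/storage/api/v1/files/1343'
const RAMP_MS = 60_000
const HOLD_MS = 4 * 60_000
const MAX_VUS = 20

let maxMs = 0
let totalMs = 0
let requests = 0

async function virtualUser(stopAt: number): Promise<void> {
  while (Date.now() < stopAt) {
    const start = Date.now()
    // Consume the body so the measurement covers the full transfer.
    await fetch(TARGET).then((r) => r.arrayBuffer()).catch(() => undefined)
    const elapsed = Date.now() - start
    maxMs = Math.max(maxMs, elapsed)
    totalMs += elapsed
    requests += 1
  }
}

async function main(): Promise<void> {
  const stopAt = Date.now() + RAMP_MS + HOLD_MS
  const users: Promise<void>[] = []
  for (let i = 0; i < MAX_VUS; i++) {
    users.push(virtualUser(stopAt))
    // Stagger user start times so the ramp-up takes roughly one minute.
    await new Promise<void>((resolve) => setTimeout(resolve, RAMP_MS / MAX_VUS))
  }
  await Promise.all(users)
  console.log(`requests=${requests} avg=${(totalMs / requests).toFixed(0)}ms max=${maxMs}ms`)
}

void main()
```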
Key observations
- when the incoming traffic is high, the response time is low; this makes sense since every request is satisfied in a timely manner
- when the syncing process is running, the incoming traffic is low and, consequently, the response time for the test request is 10-20x what is observed when no sync is in progress
Colossus stack trace explanation
Main execution
Colossus is executed as a single-threaded worker Node.js process. This means the Node.js runtime uses only one of the 16 available cores at any given time (which core is decided by the OS scheduler, which is why in the loom video you see the load moving across cores). In particular, the two main competing functions started in the server entrypoint are:
- `app.listen`: the Express app that accepts HTTP requests (this is responsible for the incoming network traffic that you see in the nload interface)
- `runSyncWithInterval`: periodically (every 1 minute, the current default) executes a sequence of sync tasks
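Conceptually, the entrypoint looks something like the sketch below (simplified; the port number and the stubbed `runSyncWithInterval` body are assumptions, not the actual Colossus source):

```typescript
// Conceptual sketch of the Colossus entrypoint. Both functions share one
// single-threaded event loop, so CPU-heavy work in the sync path delays
// responses to incoming GET requests.
import express from 'express'

async function runSyncWithInterval(sleepIntervalMs: number): Promise<void> {
  // Periodic sync loop; its internals are sketched in the next section.
}

const app = express()

// Serves file requests; this drives the incoming traffic visible in nload.
app.get('/storage/api/v1/files/:id', (req, res) => {
  // ...stream the requested data object from disk...
  res.sendStatus(200)
})

app.listen(3333)

// Fire-and-forget: the sync loop runs on the same event loop as the API handlers.
void runSyncWithInterval(60_000)
```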
Syncing process
Without going into too many details, here is how `runSyncWithInterval` works. The `sleep` followed by `await performSync` ensures that `sleepInterval = 1 minute` elapses between `performSync` executions. This implies that the `performSync` start timestamp is `previous_performSync_start_timestamp + performSync_execution_time + 1min`, and since `performSync_execution_time` is variable, the prober will experience non-deterministic timeouts across time.
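Under these assumptions, the scheduling can be illustrated with a small sketch (the shape of the loop is inferred from the behaviour described above, not copied from the codebase):

```typescript
// Sketch of the scheduling behaviour described above. Because the 1-minute
// sleep sits between performSync runs, the next start time is
// previous_start + execution_time + 1min, so the window during which the
// event loop is busy drifts relative to the 5-minute probe.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function performSync(): Promise<void> {
  // Variable-duration reconciliation work (placeholder).
}

async function runSyncWithInterval(sleepIntervalMinutes: number): Promise<void> {
  const sleepIntervalMs = sleepIntervalMinutes * 60 * 1000
  while (true) {
    await sleep(sleepIntervalMs) // fixed 1-minute pause...
    await performSync()          // ...plus a sync run whose duration varies
  }
}

void runSyncWithInterval(1)
```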
The `performSync` function compares the local bucket state against the on-chain bucket state (coming from the QN) and prepares a list of object IDs to add and delete. Then, for each new asset to be added, it calls the `getPrepareDownloadTasks` factory method, which sets up the URLs of other storage nodes from which the assets are fetched (this ensures that assets are propagated between nodes hosting the same bucket). `getPrepareDownloadTasks` eventually causes the `PrepareDownloadTask.execute` async method to run, which internally uses `node-cache: 5.1.1` to store valid and reliable storage operator URLs (and also to mark unreliable ones); the asset file is then downloaded via `DownloadTask.execute`, which is what makes the outgoing network traffic rise.
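A rough sketch of that flow, with hypothetical helper parameters standing in for the real QN queries and download tasks, might look like this:

```typescript
// Rough sketch of the sync flow described above. The helper parameters are
// hypothetical; the real performSync / PrepareDownloadTask code differs in detail.
import NodeCache from 'node-cache'

// Cache of storage-operator URLs considered reliable for fetching assets
// (the real code uses node-cache 5.1.1 for a similar purpose).
const operatorUrlCache = new NodeCache({ stdTTL: 60 })

async function performSync(
  getLocalObjectIds: () => Promise<string[]>,
  getOnChainObjectIds: () => Promise<string[]>,
  resolveOperatorUrl: (objectId: string) => Promise<string>,
  downloadObject: (objectId: string, operatorUrl: string) => Promise<void>
): Promise<void> {
  const localIds = new Set(await getLocalObjectIds())   // objects already on disk
  const onChainIds = await getOnChainObjectIds()        // bucket state from the QN

  // Objects the bucket should contain but this node does not have yet.
  const toAdd = onChainIds.filter((id) => !localIds.has(id))

  for (const id of toAdd) {
    // Reuse a cached operator URL if one was already found to be reliable,
    // otherwise select one and remember it (this lookup/selection is where the
    // CPU bottleneck discussed below appears to sit).
    let url = operatorUrlCache.get<string>(id)
    if (!url) {
      url = await resolveOperatorUrl(id)
      operatorUrlCache.set(id, url)
    }
    await downloadObject(id, url) // drives the outgoing traffic seen in nload
  }
}
```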
CPU bottleneck
There are segments of the loom video where the outgoing and incoming network traffic are both high, suggesting that `DownloadTask.execute` does not interfere with the total network bandwidth. What emerges is that, during the syncing process, one core of the 16 available is under high load (possibly above 80%), and this is observable even when no GET requests are being issued. This suggests that `app.listen` and `runSyncWithInterval` are essentially competing for the same core, and specifically this bottleneck seems to be caused by `PrepareDownloadTask.execute`. Furthermore, inspecting the logs, the high load appears in conjunction with the cache lookup for valid operator URLs in `nodeCache` (inside `getRemoteDataObjects`), or possibly with the storage operator URL selection as a whole. In any case, the bottleneck appears to be in the code inside `PrepareDownloadTask.execute`.
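One way to confirm this kind of event-loop starvation independently of the logs (a diagnostic sketch, not something currently present in Colossus) is Node's built-in event-loop delay monitor:

```typescript
// Diagnostic sketch: measure how long the single event loop is blocked while
// the sync routine runs. A high p99/max delay here means GET handlers cannot
// be serviced promptly, matching the observed probe timeouts.
import { monitorEventLoopDelay } from 'perf_hooks'

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  // Values are reported in nanoseconds; convert to milliseconds for readability.
  const toMs = (ns: number) => (ns / 1e6).toFixed(1)
  console.log(
    `event-loop delay p50=${toMs(histogram.percentile(50))}ms ` +
      `p99=${toMs(histogram.percentile(99))}ms max=${toMs(histogram.max)}ms`
  )
  histogram.reset()
}, 60_000)
```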
Recommendation
I think at this point that the `app.listen` part (i.e. the "REST API process") could be executed on its own thread; this would eliminate the competition with the syncing routine for the same core. I have checked with @zeeshanakram3 and he assured me that CPU resource usage is a known problem in Colossus and that a fix is in the pipeline.
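A minimal illustration of that separation using Node's built-in worker_threads (an illustration of the idea only; file names, paths and the port are assumptions, and this is not necessarily the fix that is in the pipeline):

```typescript
// ---- index.ts (sketch) ----
// Keep the Express API on the main thread and move the sync routine to a
// worker thread so they no longer compete for the same core.
import { Worker } from 'worker_threads'
import express from 'express'

const app = express()
app.get('/storage/api/v1/files/:id', (req, res) => {
  // ...serve the data object exactly as before...
  res.sendStatus(200)
})
app.listen(3333)

// The sync loop now runs with its own event loop on a separate thread.
const syncWorker = new Worker('./sync-worker.js')
syncWorker.on('error', (err) => console.error('sync worker failed:', err))

// ---- sync-worker.ts (sketch) ----
// import { runSyncWithInterval } from './sync' // hypothetical module path
// void runSyncWithInterval(60_000)
```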
Nice work taking these measurements. Which version of colossus did you test against?
Could you repeat the same test for the latest version, v3.10.0?
I will try @mnaamani, with the blessing of @yasiryagi
If you haven't performed the benchmark yet, there is a v3.10.1 release docker image already published: `joystream/storage-node:3.10.1`
I have tried it with the latest Colossus and the problem seems to be gone; however, there is still some issue when the asset file hash is mismatched: https://www.loom.com/share/0e2ae18ec8fb410da80b3d37e065731b?sid=fcb2089e-3ab1-4eb1-acdf-48437f85e3af
Nice work. Just to clarify: how many gigahertz per core were used in the test? Were you able to reproduce the 5s delays? What is the consequent minimal server requirement for SPs in terms of CPU? How do you suggest improving the caching function for future huge numbers of objects?
> I have tried it with latest colossus and the problem seems to be gone, however there's still some issue when the asset file hash is mismatched
Can you share a bit more about the problem you're describing? It's not immediately apparent to me from the loom video