Rework shard snapshot workers
Currently, when starting the shard snapshot tasks, we loop through the shards to be snapshotted back-to-back. For each shard this calculates whether there are changes to be snapshotted (and which), writes some metadata, and ONLY if there are changes and files to be uploaded, forks off the file upload calls. This could be further improved by parallelizing the shard snapshot tasks themselves (currently, only file uploads happen in parallel).
This change uses a worker pool and a queue to:
- Parallelize shard snapshotting with a limited number of workers, and limit the number of concurrently running snapshot tasks.
- Prioritize shard snapshot tasks over file snapshot tasks.
Closes #83408
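To illustrate the idea, here is a minimal sketch of the worker-pool-plus-queue approach described above; the class and member names are made up for this example and are not the actual code in this PR:

```java
// Illustrative sketch only, not the real implementation.
import java.util.concurrent.Executor;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class ThrottledPrioritizedRunner {
    // Lower value == higher priority; shard-level snapshot tasks run before file uploads.
    static final int SHARD_SNAPSHOT = 0;
    static final int FILE_UPLOAD = 1;

    private final int maxWorkers;      // bounded number of workers, e.g. the SNAPSHOT pool size
    private final Executor executor;   // the pool the workers run on
    private final AtomicInteger runningWorkers = new AtomicInteger();
    private final PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>();

    ThrottledPrioritizedRunner(int maxWorkers, Executor executor) {
        this.maxWorkers = maxWorkers;
        this.executor = executor;
    }

    void enqueue(int priority, Runnable runnable) {
        queue.add(new Task(priority, runnable));
        maybeSpawnWorker();
    }

    private void maybeSpawnWorker() {
        // Never put more than maxWorkers work items on the executor at once.
        int running = runningWorkers.get();
        while (running < maxWorkers) {
            if (runningWorkers.compareAndSet(running, running + 1)) {
                executor.execute(this::drainQueue);
                return;
            }
            running = runningWorkers.get();
        }
    }

    private void drainQueue() {
        try {
            Task task;
            while ((task = queue.poll()) != null) {
                task.runnable().run();
            }
        } finally {
            runningWorkers.decrementAndGet();
            if (queue.isEmpty() == false) {
                maybeSpawnWorker();   // something was enqueued after we saw an empty queue
            }
        }
    }

    private record Task(int priority, Runnable runnable) implements Comparable<Task> {
        @Override
        public int compareTo(Task other) {
            return Integer.compare(priority, other.priority());
        }
    }
}
```

Shard snapshot tasks go in with the higher priority, so a worker always prefers them over any already-queued file uploads; each worker keeps draining the queue until it is empty.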
Hi @pxsalehi, I've created a changelog YAML for you.
Pinging @elastic/es-distributed (Team:Distributed)
I wonder if this approach is necessarily better. IIUC, we currently keep the one thread looping until the end. In the new approach, we risk the first (or an early) shard filling up the snapshot queue before the last tasks are added to the queue. This could mean that shards with no changes get scheduled after large file uploads?
> This could mean that shards with no changes get scheduled after large file uploads?
We figured this was somewhat unlikely (to happen a lot) and outweighed by the fact that we now multi-thread the starting of shard snapshots (which is the main motivation for this change, as we realized that running this single-threaded could take a long time on large data nodes). But I guess if we want to be stricter about this, we could implement this differently by putting all the shards to start in a queue and then running threads (one of which would be the original thread that we fork from) that poll from that queue. That way we get even better ordering at the price of a little more code. Do you think it's worth it?
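Roughly, that alternative would look something like the following (all names are made up for illustration, not the actual SnapshotShardsService code):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

final class ShardStartQueue {
    static <T> void startAll(Iterable<T> shards, int workers, Executor executor, Consumer<T> startShardSnapshot) {
        Queue<T> pending = new ConcurrentLinkedQueue<>();
        shards.forEach(pending::add);

        Runnable drain = () -> {
            T shard;
            while ((shard = pending.poll()) != null) {
                // Metadata work; only forks off file uploads if the shard has changes.
                startShardSnapshot.accept(shard);
            }
        };

        // Fork workers - 1 additional threads ...
        for (int i = 1; i < workers; i++) {
            executor.execute(drain);
        }
        // ... and keep the original (forking) thread as the last worker.
        drain.run();
    }
}
```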
(Apologies for using too many vague/relative terms like many/few/large! But I need to somehow ask my question!)
For me, one missing piece in evaluating the worst-case impact of a file upload on unchanged shard snapshots is the following: could the size of single files (to be uploaded) be extremely large? What would be a realistic "average" case? Or is it hard to tell, since it could be wildly different depending on the settings/load? It seemed to me that many not-huge files to upload is the common case. (Also, another assumption from our talk: it seems that often during a snapshot most shards have no changes, and only a few of them need to upload files.)
We generally try not to fill the queue with all tasks; for instance, BlobStoreRepository.snapshotShard only schedules as many tasks as there are threads, I assume to allow other work to interleave? If we do all shards in parallel, we risk getting #shards * thread_pool_size tasks into the queue. It seems like the "throttling" we do per shard is no longer useful then? This could mean many minutes, if not hours, of no response on the SNAPSHOT thread pool. I think that could block restores, new followers, and single-index snapshots? Am I missing something essential here?
I think this would be true also if we queue up the shard level execution and only do n jobs in parallel.
> Am I missing something essential here?
You're not directly missing anything, actually. But I think there is a tradeoff here. This change makes it so that running a snapshot on a 10k-shard warm node may block the snapshot pool for ~15 min (assuming ~0.5s per shard snapshot and 5 snapshot threads max) vs. taking ~75 min to run a single snapshot without blocking the pool. Taking the 15 min over the 75 min and accepting the blocking seems like the right tradeoff to me, especially when it's this easy to implement? Also, note that on the delete side we're not being clever about any sort of hand-crafted work-stealing queue like we use for file uploads either, and as far as I'm aware no complaints about that have ever come in :)
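For what it's worth, the back-of-envelope arithmetic behind those two figures, as I read the stated assumptions (~10k shards, ~0.5 s per shard, 5 SNAPSHOT threads):

```
10,000 shards × ~0.45–0.5 s  ≈ 4,500–5,000 s ≈ 75–83 min on a single thread
divided across 5 threads     ≈   900–1,000 s ≈ 15–17 min, with the pool blocked meanwhile
```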
> I think this would be true also if we queue up the shard level execution and only do n jobs in parallel.
True, the advantage of that approach would be that we'd have a harder guarantee on running the metadata work right before the file uploads (and it's a little faster potentially I guess).
I'm trying to understand how much of this discussion relates to the original issue (https://github.com/elastic/elasticsearch/issues/83408).
If blocking the snapshot threadpool during a snapshot is a concern, then what would be an acceptable approach that addresses the issue and avoids blocking the threadpool? One thread going through the shard snapshot tasks, calculating which snapshots do not upload a file and doing all of those in the same thread, and then, again in the same thread, going through the shard snapshots that do have changes and only forking off the file uploads?
Basically, everything other than the file uploads should be done on the thread going through the shards, and actually in two passes? No snapshotting of unchanged shards in parallel? Considering that having many thousands of unchanged shard snapshots and only a few shards actually uploading files is a normal occurrence, as Armin mentioned, parallelizing the snapshotting of unchanged shards seems to reduce snapshot time, which I guess is an improvement.
My question is: if we do not want to parallelize shard snapshot tasks, would the approach I mentioned above bring any real improvement for the issue? Currently, all unchanged shard snapshots happen on the same thread, which doesn't seem that different from the two-pass approach!
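To make that two-pass, single-thread variant concrete, a rough sketch (types and method names are hypothetical, not the actual SnapshotShardsService/BlobStoreRepository code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

final class TwoPassShardSnapshotter {

    interface ShardTask {
        boolean hasFilesToUpload();
        void writeMetadataAndFinish();   // for unchanged shards: metadata only
        void writeMetadata();
        List<Runnable> fileUploads();
    }

    static void snapshotShards(List<ShardTask> shards, Executor uploadExecutor) {
        List<ShardTask> withUploads = new ArrayList<>();

        // Pass 1: handle every unchanged shard inline on this thread, defer the rest.
        for (ShardTask shard : shards) {
            if (shard.hasFilesToUpload() == false) {
                shard.writeMetadataAndFinish();
            } else {
                withUploads.add(shard);
            }
        }

        // Pass 2: still on the same thread, write metadata for the changed shards
        // and only fork off the actual file uploads.
        for (ShardTask shard : withUploads) {
            shard.writeMetadata();
            for (Runnable upload : shard.fileUploads()) {
                uploadExecutor.execute(upload);
            }
        }
    }
}
```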
Armin and I had a brief conversation on this, and while the existing mechanism for limiting the number of upload threads is somewhat broken, running the outer level tasks in parallel will slightly increase this brokenness. We prefer to instead fix this for good by maintaining enough data structures to just fill the snapshot pool with the right amount of work without exhausting it for potentially hours.
Something like 2 queues (the outer level work and the actual file upload work) and a counter for how many jobs are active should do. When any job finishes, it checks the outer level queue first, and once that is depleted it does the actual file upload jobs.
This also ensures that all no-change shards are done prior to any file uploads. I think Armin intends to have a sync chat with you on this.
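A minimal sketch of that two-queue-plus-counter idea, with made-up names (the final change may well differ in the details):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

final class TwoQueueSnapshotRunner {
    private final Queue<Runnable> shardTasks = new ConcurrentLinkedQueue<>();      // outer level work
    private final Queue<Runnable> fileUploadTasks = new ConcurrentLinkedQueue<>(); // actual uploads
    private final AtomicInteger activeJobs = new AtomicInteger();
    private final int maxActiveJobs;   // e.g. the SNAPSHOT thread pool size
    private final Executor executor;

    TwoQueueSnapshotRunner(int maxActiveJobs, Executor executor) {
        this.maxActiveJobs = maxActiveJobs;
        this.executor = executor;
    }

    void enqueueShardTask(Runnable task) {
        shardTasks.add(task);
        maybeRunMore();
    }

    void enqueueFileUpload(Runnable task) {
        fileUploadTasks.add(task);
        maybeRunMore();
    }

    private void maybeRunMore() {
        while (true) {
            int active = activeJobs.get();
            if (active >= maxActiveJobs || (shardTasks.isEmpty() && fileUploadTasks.isEmpty())) {
                return;
            }
            if (activeJobs.compareAndSet(active, active + 1)) {
                // One work item per executor slot, rather than a long-running drain loop.
                executor.execute(this::runOne);
                return;
            }
        }
    }

    private void runOne() {
        try {
            // Always check the outer level queue first; uploads only once it is depleted.
            Runnable task = shardTasks.poll();
            if (task == null) {
                task = fileUploadTasks.poll();
            }
            if (task != null) {
                task.run();
            }
        } finally {
            activeJobs.decrementAndGet();
            maybeRunMore();   // slot freed: pick up the next queued job, if any
        }
    }
}
```

The ordering guarantee (all no-change, outer level shard work before any already-queued upload) falls out of runOne polling the shard queue first on every slot.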
@elasticmachine please run elasticsearch-ci/part-1
(I think the test failure was unrelated to the PR. I opened https://github.com/elastic/elasticsearch/issues/88615.)
@elasticmachine please update branch
@elasticmachine update branch
I have also undone the changes to SnapshotShardsService in favour of https://github.com/elastic/elasticsearch/pull/88707.
@elasticmachine update branch
@henningandersen @original-brownbear Thank you for the reviews. Please have another look. I've addressed all comments.
I haven't added the case from Henning's code sample where we spawn directly if workers are available. IMO, it is an optimization for a very specific corner case that I'm not sure we need. Always queueing and polling seems simpler (unless it has some disadvantage I'm not aware of).
(Test failure doesn't seem to be related to the PR: https://github.com/elastic/elasticsearch/issues/89623)
@elasticmachine run elasticsearch-ci/part-1
@elasticmachine update branch (There is a fix in main for one of the snapshot ITs that was also failing in this PR.)
I looked into the previous failures, since they were both related to snapshotting. It seems ConcurrentSnapshotsIT.testMasterFailoverOnFinalizationLoop was failing due to recent changes on main; that is now fixed on main and it doesn't fail here anymore. However, CloneSnapshotIT.testRemoveFailedCloneFromCSWithQueuedSnapshotInProgress seems to fail rarely on this PR, and I'm not sure why. I can make it time out on a slow system, but can't tell whether the test's assumptions no longer match how we schedule the snapshot tasks, or whether there is an issue in the PR. It seems the snapshot cloning step of the test doesn't return and times out. I'd have to spend some more time on it.
CC: @original-brownbear maybe we could look into it together.
@henningandersen Thanks for the detailed feedback. I addressed all your comments. Please have another look. Meanwhile I'll look into that previous CI failure to see if it is related or I was just paranoid!
I am not able to reproduce that issue on the branch. For the record, the couple of times it happened it timed out on different asserts in the test (not just a specific one), and that was on a pretty slow system. I think it is safe to dismiss that.
@original-brownbear @henningandersen All done! Please check again.
Thanks Henning and Armin!