scylla-manager
New parameter for the repair task defining the max number of repair jobs per node.
We would like to introduce the possibility to increase the maximum number of repair jobs executed against a single Scylla node via a repair task parameter.
Right now, the limit is a single repair job per node. We want to keep that as the default, but it must be possible to increase this value. The parameter must accept integer values > 0 ({1, 2, 3, ...}).
Whenever the user sets the max number of repair jobs per Scylla node to a value higher than 0, scylla-manager won't respect the limit defined by https://manager.docs.scylladb.com/stable/repair/index.html#maximal-effective-parallelism . The parallelism won't be narrowed down to meet what is defined in the maximal-effective-parallelism definition.
Name of the parameter / flag -> max-jobs-per-node.
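For illustration, usage could look roughly like this once the flag is implemented (the flag itself is the proposal of this issue, not an existing sctool option):

```
# allow up to 2 concurrent repair jobs on each node for this repair task
sctool repair -c prod-cluster --max-jobs-per-node 2
```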
Is this a revert of the Manager 3.2 update?
> "This release changes the parallelism and order of the repair job for better performance and stability. The following changes have been made: * Only one repair job is running on any host at any given time." ...
@tzach it's not a revert. Manager < 3.2 didn't respect any limits in terms of parallelism per node.
When we were discussing repair changes in 3.2, one of the ideas was to provide a flag specifying the max number of repair jobs that can be sent to a single node. See https://github.com/scylladb/scylla-manager/issues/3425#issuecomment-1636013035 . In the end we didn't introduce the flag, but we decided to hard-limit it to 1. See https://github.com/scylladb/scylla-manager/issues/3425#issuecomment-1663614205
Now it seems that some customers would like to repair more aggressively, even though they need to sacrifice the performance of the cluster during the repair. This brings us back to the idea of a flag defining the max number of parallel repair jobs per node.
As this is another parameter controlling repair speed, it should also be integrated with the repair control command.
What is the expected behavior when setting --parallel 3 and --max-jobs-per-node 3 in a cluster where the total maximal parallelism (with 1 job per 1 node rule) is 10?
From the basic flag description, it would look like scheduling 3 repairs on 1 replica set is allowed even though there would be other, underutilized replica sets.
So do we even want to allow setting --max-jobs-per-node when --parallel isn't set to 0? I would say that this does not make sense and if someone wanted to run repair faster, they should first set the --parallel 0 (which is the default) and only then try to increase --max-jobs-per-node.
> if someone wanted to run repair faster, they should first set the --parallel 0 (which is the default) and only then try to increase --max-jobs-per-node.
Makes sense.
@asias please, take a look. IMO this feature makes no sense at all. AFAIU if you schedule more token ranges than max_parallel_ranges they would simply queue up and scylla won't allow more than max_parallel_ranges being repaired at the same time. Is this correct?
> @asias please, take a look. IMO this feature makes no sense at all. AFAIU if you schedule more token ranges than max_parallel_ranges they would simply queue up and scylla won't allow more than max_parallel_ranges being repaired at the same time. Is this correct?
For example:
Assume max_repair_ranges_in_parallel = 5
repair job 1: 10 ranges to repair
At some point, job 1 could have finished 8 ranges, only 2 remaining ranges are being worked on.
If we allow max-jobs-per-node = 2, we could start repair job 2 with 10 ranges to repair, so could repair the extra 3 ranges in parallel which is allowed by max_repair_ranges_in_parallel.
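A minimal sketch of the arithmetic in this example (all numbers come from the example above):

```go
package main

import "fmt"

func main() {
	const maxRepairRangesInParallel = 5 // per-node limit from the example

	inFlight := 2 // job 1 is down to its last 2 ranges
	spare := maxRepairRangesInParallel - inFlight

	// With max-jobs-per-node = 1 the spare slots stay idle;
	// with max-jobs-per-node = 2 a second job could use them.
	fmt.Printf("idle range slots a second job could use: %d\n", spare) // prints 3
}
```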
@asias Do we want to put some hard limit on the max-jobs-per-node flag?
From this explanation it seems like setting it to, let's say, 10 would be overkill that could cause some unexpected failures on the Scylla side.
> @asias please, take a look. IMO this feature makes no sense at all. AFAIU if you schedule more token ranges than max_parallel_ranges they would simply queue up and scylla won't allow more than max_parallel_ranges being repaired at the same time. Is this correct?

> For example:
> Assume max_repair_ranges_in_parallel = 5
> repair job 1: 10 ranges to repair
Why would we allow more ranges than max_repair_ranges_in_parallel? AFAIK - SM won't.
> At some point, job 1 could have finished 8 ranges, only 2 remaining ranges are being worked on.
> If we allow max-jobs-per-node = 2, we could start repair job 2 with 10 ranges to repair, so could repair the extra 3 ranges in parallel which is allowed by max_repair_ranges_in_parallel.
Hmm. I see. In such a case we should be achieving the maximum parallelism not by sending multi-range jobs like today but rather by sending (always) single-range jobs and allowing up to max_repair_ranges_in_parallel such jobs in parallel on every replica set.
This would mean changing the semantics of intensity again.
Let me clarify why I don't like adding yet another parameter - it is going to create a convoluted, spaghetti-like interface which in turn is going to cause user mistakes.
A set of:
- parallel (for controlling the number of replica sets repaired at the same time)
- intensity (for controlling the aggressiveness of a repair process on a single replica set)
represents a full and required set of configuration. And we can fix the issue we are trying to fix here without changing it.
@asias
> @asias please, take a look. IMO this feature makes no sense at all. AFAIU if you schedule more token ranges than max_parallel_ranges they would simply queue up and scylla won't allow more than max_parallel_ranges being repaired at the same time. Is this correct?

> For example: Assume max_repair_ranges_in_parallel = 5; repair job 1: 10 ranges to repair

> Why would we allow more ranges than max_repair_ranges_in_parallel? AFAIK - SM won't.
It does not only when it wants to have less than max_repair_ranges_in_parallel. If the user wants max_repair_ranges_in_parallel, they can send more than max_repair_ranges_in_parallel ranges in a single job.
In theory, it is more efficient to send more ranges in a single job to reduce the overhead of the RESTful API request and wait, especially when the amount of work per range is low because the amount of data per range is small.
Recently, we added the ranges_parallelism parameter to allow the user to control the parallelism of ranges to repair without using the hack of sending X ranges per job.
```json
{
  "name": "ranges_parallelism",
  "description": "An integer specifying the number of ranges to repair in parallel by user request. If this number is bigger than the max_repair_ranges_in_parallel calculated by Scylla core, the smaller one will be used.",
  "required": false,
  "allowMultiple": false,
  "type": "string",
  "paramType": "query"
},
```
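For illustration, a minimal sketch of passing this parameter from a client, assuming Scylla's standard /storage_service/repair_async/{keyspace} endpoint (the address and keyspace are placeholders; only the ranges_parallelism query parameter comes from the spec above):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder node address and keyspace; repair_async starts an asynchronous repair.
	endpoint := "http://127.0.0.1:10000/storage_service/repair_async/my_keyspace"

	q := url.Values{}
	// Capped by Scylla at max_repair_ranges_in_parallel, per the parameter description.
	q.Set("ranges_parallelism", "4")

	resp, err := http.Post(endpoint+"?"+q.Encode(), "application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("repair_async response:", resp.Status)
}
```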
> At some point, job 1 could have finished 8 ranges, only 2 remaining ranges are being worked on. If we allow max-jobs-per-node = 2, we could start repair job 2 with 10 ranges to repair, so could repair the extra 3 ranges in parallel which is allowed by max_repair_ranges_in_parallel.
> Hmm. I see. In such a case we should be achieving the maximum parallelism not by sending multi-range jobs like today but rather by sending (always) single-range jobs and allowing up to max_repair_ranges_in_parallel such jobs in parallel on every replica set. This would mean changing the semantics of intensity again.
I would rather not have such a change.
- It is less efficient to send more jobs than fewer jobs, especially for small tables.
- It takes time for people to understand intensity = ranges to repair in parallel per job.
> Let me clarify why I don't like adding yet another parameter - it is going to create a convoluted, spaghetti-like interface which in turn is going to cause user mistakes.
> A set of:
> - parallel (for controlling the number of replica sets repaired at the same time)
> - intensity (for controlling the aggressiveness of a repair process on a single replica set)
> represents a full and required set of configuration. And we can fix the issue we are trying to fix here without changing it.
> @asias
I understand your concern of misuse.
Another option is to allow parallel to be larger than nr_nodes / rf. Allowing 1, 2, 3, ..., nr_nodes/rf guarantees we have at most 1 job per node. If we allow parallel = 2 * nr_nodes/rf, it means we allow 2 jobs per node. Btw, parallel does not have to be exactly N * nr_nodes/rf (N = 1, 2, 3, ...); it can be any integer within [1, N * nr_nodes/rf].
This way, the meaning of both parallel and intensity remains while we allow max-jobs-per-node > 1 for some cases.
I think we had this idea in the past, but we focused on allowing 1 job per node, so we did not extend the parallel option.
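For illustration, a minimal sketch of the relation described above, assuming a single DC and replica sets of size rf (the helper below is hypothetical, not SM code):

```go
package main

import "fmt"

// impliedJobsPerNode returns how many repair jobs a single node may end up
// running concurrently when `parallel` replica sets are repaired at the same
// time in a cluster of nrNodes nodes with replication factor rf.
// It is the ceiling of parallel / (nrNodes / rf).
func impliedJobsPerNode(parallel, nrNodes, rf int) int {
	perRound := nrNodes / rf // replica sets repairable with at most 1 job per node
	return (parallel + perRound - 1) / perRound
}

func main() {
	// 6-node cluster, RF 3: nr_nodes/rf = 2, so parallel 1-2 keeps 1 job per node,
	// while parallel 3-4 implies up to 2 jobs per node.
	for _, p := range []int{1, 2, 3, 4} {
		fmt.Printf("parallel=%d -> up to %d job(s) per node\n", p, impliedJobsPerNode(p, 6, 3))
	}
}
```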
Hi, just a user perspective chiming in, and I don't know if this is better addressed here by SM or if this is a Scylla DB issue, but when running the latest SM on a small AWS cluster of 3 is4gen.xlarge nodes, with RF 3 and about 2 TB of data, we are seeing a full repair take over a week. I know that the is4gen family has a lower number of CPU shards available per TB of storage, so I'm wondering where the bottleneck is; it isn't clear to me as a user. Is the Scylla DB side putting too low a limit on how much repair activity can be going on at once? Our cluster is usually only around 25% loaded, so the CPUs aren't overloaded. I have the repair job in SM set to use --intensity 0, so in theory it should use the max possible repair speed, right?
Anyway, if you're wondering if users want a way to speed up repairs, the answer is yes. Thanks!
> Another option is to allow parallel to be larger than nr_nodes / rf. Allowing 1, 2, 3, ..., nr_nodes/rf guarantees we have at most 1 job per node. If we allow parallel = 2 * nr_nodes/rf, it means we allow 2 jobs per node. Btw, parallel does not have to be exactly N * nr_nodes/rf (N = 1, 2, 3, ...); it can be any integer within [1, N * nr_nodes/rf].
I like this approach. The 0 value would still mean the maximum number of repair jobs that preserves the 1 job per 1 host rule.
The only inconvenience would be that different keyspaces have replica sets of different sizes. E.g. in a 6 node cluster with 1 DC, repairing a keyspace with RF 2 and parallel 6 would result in each node participating in 2 repair jobs at the same time. On the other hand, repairing a keyspace with RF 6 and parallel 6 would result in each node participating in 6 repair jobs at the same time.
Of course, it's possible to schedule two separate repairs with different parallel values to solve this problem, but it adds some complexity for the user, who has to decide how to split their repair task and how to schedule it.
Perhaps it would be better if the user could specify the % of all nodes that could take part in the repair task at any time (with a default value of 100% that could be increased to e.g. 200% if we wanted each node to participate in 2 repair jobs at once). Unfortunately, this would mean another change in the parallel semantics, which is something that we want to avoid here.
@vladzcloudius @tzach are you ok with the semantics proposed by @asias?
> The only inconvenience would be that different keyspaces have replica sets of different sizes. E.g. in a 6 node cluster with 1 DC, repairing a keyspace with RF 2 and parallel 6 would result in each node participating in 2 repair jobs at the same time. On the other hand, repairing a keyspace with RF 6 and parallel 6 would result in each node participating in 6 repair jobs at the same time.
The --parallel and --intensity flags could also be extended with a per keyspace/table format
(similar to backup's --upload-parallel flag):
--parallel 'keyspace1:6,keyspace2:5,keyspace2.table1:10'
Where undefined keyspaces are repaired with the default --parallel 0.
This would solve the problem mentioned above, but it requires feedback about the usefulness of this change.
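For illustration only, a minimal sketch of how such a per keyspace/table value could be parsed (the format is the proposal above; the function is hypothetical, not existing SM code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePerTableParallel parses a value like "keyspace1:6,keyspace2:5,keyspace2.table1:10"
// into a map from keyspace or keyspace.table to its --parallel value.
func parsePerTableParallel(s string) (map[string]int, error) {
	out := make(map[string]int)
	for _, part := range strings.Split(s, ",") {
		kv := strings.Split(part, ":")
		if len(kv) != 2 {
			return nil, fmt.Errorf("invalid entry %q", part)
		}
		n, err := strconv.Atoi(kv[1])
		if err != nil {
			return nil, fmt.Errorf("invalid parallel value in %q: %w", part, err)
		}
		out[kv[0]] = n
	}
	return out, nil
}

func main() {
	m, err := parsePerTableParallel("keyspace1:6,keyspace2:5,keyspace2.table1:10")
	if err != nil {
		panic(err)
	}
	// Keyspaces and tables not listed would fall back to the default --parallel 0.
	fmt.Println(m)
}
```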
@vladzcloudius @tzach are you fine with the approach suggested by Asias and with extending flags to support per keyspace/table values?
> @asias please, take a look. IMO this feature makes no sense at all. AFAIU if you schedule more token ranges than max_parallel_ranges they would simply queue up and scylla won't allow more than max_parallel_ranges being repaired at the same time. Is this correct?

> For example: Assume max_repair_ranges_in_parallel = 5; repair job 1: 10 ranges to repair

> Why would we allow more ranges than max_repair_ranges_in_parallel? AFAIK - SM won't.

> It does not only when it wants to have less than max_repair_ranges_in_parallel. If the user wants max_repair_ranges_in_parallel, they can send more than max_repair_ranges_in_parallel ranges in a single job.
But scylla won't allow more than max_repair_ranges_in_parallel ranges in parallel so what's the point of sending more? It will only consume RAM while queuing up, no?
> In theory, it is more efficient to send more ranges in a single job to reduce the overhead of the RESTful API request and wait, especially when the amount of work per range is low because the amount of data per range is small.
The REST API overhead should be ignored because any single range repair (even for sparse tables) is going to take many orders of magnitude longer than the corresponding REST API overhead.
> Recently, we added the ranges_parallelism parameter to allow the user to control the parallelism of ranges to repair without using the hack of sending X ranges per job.
> { "name":"ranges_parallelism", "description":"An integer specifying the number of ranges to repair in parallel by user request. If this number is bigger than the max_repair_ranges_in_parallel calculated by Scylla core, the smaller one will be used.", "required":false, "allowMultiple":false, "type":"string", "paramType":"query" },

> At some point, job 1 could have finished 8 ranges, only 2 remaining ranges are being worked on. If we allow max-jobs-per-node = 2, we could start repair job 2 with 10 ranges to repair, so could repair the extra 3 ranges in parallel which is allowed by max_repair_ranges_in_parallel.
> Hmm. I see. In such a case we should be achieving the maximum parallelism not by sending multi-range jobs like today but rather by sending (always) single-range jobs and allowing up to max_repair_ranges_in_parallel such jobs in parallel on every replica set. This would mean changing the semantics of intensity again.

> I would rather not have such a change.
> - It is less efficient to send more jobs than fewer jobs, especially for small tables.
Not really. The problem we are discussing here is caused by the fact that we are repairing multiple ranges at the same time.
If we were always repairing a single range in a single repair task, and always made sure there are intensity such tasks running on every replica set, this would achieve the required parallelism on all replicas at all times. This would also avoid the inefficiency you described above when a multi-range repair task is used.
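For illustration, a rough sketch of the scheduling idea above: always submit single-range jobs and keep up to intensity of them in flight per replica set (illustrative pseudo-structure only, not SM's actual repair service):

```go
package main

import (
	"fmt"
	"sync"
)

// repairRange stands in for the call that submits one single-range repair
// job to Scylla and waits for it to finish (illustrative only).
func repairRange(replicaSet string, tokenRange int) {
	fmt.Printf("repaired %s range %d\n", replicaSet, tokenRange)
}

// repairReplicaSet keeps at most `intensity` single-range jobs in flight,
// so the replica set stays saturated while ranges remain, instead of a
// multi-range job draining down to its last one or two ranges.
func repairReplicaSet(replicaSet string, ranges []int, intensity int) {
	sem := make(chan struct{}, intensity) // limits in-flight single-range jobs
	var wg sync.WaitGroup
	for _, r := range ranges {
		sem <- struct{}{}
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			defer func() { <-sem }()
			repairRange(replicaSet, r)
		}(r)
	}
	wg.Wait()
}

func main() {
	repairReplicaSet("rs1", []int{1, 2, 3, 4, 5, 6, 7}, 3)
}
```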
> - It takes time for people to understand intensity = ranges to repair in parallel per job.
From the user perspective the semantics is not going to change:
- parallel - the number of replica sets repaired at the same time in the cluster.
- intensity - the concurrency of a single replica set repair task.

Users couldn't care less about how SM is going to implement these semantics. ;)
> Let me clarify why I don't like adding yet another parameter - it is going to create a convoluted, spaghetti-like interface which in turn is going to cause user mistakes.
> A set of:
> - parallel (for controlling the number of replica sets repaired at the same time)
> - intensity (for controlling the aggressiveness of a repair process on a single replica set)
> represents a full and required set of configuration. And we can fix the issue we are trying to fix here without changing it.
> @asias
> I understand your concern of misuse.
> Another option is to allow parallel to be larger than nr_nodes / rf. Allowing 1, 2, 3, ..., nr_nodes/rf guarantees we have at most 1 job per node. If we allow parallel = 2 * nr_nodes/rf, it means we allow 2 jobs per node. Btw, parallel does not have to be exactly N * nr_nodes/rf (N = 1, 2, 3, ...); it can be any integer within [1, N * nr_nodes/rf].
> This way, the meaning of both parallel and intensity remains while we allow max-jobs-per-node > 1 for some cases.
> I think we had this idea in the past, but we focused on allowing 1 job per node, so we did not extend the parallel option.
I don't understand what's the point of having more than max_repair_ranges_in_parallel ranges scheduled for repair on a single node if scylla is not going to allow them to run in parallel.
It's only going to create unnecessary queuing on the Scylla side and won't give any speedup.
If I'm missing anything, please, speak up.
IMO not issuing multi-range tasks is the best way to control and ensure a specified repair concurrency.
@asias WDYT?