algebra
More fine-grained control over parallelism
Something that has always irked me about using a library like rayon as part of a larger application is that it is not really possible to dynamically specify how many threads to allocate to particular functions.
Although the major computational workloads are centered around proving (and batch scalar multiplications for powers-of-tau / group FFTs), more fine-grained control would perhaps be nice for something like mobile-phone-based provers, which have many weak cores (typically 6-8).
Still, in general the OS is quite good at load balancing if you ask for more resources than you strictly need, so maybe this is me just being a control freak :sweat_smile:
You can do something like this:
```rust
let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(rayon::current_num_threads() / 2)
    .build()
    .unwrap();
let result = pool.install(|| f());
```
Not sure if this is what you're talking about. @Pratyush and I have discussed creating a nicer abstraction for handling use cases like this but haven't actually gotten to designing something yet. If you have any ideas how to do this, that would be great!
Figuring out a solution here is also probably quite helpful for FFTs, since we shouldn't expect to achieve a truly linear relation between number of cores and time taken to execute an FFT at high core counts. (This is due to some problems with cache locality on hierarchical memory architectures.)
So if two FFTs could be done in parallel, each using half the available cores, this would be better than doing each FFT with all the cores one after another.
```rust
let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(num_threads)
    .build()
    .unwrap();
let result = pool.install(|| f());
```
Basically the snippet @ryanleh posted allows this sort of fine-grained control: it limits all rayon calls within `f` to use only `num_threads` threads.
The only possible drawback is that the trick probably makes most sense at the SNARK level, and can't easily be done any lower (unless we change our APIs to pass in a `num_threads` argument).
We can probably write a nice wrapper that takes a list of tasks (expressed as closures), and automatically invokes the thread limitation as desired.
I'm just not sure how rayon behaves with respect to parent thread pools. For instance, it is plausible that if one calls `par_iter` in the context of a parent `par_iter`, rayon will balance the load between the two, but I'm unclear on how it chooses to balance the load. [Insert results from research here...]
Hence, I think it may be productive to think of a generic interface that lets one specify the number of threads, which could feed into a macro via a variable. This could be a simple `n_threads: Option<usize>` parameter that is managed by the calling function. For FFTs, the top-level calling function could determine the total number of cores available and partition them judiciously.
Let's use this issue to continue to brainstorm use cases, ideas and examples in other codebases.
`par_iter` does not create a new pool; however, inside a pool, `current_num_threads()` returns the number of threads the pool was initialized with.
good talks: rayon scheduler
alternatives to rayon: https://github.com/TimelyDataflow/timely-dataflow