Documentation/Tutorial for multithreading
Continuing from https://github.com/tskit-dev/tskit/pull/3077.
I think this link from numpy docs is a good starting point: https://numpy.org/doc/2.2/reference/random/multithreading.html
The bottom line is that one can execute multiple computations concurrently with concurrent.futures.ThreadPoolExecutor, provided that the computation-heavy parts of the program release the GIL.
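For concreteness, here is a minimal sketch of that pattern, not tskit-specific. The `row_block_product` helper and the array sizes are made up for illustration, and the heavy step is a NumPy matrix product, which I'm assuming releases the GIL as NumPy's compiled routines generally do.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def row_block_product(a_block, b):
    # Hypothetical stand-in for a GIL-releasing computation.
    return a_block @ b


rng = np.random.default_rng(42)
a = rng.normal(size=(400, 50))
b = rng.normal(size=(50, 30))

# Split the rows of `a` into one block per worker.
blocks = np.array_split(a, 4, axis=0)

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map returns results in the same order as the inputs.
    parts = list(executor.map(lambda blk: row_block_product(blk, b), blocks))

# Stitch the per-thread results back together and compare with a serial run.
threaded = np.vstack(parts)
serial = a @ b
```

Whether this gives a real speed-up depends on how much of the runtime is spent outside the GIL; any pure-Python bookkeeping around the heavy call still runs one thread at a time.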
Should this be in the tskit docs or the tutorials? If the latter, I guess it would come under the "parallelisation" tutorial mooted at https://github.com/tskit-dev/tutorials/issues/151#issuecomment-988672281?
The most straightforward mode of parallelization is splitting the job over windows. After splitting, one can add the results (or average them by some weight) to get the final result. genetic_relatedness_vector falls into this category.
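A rough sketch of the split-then-combine step, using plain NumPy rather than tskit so that it runs on its own. The `per_site` array and the `window_stat` function are hypothetical stand-ins for a real per-window statistic; the only point is how the window results are recombined.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(1)
sequence_length = 1000
# Hypothetical per-position contributions to some additive statistic.
per_site = rng.random(sequence_length)

# Window breakpoints covering [0, sequence_length).
breakpoints = np.array([0, 100, 400, 1000])
windows = list(zip(breakpoints[:-1], breakpoints[1:]))


def window_stat(window):
    # Hypothetical per-window job: span-normalised value on [left, right).
    left, right = window
    return per_site[left:right].mean()


with ThreadPoolExecutor(max_workers=3) as executor:
    values = np.array(list(executor.map(window_stat, windows)))

spans = np.diff(breakpoints)
# Span-normalised window values combine as a span-weighted average...
combined_mean = np.average(values, weights=spans)
# ...while unnormalised window totals simply add up.
combined_total = np.sum(values * spans)
```

The weights here are the window spans because the per-window values are normalised by length; if each window returned an unnormalised total instead, a plain sum would be enough.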
I've done some profiling and found that this strategy carries a good amount of overhead from memory allocation, especially in large problems. This could be avoided if we could pass a preallocated array to the statistics functions and update it in place via +=. That would require updating _tskitmodule.c to accept external arrays. The lower-level C functions already operate in place, so they wouldn't need much change. However, this might conflict with common practices in Python.
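To make the calling convention concrete, here is a sketch of what an "accumulate into a caller-supplied array" interface could look like, modelled loosely on NumPy's `out=` idiom. Nothing here is tskit API: `accumulate_stat`, its `out` parameter, and the matrix product inside it are all hypothetical.

```python
import numpy as np


def accumulate_stat(weights, window_data, out=None):
    # Hypothetical statistic: window_data @ weights, added into `out`.
    result_shape = (window_data.shape[0], weights.shape[1])
    if out is None:
        # Fallback that allocates a fresh result array on every call.
        out = np.zeros(result_shape)
    # In this NumPy stand-in the product still creates a temporary;
    # a C implementation could write into `out` directly.
    out += window_data @ weights
    return out


rng = np.random.default_rng(0)
weights = rng.normal(size=(20, 5))
window_blocks = [rng.normal(size=(8, 20)) for _ in range(3)]

# One preallocated result array reused across all windows.
total = np.zeros((8, 5))
for block in window_blocks:
    accumulate_stat(weights, block, out=total)

# Reference value computed the allocating way.
expected = sum(block @ weights for block in window_blocks)
```

With threads, each worker would need its own accumulator (summed at the end) or a lock around the update, since `+=` on a shared array is not atomic.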
Any thoughts?
- Edit: this might not be a big deal after all, at least for genetic_relatedness_vector, because the bookkeeping variables that are initialized inside the C functions are way bigger than the result array.
Are you sure it's memory allocations here and not overhead associated with seeking along the sequence? I'd be surprised if malloc overhead was significant here.
To answer your question, yes, malloc does matter. Here's the result with seq_length=1e7 and num_individuals=1e4, where the weight matrix has 100 dimensions.
However, I think it's not necessary to change any of the API because:
- The major malloc happens deeper in the C API and not in _tskitmodule.c, so my initial speculation was wrong. Changing it would require a lot of work.
- The problem, to the extent that I'm aware of, is largely specific to genetic_relatedness_matrix because of the high-dimensional weights. In my particular application, this dimension can go up to tens of thousands. Most statistics won't require this many weights.
For each thread, it initializes two arrays of size num_weights * num_nodes, so the total memory scales as num_weights * num_nodes * num_threads.
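As a back-of-the-envelope check on that scaling, the arithmetic can be written out as below. The `bookkeeping_bytes` helper is hypothetical, the 8 bytes per value assumes double-precision entries, and the sizes are illustrative rather than taken from the benchmark above.

```python
def bookkeeping_bytes(num_weights, num_nodes, num_threads,
                      arrays_per_thread=2, bytes_per_value=8):
    # bytes_per_value=8 assumes double-precision values (an assumption).
    values_per_array = num_weights * num_nodes
    return arrays_per_thread * values_per_array * num_threads * bytes_per_value


# Hypothetical sizes: 100 weights, 20,000 nodes, 8 threads.
total = bookkeeping_bytes(num_weights=100, num_nodes=20_000, num_threads=8)
total_gib = total / 2**30
print(f"{total} bytes ({total_gib:.2f} GiB)")  # 256000000 bytes (0.24 GiB)

# With tens of thousands of weights the same formula grows 100-fold or more.
large = bookkeeping_bytes(num_weights=10_000, num_nodes=20_000, num_threads=8)
```

Under these assumptions the footprint grows linearly in each of the three factors, which is why the high-dimensional weights case stands out.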