tskit Documentation/Tutorial for multithreading

Continuing from https://github.com/tskit-dev/tskit/pull/3077.

I think this link from numpy docs is a good starting point: https://numpy.org/doc/2.2/reference/random/multithreading.html

The bottom line is that one can execute multiple computations concurrently with concurrent.futures.ThreadPoolExecutor given that the computation-heavy parts of the program are free from GIL.

Jan 10 '25 10:01 hanbin973

Should this be in the tskit docs or the tutorials? If the latter, I guess it would come under the "parallelisation" tutorial mooted at https://github.com/tskit-dev/tutorials/issues/151#issuecomment-988672281?

Jan 10 '25 15:01 hyanwong

The most straightforward mode of parallelization is splitting the job over windows. After splitting, one can add the results (or average them by some weight) to get the final result. genetic_relatedness_vector falls into this category.

I've done some profiling and found that there is a good amount of overhead due to memory allocation for this strategy, especially in large problems. This can be avoided if we could pass a predefined array to the statistics functions and update the array "in-place" via +=. This requires to update the _tskitmodule.c to accept external arrays. The more lower-level C functions are already in-place functions, so they don't require much change. However, this might conflict with common practices in Python.

Any thoughts?

Edit: this might not be a big deal after all, at least for genetic_relatedness_vector because book keeping variables that are initialized inside the C functions are way bigger than the result array.

Jan 10 '25 17:01 hanbin973

Are you sure it's memory allocations here and not overhead associated with seeking along the sequence? I'd be surprised if malloc overhead was significant here

Jan 10 '25 19:01 jeromekelleher

To answer your question, yes, malloc does matter. Here's the result from seq_length=1e7 and num_individuals=1e4 where the weight matrix 100 dimensions.

However, I think it's not necessary to change any of the API because

The major malloc happens deeper in the C API and not _tskitmodule.c, so my initial speculation was wrong. It requires a lot of work.
The problem, to the extent that I'm aware of, is largely specific to genetic_relatedness_matrix because of the high-dimensional weights. In my particular application, this dimension can go up to tens of thousands. Most statistics won't require this much weights.

For each thread, it initializes two arrays of the size num_weights * num_nodes, totaling num_weights * num_nodes * num_threads.

Jan 10 '25 23:01 hanbin973