tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Documentation/Tutorial for multithreading

Open hanbin973 opened this issue 10 months ago • 4 comments

Continuing from https://github.com/tskit-dev/tskit/pull/3077.

I think this link from numpy docs is a good starting point: https://numpy.org/doc/2.2/reference/random/multithreading.html

The bottom line is that one can execute multiple computations concurrently with concurrent.futures.ThreadPoolExecutor given that the computation-heavy parts of the program are free from GIL.

hanbin973 avatar Jan 10 '25 10:01 hanbin973

Should this be in the tskit docs or the tutorials? If the latter, I guess it would come under the "parallelisation" tutorial mooted at https://github.com/tskit-dev/tutorials/issues/151#issuecomment-988672281?

hyanwong avatar Jan 10 '25 15:01 hyanwong

The most straightforward mode of parallelization is splitting the job over windows. After splitting, one can add the results (or average them by some weight) to get the final result. genetic_relatedness_vector falls into this category.

I've done some profiling and found that there is a good amount of overhead due to memory allocation for this strategy, especially in large problems. This can be avoided if we could pass a predefined array to the statistics functions and update the array "in-place" via +=. This requires to update the _tskitmodule.c to accept external arrays. The more lower-level C functions are already in-place functions, so they don't require much change. However, this might conflict with common practices in Python.

Any thoughts?

  • Edit: this might not be a big deal after all, at least for genetic_relatedness_vector because book keeping variables that are initialized inside the C functions are way bigger than the result array.

hanbin973 avatar Jan 10 '25 17:01 hanbin973

Are you sure it's memory allocations here and not overhead associated with seeking along the sequence? I'd be surprised if malloc overhead was significant here

jeromekelleher avatar Jan 10 '25 19:01 jeromekelleher

To answer your question, yes, malloc does matter. Here's the result from seq_length=1e7 and num_individuals=1e4 where the weight matrix 100 dimensions. image

However, I think it's not necessary to change any of the API because

  1. The major malloc happens deeper in the C API and not _tskitmodule.c, so my initial speculation was wrong. It requires a lot of work.
  2. The problem, to the extent that I'm aware of, is largely specific to genetic_relatedness_matrix because of the high-dimensional weights. In my particular application, this dimension can go up to tens of thousands. Most statistics won't require this much weights.

For each thread, it initializes two arrays of the size num_weights * num_nodes, totaling num_weights * num_nodes * num_threads.

hanbin973 avatar Jan 10 '25 23:01 hanbin973