Max number of samples?

Open jjfarrell opened this issue 9 months ago • 3 comments

Would somalier work on 70k samples? Is there a maximum number of samples that it would be practical to run somalier on?

jjfarrell, Feb 19 '25 13:02

70K will probably be pushing it. somalier is still n^2 in memory and time; it's just fast and efficient. I would think 20K would be OK, but please let me know how far you get.
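
To put the n^2 scaling in numbers, here is a small Python sketch that counts the unordered sample pairs a pairwise relatedness comparison has to cover. It is only arithmetic, not somalier code; `pair_count` is a hypothetical helper, and nothing here assumes how many bytes somalier stores per pair.

```python
def pair_count(n_samples: int) -> int:
    """Number of unordered sample pairs: n * (n - 1) / 2."""
    return n_samples * (n_samples - 1) // 2


if __name__ == "__main__":
    baseline = pair_count(20_000)
    # Sample counts mentioned in this thread.
    for n in (20_000, 70_000, 955_000):
        pairs = pair_count(n)
        print(f"{n:>9,} samples -> {pairs:>18,} pairs "
              f"({pairs / baseline:.1f}x the 20K case)")
```

Going from 20K to 70K samples multiplies the number of pairs by roughly 12, which is why a run that is comfortable at 20K may not be at 70K.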

brentp, Feb 19 '25 15:02

I just checked to see what gnomAD is using for relationship inference. They use cuking, a tool whose algorithm is based on Somalier but is implemented for Nvidia GPUs and runs on Google Cloud.

We have some GPU nodes on our HPCC at BU, so I will first try to get cuking running there. This is how the cuking documentation describes its approach to memory, which is limited on GPUs:

If the number of samples and sites is so large that they won't fit into the memory of a single GPU (40 GB for a2-highgpu-1g machines), the computation can be sharded. Sharding works by splitting the full relatedness matrix into submatrices that are computed independently, so the results can be easily combined afterwards.

For example, to halve memory requirements, the full matrix can be split into 4⋅4=16 equally sized submatrices (i.e. a "split factor" of 4). Only the "upper triangular" submatrices need to be evaluated due to symmetry of relatedness, leading to 10 shards.
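
The shard count in that example can be checked with a short Python sketch. This is an illustration of the idea only, not cuking's code: the function names are hypothetical, and the way sample ranges are assigned to blocks here (near-equal contiguous ranges) is an assumption.

```python
from itertools import combinations_with_replacement


def upper_triangular_shards(split_factor: int) -> list[tuple[int, int]]:
    """Block pairs (row, col) with row <= col; symmetry makes the rest redundant."""
    return list(combinations_with_replacement(range(split_factor), 2))


def block_bounds(n_samples: int, split_factor: int, block: int) -> tuple[int, int]:
    """Half-open sample index range [start, stop) for one block (assumed layout)."""
    start = block * n_samples // split_factor
    stop = (block + 1) * n_samples // split_factor
    return start, stop


if __name__ == "__main__":
    split_factor = 4
    shards = upper_triangular_shards(split_factor)
    print(f"{len(shards)} shards out of {split_factor ** 2} submatrices")
    # Show which sample ranges each shard would compare, for 70,000 samples.
    for row, col in shards:
        rows = block_bounds(70_000, split_factor, row)
        cols = block_bounds(70_000, split_factor, col)
        print(f"shard ({row}, {col}): samples {rows} vs samples {cols}")
```

For a split factor k this gives k(k+1)/2 shards, so k = 4 yields the 10 shards quoted above. Each shard touches at most two blocks of samples, so it can be computed independently of the others.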

gnomAD v4.0 used cuking for 955,000 samples and it ran in 1.5 hours ($243)! It is not clear how many GPUs were used. The combination of GPUs and the fast and efficient Somalier algorithm is quite amazing.

jjfarrell, Feb 20 '25 13:02

I didn't know about cuking, that's very cool! Indeed, it seems like a good problem for GPUs.

brentp, Feb 20 '25 17:02