cgmlst-dists

Problem with large input

Open adipi71 opened this issue 9 months ago • 2 comments

We have investigated the issue with cgmlst-dists when handling large input files (the error was reported with 80k Lm samples). The tool crashes with a segmentation fault. The bug is due to an incorrect memory allocation for the distance vector: the memory size is calculated as nrow * nrow, which causes an integer overflow for large nrow when computed in 32 bits (line 219 in the original version). The maximum value that can be stored in an int variable is 2,147,483,647; in our case, the final dist vector size would be 80,000 * 80,000 = 6,400,000,000 > 2,147,483,647. The product is needed because the tool stores the distance matrix as a single flat vector and indexes it as a matrix, which is a nice optimization.
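To make the arithmetic concrete, here is a minimal sketch in C. The function names are hypothetical and are not taken from the cgmlst-dists source. Signed overflow is undefined behaviour in C, so the sketch computes the wrapped value with unsigned arithmetic; that is only an illustration of what a 32-bit multiplication typically leaves behind, not a claim about what any particular compiler does.

```c
#include <stdint.h>

/* Exact number of cells in an nrow x nrow matrix, computed in 64 bits. */
static int64_t cells_64(int64_t nrow) {
    return nrow * nrow;
}

/* The same product reduced modulo 2^32. The multiplication is done in
   64 bits and then truncated, so the result is well defined everywhere. */
static uint32_t cells_wrapped_32(uint32_t nrow) {
    return (uint32_t)((uint64_t)nrow * nrow);
}
```

For nrow = 80,000 the exact count is 6,400,000,000, while the value reduced modulo 2^32 is 2,105,032,704. A buffer sized from the wrapped value would be far too small, so writes to later cells would land outside it, which is consistent with the segmentation fault we observed.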

We simply included the inttypes.h header and switched the size calculation to 64-bit integers to avoid the overflow. We have successfully tested the fix on 80,000 samples and 1,748 loci.
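A rough sketch of this kind of fix is shown below. It is not the actual patch: dist_alloc and dist_index are hypothetical names, and the int element type is an assumption about how the distances are stored.

```c
#include <inttypes.h>  /* fixed-width integer types; also pulls in <stdint.h> */
#include <stdlib.h>

/* Allocate an nrow x nrow matrix stored as one flat, zeroed vector.
   Returns NULL if nrow is not positive, if the byte count would not
   fit in size_t, or if the allocation itself fails. */
static int *dist_alloc(int64_t nrow) {
    if (nrow <= 0) return NULL;
    uint64_t n = (uint64_t)nrow;
    /* Reject sizes where n * n * sizeof(int) would overflow size_t. */
    if (n > SIZE_MAX / sizeof(int) / n) return NULL;
    return calloc((size_t)(n * n), sizeof(int));
}

/* Flat index of cell (i, j), computed in 64 bits so that it cannot
   wrap for matrices with more than 2^31 - 1 cells. */
static int64_t dist_index(int64_t nrow, int64_t i, int64_t j) {
    return i * nrow + j;
}
```

Note that the index computation needs the same 64-bit treatment as the allocation: with 80,000 rows the last cell has flat index 6,399,999,999, which also does not fit in a 32-bit int.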

We look forward to your feedback on this. Best, Adriano

adipi71, Sep 26 '23 10:09