sga index segfault with large values of -d

Open sjackman opened this issue 7 years ago • 11 comments

The command sga index -d 20000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa segfaults with -d 20000000. Reducing to -d 1000000 works. Is each BWT batch size limited in size, perhaps to 2 or 4 billion nucleotides? -d 20000000 with a mean sequence size of ~300 bp should correspond to a batch size of about 6 Gbp.

sjackman avatar Nov 03 '16 21:11 sjackman

Can sga index -a ropebwt work with the output of sga fm-merge? The mean sequence size is 300 bp, and the largest sequence is 30,889 bp.

sjackman avatar Nov 03 '16 21:11 sjackman

Did you run out of memory with -d 20000000? Without -a ropebwt, a memory-inefficient algorithm is used. There is no 2 (or 4) billion nucleotide batch limit.

jts avatar Nov 03 '16 21:11 jts

Whether it is worth using -a ropebwt depends on the read length distribution. I suggest sticking with the recommended parameters (not ropebwt, -d X). It shouldn't take very long.

jts avatar Nov 03 '16 22:11 jts

The fm-merge FASTA file is 20 GB, so it should be possible to construct the BWT in a single pass using SAIS in roughly 200 GB RAM. I reported this issue because of the segfault, which is 😢. I'm happy with the -d 1000000 workaround though.

> Did you run out of memory with -d 20000000?

I don't believe so. It was using 76 GB of RAM when it crashed, and the machine has 2.5 TB available.

> It shouldn't take very long.

I'm using sga index -d 1000000 now. It has finished 41 of 69 batches in four hours, so it's trucking along nicely. 🏎

sjackman avatar Nov 03 '16 23:11 sjackman

Have you read Optimal In-Place Suffix Sorting? https://arxiv.org/abs/1610.08305 It seems worth checking out. @rob-p brought it to my attention.

sjackman avatar Nov 03 '16 23:11 sjackman

sga index -d 1000000 completed in 25 hours.

sga index -d 1000000 -t 64 hsapiens.preprocess.filter.pass.merged.fa
205964.05s user 3080.39s system 232% cpu 24:56:18.90 total 9111 MB

sjackman avatar Nov 07 '16 18:11 sjackman
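
The `232% cpu` figure in the timing line can be reproduced from the other numbers on that line, since CPU utilization is (user + system) / elapsed. A minimal Python sketch of that arithmetic, assuming the line is a standard shell `time` summary and that `24:56:18.90` means hours:minutes:seconds (the helper name `parse_elapsed` is made up for illustration):

```python
def parse_elapsed(hms: str) -> float:
    """Convert an h:mm:ss.ss elapsed-time string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values copied from the timing line above.
user_s = 205964.05
system_s = 3080.39
wall_s = parse_elapsed("24:56:18.90")
threads_requested = 64  # from the -t 64 option

# CPU utilization is total CPU time divided by elapsed time.
cpu_percent = 100 * (user_s + system_s) / wall_s
cores_busy = cpu_percent / 100

print(f"wall clock: {wall_s / 3600:.1f} h")
print(f"CPU utilization: {cpu_percent:.1f}%")
print(f"average busy cores: {cores_busy:.2f} of {threads_requested}")
```

The result should land close to the reported 232%, which means that on average a little over two cores were busy even though 64 threads were requested. The thread does not say which phases of the run were serial.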

Thanks for the update. I did see that paper from @rob-p's Twitter - it's on my to-read list :)

jts avatar Nov 07 '16 18:11 jts

Here are the wall-clock and peak-memory results for SGA on human HG004 data with and without fm-merge (a memo to self and for future curious readers).

| fm-merge | Wall clock (h) | Peak memory (GB) |
| --- | --- | --- |
| FALSE | 65.4 | 270.35938 |
| TRUE | 65.0 | 82.24316 |

sjackman avatar Nov 07 '16 19:11 sjackman
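
To put the two rows side by side, here is a small Python sketch that computes the memory ratio and the relative savings. It uses only the four numbers reported above; the dictionary keys and variable names are illustrative.

```python
# Wall-clock hours and peak memory (GB) copied from the results above.
results = {
    "without fm-merge": {"wallclock_h": 65.4, "peak_gb": 270.35938},
    "with fm-merge": {"wallclock_h": 65.0, "peak_gb": 82.24316},
}

base = results["without fm-merge"]
merged = results["with fm-merge"]

# How many times more memory the run without fm-merge needed.
memory_ratio = base["peak_gb"] / merged["peak_gb"]
# Relative savings from running fm-merge first.
memory_saved_pct = 100 * (1 - merged["peak_gb"] / base["peak_gb"])
time_saved_pct = 100 * (1 - merged["wallclock_h"] / base["wallclock_h"])

print(f"peak memory ratio: {memory_ratio:.2f}x")
print(f"peak memory saved: {memory_saved_pct:.1f}%")
print(f"wall-clock time saved: {time_saved_pct:.1f}%")
```

With these figures, peak memory drops by more than a factor of three while wall-clock time changes by less than one percent.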

Interesting, thanks! I wouldn't have expected the runtimes to be (nearly) the same, but it is good to see.

jts avatar Nov 08 '16 00:11 jts

It was surprising to me too. Running fm-merge first speeds up overlap and assemble quite a bit. I found that rmdup after fm-merge didn't remove any sequences. Is it necessary, or did I just get lucky?

sjackman avatar Nov 08 '16 01:11 sjackman

sga index -d 1000000 succeeded. sga index -d 10000000 succeeded. sga index -d 20000000 segfaulted.

sjackman avatar Nov 09 '16 19:11 sjackman
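
For reference, the three -d values can be converted to approximate batch sizes using the ~300 bp mean sequence length quoted at the top of the thread, and compared against the signed and unsigned 32-bit maxima raised in the original question. This Python sketch is arithmetic only and rests on that mean-length assumption. It does not identify the cause of the segfault, and jts states above that there is no 2 or 4 billion nucleotide batch limit. The function name `batch_bases` is made up for illustration.

```python
MEAN_READ_BP = 300  # approximate mean sequence length quoted in the thread

INT32_MAX = 2**31 - 1   # about 2.1 billion
UINT32_MAX = 2**32 - 1  # about 4.3 billion


def batch_bases(reads_per_batch: int, mean_bp: int = MEAN_READ_BP) -> int:
    """Estimate nucleotides per batch as reads per batch times mean length."""
    return reads_per_batch * mean_bp


# Outcomes as reported in the thread for each -d value.
outcomes = {
    1_000_000: "succeeded",
    10_000_000: "succeeded",
    20_000_000: "segfaulted",
}

for d, outcome in outcomes.items():
    bases = batch_bases(d)
    print(
        f"-d {d:>10,}: ~{bases / 1e9:.1f} Gbp, "
        f"exceeds int32 max: {bases > INT32_MAX}, "
        f"exceeds uint32 max: {bases > UINT32_MAX}, "
        f"{outcome}"
    )
```

Under this estimate the -d 10000000 run (~3 Gbp per batch) already exceeds the signed 32-bit maximum and still succeeded, while the failing -d 20000000 run (~6 Gbp) is the only one above the unsigned 32-bit maximum. That may be a coincidence.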