large Metagenome comparison bad_alloc

Open jianshu93 opened this issue 2 years ago • 2 comments

Hello Daniel,

I am comparing 100 metagenomes, about 1 TB in total. I assigned 2 TB of memory, but I always get the following error after a few hours:

#Calling Dashing2 version v2.1.9 with command '/scratch/jianshu/interleaved_GWMC2/dashing2 dashing2 sketch --threads 24 --pminhash -k 21 -S 12000 AUBR2B_rmdup_trim_filter.interleaved.fa AUEP2AB_rmdup_trim_filter.interleaved.fa AUEP2AC_rmdup_trim_filter.interleaved.fa AUEP2BC_rmdup_trim_filter.interleaved.fa BRBH3C_rmdup_trim_filter.interleaved.fa CLVPT2_rmdup_trim_filter.interleaved.fa CNCD2C_rmdup_trim_filter.interleaved.fa CNCD4C_rmdup_trim_filter.interleaved.fa CNDL1AC_rmdup_trim_filter.interleaved.fa CNDL1BC_rmdup_trim_filter.interleaved.fa CNJN2C_rmdup_trim_filter.interleaved.fa CNJN4C_rmdup_trim_filter.interleaved.fa CNSH1C_rmdup_trim_filter.interleaved.fa CNSH2C_rmdup_trim_filter.interleaved.fa CNSH3C_rmdup_trim_filter.interleaved.fa CNSH4C_rmdup_trim_filter.interleaved.fa CNSH5C_rmdup_trim_filter.interleaved.fa CNSY3A_rmdup_trim_filter.interleaved.fa CNSY3B_rmdup_trim_filter.interleaved.fa CNSY3C_rmdup_trim_filter.interleaved.fa CNSZ1C_rmdup_trim_filter.interleaved.fa CNSZ2C_rmdup_trim_filter.interleaved.fa CNSZ3AB_rmdup_trim_filter.interleaved.fa CNSZ3AC_rmdup_trim_filter.interleaved.fa CNSZ4AC_rmdup_trim_filter.interleaved.fa CNWH1C_rmdup_trim_filter.interleaved.fa CNWH2C_rmdup_trim_filter.interleaved.fa CNWH4C_rmdup_trim_filter.interleaved.fa CNWX1AC_rmdup_trim_filter.interleaved.fa CNWX2C_rmdup_trim_filter.interleaved.fa CNWX3AC_rmdup_trim_filter.interleaved.fa CNWX3BC_rmdup_trim_filter.interleaved.fa CNWX4C_rmdup_trim_filter.interleaved.fa CNXA2C_rmdup_trim_filter.interleaved.fa CNXA4C_rmdup_trim_filter.interleaved.fa CNXM1C_rmdup_trim_filter.interleaved.fa CNXM3C_rmdup_trim_filter.interleaved.fa DEKS1B_rmdup_trim_filter.interleaved.fa ITLF2B_rmdup_trim_filter.interleaved.fa SAKB5_rmdup_trim_filter.interleaved.fa SEGL1C_rmdup_trim_filter.interleaved.fa SESD1C_rmdup_trim_filter.interleaved.fa TWKS2C_rmdup_trim_filter.interleaved.fa TWTN2C_rmdup_trim_filter.interleaved.fa USAG1C_rmdup_trim_filter.interleaved.fa USAG2C_rmdup_trim_filter.interleaved.fa 
USAK1D2_rmdup_trim_filter.interleaved.fa USAK1D3_rmdup_trim_filter.interleaved.fa USAT02C_rmdup_trim_filter.interleaved.fa USAT04C_rmdup_trim_filter.interleaved.fa USBT2C_rmdup_trim_filter.interleaved.fa USBT5C_rmdup_trim_filter.interleaved.fa USCB1AC_rmdup_trim_filter.interleaved.fa USCB1CC_rmdup_trim_filter.interleaved.fa USCG2C_rmdup_trim_filter.interleaved.fa USCG3C_rmdup_trim_filter.interleaved.fa USCG4AC_rmdup_trim_filter.interleaved.fa USCG4BC_rmdup_trim_filter.interleaved.fa USDC2C_rmdup_trim_filter.interleaved.fa USFT4C_rmdup_trim_filter.interleaved.fa USFT5C_rmdup_trim_filter.interleaved.fa USHS3A_rmdup_trim_filter.interleaved.fa USKN1AB_rmdup_trim_filter.interleaved.fa USMD1C_rmdup_trim_filter.interleaved.fa USMD2C_rmdup_trim_filter.interleaved.fa USMD3C_rmdup_trim_filter.interleaved.fa USMD4C_rmdup_trim_filter.interleaved.fa USMI1C_rmdup_trim_filter.interleaved.fa USMI4C_rmdup_trim_filter.interleaved.fa USNO2D13_rmdup_trim_filter.interleaved.fa USOK01C1_rmdup_trim_filter.interleaved.fa USOK03A1_rmdup_trim_filter.interleaved.fa USOK06C_rmdup_trim_filter.interleaved.fa USOP1AC_rmdup_trim_filter.interleaved.fa USOP1BC_rmdup_trim_filter.interleaved.fa USPT1C_rmdup_trim_filter.interleaved.fa USPT3C_rmdup_trim_filter.interleaved.fa USRE6C_rmdup_trim_filter.interleaved.fa USSD2BC_rmdup_trim_filter.interleaved.fa USTE3B_rmdup_trim_filter.interleaved.fa USTF1AC_rmdup_trim_filter.interleaved.fa USTF1BC_rmdup_trim_filter.interleaved.fa USVA2C_rmdup_trim_filter.interleaved.fa USVA3C_rmdup_trim_filter.interleaved.fa USVA5C_rmdup_trim_filter.interleaved.fa USVA6C_rmdup_trim_filter.interleaved.fa USVA8C_rmdup_trim_filter.interleaved.fa USVA9C_rmdup_trim_filter.interleaved.fa USVD1AC_rmdup_trim_filter.interleaved.fa USVD1BC_rmdup_trim_filter.interleaved.fa USVD1CC_rmdup_trim_filter.interleaved.fa USVD1DC_rmdup_trim_filter.interleaved.fa USWR1BC_rmdup_trim_filter.interleaved.fa USWR2BC_rmdup_trim_filter.interleaved.fa UYUC06_rmdup_trim_filter.interleaved.fa --cmpout 
../GWMC_pminhash.txt'
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

It seems that a large sketch size for metagenomes takes a lot of memory. Do you have any suggestions for this?

Many Thanks,

Jianshu

jianshu93, Jan 29 '22 17:01

Hi Jianshu,

Thanks for the issue!

I think you're running out of memory while computing the k-mer count map, which happens before the ProbMinHash sketch is built. You can add --countsketch-size <number>, where the number is somewhere around 500k to 50 million, which can reduce the memory footprint of building the sketches. I recommend starting around 5 million.

The other place to reduce memory usage is to use -o <path>, which causes the signatures to be mmap'd. But I would not expect 1000 sketches to run out of memory. I expect that your problem is more likely in k-mer counting.
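Putting the two suggestions together, the adjusted invocation might look roughly like the sketch below. The flags are the ones named in this thread; the input file names and the signature path (`signatures.bin`) are placeholders, and placing the new options before the input list is an assumption, not something confirmed here. It only prints the command so it can be checked before launching a long run.

```shell
# Dry-run sketch: builds and prints the command instead of executing it.
# Flags come from this thread; every file name below is a placeholder.
COUNTSKETCH_SIZE=5000000          # suggested starting point (range: ~500k to 50 million)
SIG_PATH=signatures.bin           # hypothetical path for the mmap'd signatures (-o)
INPUTS="sample1.interleaved.fa sample2.interleaved.fa"   # hypothetical inputs

CMD="dashing2 sketch --threads 24 --pminhash -k 21 -S 12000 \
--countsketch-size $COUNTSKETCH_SIZE -o $SIG_PATH \
$INPUTS --cmpout ../GWMC_pminhash.txt"

# Print for inspection; run it with: eval "$CMD"
echo "$CMD"
```

If memory is still exhausted at 5 million, lowering `COUNTSKETCH_SIZE` toward the bottom of the suggested range is the next thing to try.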

Thanks,

Daniel

dnbaker, Jan 31 '22 19:01

Hi Daniel,

Many thanks for the message! I will try this and let you know what I get.

Jianshu

jianshu93, Feb 01 '22 00:02