bwa-mem2 icon indicating copy to clipboard operation
bwa-mem2 copied to clipboard

Range violation

Open christopher-schroeder opened this issue 5 years ago • 19 comments

Hello, I have a rather difficult problem. I am aligning several wgs samples, each about 100gb filesize, and for one of them bwa-mem2 avx just stops at 17gb without any error. bwa-mem2 sse4.1 stops at the same position with the following error.

[0000] 3. Calling kt_for - worker_sam
        [0000][ M::mem_process_seqs] Processed 5868668 reads in 2703.295 CPU sec, 32.876 real sec
[0000] read_chunk: 880000000, work_chunk_size: 880000219, nseq: 5868866
        [0000][ M::kt_pipeline] read 5868866 sequences (880000219 bp)...
[0000] Calling mem_process_seqs.., task: 34
[0000] 1. Calling kt_for - worker_bwt
core.exception.RangeError@BioD/bio/std/hts/bam/read.d(226): Range violation

Unfortunately I cannot provide a minimum example. If I reduce the fastq files to just a million around this "critical point" it runs perfectly fine and produces the expected values. I am willing to debug something, if you would like me to.

Best Christo

ps: bwa-mem (the normal old version) is able to process the sample without any errors.

christopher-schroeder avatar Aug 10 '20 19:08 christopher-schroeder

The error message is raised by the BioD package, bwa itself doesn't read or write bam files. Although you'd expect the surrounding tools to behave the same are you sure the issue is with bwa-mem2 itself and not something feeding in to or reading from bwa?

keiranmraine avatar Aug 11 '20 08:08 keiranmraine

I will rerun and pipe to /dev/null to see what happens

christopher-schroeder avatar Aug 11 '20 08:08 christopher-schroeder

If you have sufficient space I'd recommend the following while determining if BWA is the source of the issue

bwa-mem .... | gzip -c -2 > sam.gz

Then you can also see if the issue remains in the rest of the flow via:

zcat sam.gz | <other tools>

This may also help identify if an individual read is malformed by slicing up that file later. Possibly also worth determining if the remainder of the flow works with the samtools implementations of the downstream processing (if possible).

keiranmraine avatar Aug 11 '20 08:08 keiranmraine

I used pigz to speed up the compression process. The command was

(bwa-mem2 mem -t 88 -R '@RG\tID:1875-0_Blood\tSM:1875-0_Blood\tPL:ILLUMINA' resources/genome.fasta results/merged/1875-0_Blood_R1.fastq.gz results/merged/1875-0_Blood_R2.fastq.gz | pigz -c > 1875-0_Blood.sam.gz) 2> 1875-0_Blood.log

I counted

zcat 1875-0_Blood.sam.gz | wc -l

and got 195,759,052 lines.

A count of the bwa aligned file

sambamba view -c 1875-0_Blood.bwa.bam

shows 2,239,533,143 aligned reads (including split map secondary reads).

The 1875-0_Blood.log tail:

[0000] Calling mem_process_seqs.., task: 33
[0000] 1. Calling kt_for - worker_bwt
[0000] 2. Calling kt_for - worker_aln
[0000] Inferring insert size distribution of PE reads from data, l_pac: 3099750718, n: 5868668
[0000][PE] # candidate unique pairs for (FF, FR, RF, RR): (205, 2675344, 7, 190)
[0000][PE] analyzing insert size distribution for orientation FF...
[0000][PE] (25, 50, 75) percentile: (617, 933, 1604)
[0000][PE] low and high boundaries for computing mean and std.dev: (1, 3578)
[0000][PE] mean and std.dev: (1075.27, 744.14)
[0000][PE] low and high boundaries for proper pairs: (1, 4565)
[0000][PE] analyzing insert size distribution for orientation FR...
[0000][PE] (25, 50, 75) percentile: (284, 349, 421)
[0000][PE] low and high boundaries for computing mean and std.dev: (10, 695)
[0000][PE] mean and std.dev: (353.79, 106.57)
[0000][PE] low and high boundaries for proper pairs: (1, 832)
[0000][PE] skip orientation RF as there are not enough pairs
[0000][PE] analyzing insert size distribution for orientation RR...
[0000][PE] (25, 50, 75) percentile: (775, 1103, 1737)
[0000][PE] low and high boundaries for computing mean and std.dev: (1, 3661)
[0000][PE] mean and std.dev: (1194.15, 693.38)
[0000][PE] low and high boundaries for proper pairs: (1, 4623)
[0000][PE] skip orientation FF
[0000][PE] skip orientation RR
[0000] 3. Calling kt_for - worker_sam
        [0000][ M::mem_process_seqs] Processed 5868668 reads in 1847.586 CPU sec, 22.723 real sec
[0000] read_chunk: 880000000, work_chunk_size: 880000219, nseq: 5868866
        [0000][ M::kt_pipeline] read 5868866 sequences (880000219 bp)...
[0000] Calling mem_process_seqs.., task: 34
[0000] 1. Calling kt_for - worker_bwt

and tail of the sam.gz:

H3NGLALXX:5:23:4769918:0        99      2       25903692        60      150M    =       25904068        502     TTTCCTTATTAAAAGTTCCAAATATAGGCCAGGCGCGGTGGCTCATGCCTATAATGCCAGCACTTTGGGAGGTCAAGGCAGATGGATCACGAGGTGAGCTGTTCGAGACCAGCCTGGCTAACATGATGAAATCCTATCTCTACTAAAACT  AAAFFKKFKAAKKKKKKKKKKKKKKKKKKKKKK<AKKKKAKAKKKKKKFK<FK<FKKFKKFKKA<FKKK<FFAAAAKFKKKAFKKFKFFKAAF<A,<7FAK,,,<<<K7A<7<,7AFFAFA<77AA,A,,,A<A7,,,,,,,,,<<KA,,  NM:i:2  MD:Z:95C52A1    MC:Z:24S126M    AS:i:143        XS:i:54 RG:Z:1875-0_Blood
H3NGLALXX:5:23:4769918:0        147     2       25904068        60      24S126M =       25903692        -502    AAAGCAGTACAATGGGCGAGCCGGGGTGGCTCACGCCTATAATCCCACCGCTTTGGGAGGTCAAGGCAGGGGGATCACCTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACCCATCTCTACTAAAAATACAAAAATTAGCCACGTGTGG  ,,,,,7,,<,,,(,,,,(,(,,,(7,<FAFKAA<,,,,,,AA7,<(,,,,,,,<(,<A,,,,,,(77,(,,AAKAFKFAA<KF<,KF7AFKF,FKKFKFA7,AFKAKAA<,7,,FF77A,KA,KKKFKF<FKKKKF7<7A7F7,,,,A,<  NM:i:5  MD:Z:11T2C10A20T72G6    MC:Z:150M       AS:i:101        XS:i:76 RG:Z:1875-0_Blood
H3NGLALXX:5:7:3873948:0 99      2       25903858        60      150M    =       25904071        353     GAATGGTGGTGCACGCCTGTAATCTCAGCTACTCAGGAGGCTGAGGCAGGAGAATCACCTGAACCCAGGAGGTGGAGGTTGCAGTGGGCCGAGACTGTACCACTGCACTCCAGCCTGGACAACAGAGTGAGACTCCGTCTCAAAAAAAAA  <AFFFAAFKK<,FKK<AKKFKFKKKKKKFKKKFFKKKKKFFFKKFF7A7FFA<FKKK7FKKKKKKKAAAKKKAAFA,AAFFF,AFAAKKFFAAFFKK<FKKFK7FKKKKKKFKKFFAF<F<KKKF<F<F<F<A77FAAKAKKK<AK<<FF  NM:i:1  MD:Z:66G83      MC:Z:10S140M    AS:i:145        XS:i:78 RG:Z:1875-0_Blood
H3NGLALXX:5:7:3873948:0 147     2       25904071        60      10S140M =       25903858        -353    CGAGGCTTGGGGCTCACGCCTCGAATCCCGCCACTTTGGGAGGTCAAGGCAGGTGGATCACCGGAGGTGAGGAGTTCTAGACCAGCCTGGCCAACCGATCTCTGCTAAAAATACAAAAATTAGCCAGGTGTGGTGGCATGCGCCTGTAGT  ,,7<,,,(((,A<,,,(,,,,,,,,,<7(,A7,<,,,FFAAFAKA,,<A,F7<7<7A<AA<<,,,<,,,AFA<FFA,,7A,7FAA<(7FFA7,7A7,,7KKA<,FF,<7,F<<AAK7FKA7AF7FFAA,KKKKKKKF7((7A<A,7,,,,  NM:i:8  MD:Z:8T3T6A32T5C8A18C6A46       MC:Z:150M       AS:i:100        XS:i:64 RG:Z:1875-0_Blood
H3NGLALXX:1:23:4404940:0        83      2       25904072        60      150M    =       25903867        -355    GCTCACGTCTCTAATCCCACCACTTTGGGAGGTCAAGGCAGGTGGATCACCTGAGGTCAGGAGTTCAAGACCAGCCTGGCCAACCCATCTCTACTAAAAATACAAAAATTAGCCAGGTGTGGTGGCATGCGCCTGTAGTCCTAGCTACTC  KKKFA<FAFFFKAKAFKKKKF<FK<<<KKKAKKKFKFKKKKKF<AKK<AKKKKKKKKKKKKKKKKKKKK<KKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKFKKKKKKKKKAKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK

Please note, that the second read of H3NGLALXX:1:23:4404940:0 is missing. You probably ask yourself how the second read looks like. Here the read plus the read before and after.

@H3NGLALXX:5:7:3873948:0/2
ACTACAGGCGCATGCCACCACACCTGGCTAATTTTTGTATTTTTAGCAGAGATCGGTTGGCCAGGCTGGTCTAGAACTCCTCACCTCCGGTGATCCACCTGCCTTGACCTCCCAAAGTGGCGGGATTCGAGGCGTGAGCCCCAAGCCTCG
+
,,,,7,A<A7((7FKKKKKKK,AAFF7FA7AKF7KAA<<F,7<,FF,<AKK7,,7A7,7AFF7(<AAF7,A7,,AFF<AFA,,,<,,,<<AA<A7<7<7F,A<,,AKAFAAFF,,,<,7A,(7<,,,,,,,,,(,,,<A,(((,,,<7,,
@H3NGLALXX:1:23:4404940:0/2
TGCACGCCTGTAATCTCAGCTACTCAGGAGGCTGAGGCAGGAGAATCACCTGAACCCAGGAGGTGGAGGTTGCAGTGGGCCGAGACTGTACCACTGCACTCCAGCCTGGACAACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAA
+
AAFFFKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKFKKFKKKKKKKKKKKKKKF<FAFKKKKKKKKKKKFK<FKFKKFKAF<FKAFFKKKAAAFAFAFFK,,7A,AA<<77<,<A77F7A7A
@H3NGLALXX:4:2:1918254:0/2
TTAGCCGGGAATGGTGGTGCACGCCTGTAATCTCAGCTACTCAGGAGGCTGAGGCAGGAGAATCACCTGAACCCAGGAGGTGGAGGTTGCAGTGGGCCGAGACTGTACCACTGCACTCCAGCCTGGACAACAGAGTGAGACTCCGTCTCA
+
A<,FF7FFK7AFKKKKKKKKFKKFFKKKK7FFAKKKKKKKKA7FKFKKKKKKKKKKKKKK7AAFAKKKFKKFKKFF<AAFAKKFFK,AFFAF7AAKFFFFFF<<KFKK<FA<KF7<AFFKKKF<<7KFK<KKKFK<AAFKK,AAKA7,A,

christopher-schroeder avatar Aug 11 '20 20:08 christopher-schroeder

Can you run with -v 4:

   -v INT        verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

The last read of the sam.gz file is missing the AUX tags, a possible place to investigate for the devs.

keiranmraine avatar Aug 12 '20 09:08 keiranmraine

(/dev/null for bwa stdout)

keiranmraine avatar Aug 12 '20 09:08 keiranmraine

Running, but it will take a lot of time. The debug mode is very slow.

christopher-schroeder avatar Aug 12 '20 09:08 christopher-schroeder

Another way you can check for undefined behaviour is to compile with -fsanitize=address. This is a bit faster but may catch if the issue is caused by a bad memory read, you can breakpoint on __asan_on_error@plt to catch the error before it writes over your stack. The final alternative is valgrind which should work on sse41 but is rather slow.

mp15 avatar Aug 12 '20 14:08 mp15

I dont even know what that means OO

christopher-schroeder avatar Aug 12 '20 16:08 christopher-schroeder

here is tail -n from the debug log 1875-0_Blood.v4.tail.log

christopher-schroeder avatar Aug 14 '20 09:08 christopher-schroeder

@christopher-schroeder are you still facing the problem? What is the memory capacity of this system? can you try and check with smaller number of threads?

yuk12 avatar Aug 26 '20 14:08 yuk12

yes, the error remains. I had no choice and switched back to bwa mem for the project and everything was mapped and downstream analysis are currently running. The memory capacity of the system is 756gb, that should be no issue. Sure, what number do you have in mind @yuk12? 44?

christopher-schroeder avatar Aug 31 '20 16:08 christopher-schroeder

The available memory is sufficient to run mem2. You can start with 16 threads. The log shows that it is hanged in the seeding stage. Also, you can try the latest commit, where index size is reduced significantly, and see.

yuk12 avatar Sep 09 '20 02:09 yuk12

I tried 16 cores with the newest commit and it worked. I increase it to 32 now.

christopher-schroeder avatar Sep 12 '20 21:09 christopher-schroeder

with 32 threads a get the following error:

bwa-mem2: src/bwamem.cpp:904: int mem_kernel1_core(FMI_search*, const mem_opt_t*, bseq1_t*, int, mem_chain_v*, mem_seed_t*, int64_t, mem_cache*, int): Assertion `num_smem < *wsize_mem' failed.

christopher-schroeder avatar Sep 13 '20 20:09 christopher-schroeder

@christopher-schroeder Do you have any updates for this issue? I am also running into the same problem.

skchronicles avatar Sep 07 '22 14:09 skchronicles

I dont know the technical background but I now used -K 500000000 as parameter for a few month on 100+ whole genomes and never got this error again. So I guess this is at least a workaround. @skchronicles

christopher-schroeder avatar Sep 07 '22 14:09 christopher-schroeder

@christopher-schroeder Thank you for the quick reply! I appreciate. I will let you know if this also fixes our issue too.

skchronicles avatar Sep 07 '22 15:09 skchronicles

@skchronicles Did it work?

christopher-schroeder avatar Oct 05 '22 15:10 christopher-schroeder