duplex-tools icon indicating copy to clipboard operation
duplex-tools copied to clipboard

issue with split_on_adapter output

Open itslittman opened this issue 2 years ago • 17 comments

I successfully ran split_on_adapter with the --allow-multiple-splits option and it split a bunch of reads for me. When converting the new mapped SAM file to BAM format, it wrote about 570mb successfully and then failed with this error:

[E::aux_parse] B aux field type not followed by ',' [W::sam_read1_sam] Parse error at line 413764 samtools view: error reading file "sample_name.mapped.sam"

My workflow:

  1. basecall with Dorado (400bps DNA HAC w/5mC and 5hmC)
  2. convert unmapped SAM to FASTQ with samtools fastq -T '*'
  3. split_on_adapter with --allow-multiple-splits option
  4. map with minimap2
  5. convert output to BAM

This is the offending line:

789cdf07-b995-461e-84fa-bf4ef4c62f1e_1 16 chr1 151237868 60 157M1D3M1D73M35S * 0 0 ATTCCCAAAATTGCTGAGTAGTGGCAATTTTAGATTCTCTTTGGTGGAATCAGAGTGGAAGAGGTAGGCAAGAAGATTTGGAGAAAACTAGATTATAATACATACTGTAGAGAGTTCCTGGGGTTAGAGGAAGGATCTCATTTTCTCCTGTTTTTTTATGATTTTTTTCTCTTTTTGTTTTCTTGATCACTTATTATCTGACCTTCTGGTTTATGGAGGATGAGGCAGTTATGAGCAATATGATGGAACCAGGTACTAACATAAACAG @@>;>9:::A?=-,**///44564455E7200,,.02056;<>A==><=>==;;88@;?=<;;<;;<=BB@<<<=CB@=844..--,--55;<=D<<;<>=?>>>>@ACS>>?>;9:=GHEBB42<==@?<;;<?CBCEHKJ@=:::;;<=>?<70.--.<;<DGJ8HB=ACB>;97732(&&&(<;<:9;<>9568899<@=D>===?CBA76<;;;<@>><<99:90//)()+.-'&&%%&&+'''&%%'%##$$'&%%&$##" NM:i:2 ms:i:454 AS:i:454 nn:i:0 tp:A:P cm:i:40 s1:i:216 s2:i:0 de:f:0.0085 rl:i:15 NM:i:4 ms:i:1108 AS:i:1108 nn:i:0 tp:A:P cm:i:84 s1:i:487 s2:i:0 de:f:0.007 rl:i:114 qs:i:15 du:f:1.43725 ns:i:5749 ts:i:10 mx:i:2 ch:i:349 st:Z:2023-04-14T03:12:29.59+00:00 rn:i:25470 f5:Z:FAW60540_pass_9382fa45_a155afd1_502.pod5 sm:f:91.767 sd:f:26.724 sv:Z:quantile MM:Z:C+h?;C+m?; ML:B:C 0->268

Any ideas?

Noah

itslittman avatar Apr 27 '23 23:04 itslittman

Okay, so there was a smaller truncated pod5 written by the device (a software crash occurred during this run) that I included in the above sample data. I thought maybe that was the issue, so I omitted it and made a new FASTQ. I repeated everything I did above but got an extremely similar error:

[E::aux_parse] Incomplete aux field [W::sam_read1_sam] Parse error at line 197 samtools view: error reading file "new_sample.sam"

The offending line is:

125659e2-2885-4568-b285-f3304db0097e 16 chr4 68049456 60 16S119M1I32M1I93M1I52M3I130M1I70M1D5M3D70M41S * 0 0 CACCATTGTGATTTTACTTCTGTTTCCTTAGGTATAGTTGAATATGGTTTTAAGTAGGTTGTTGTCTTAATCATAAAACTGGTTTGGTCCAACAATATACTTTGCAGAACACCATGTAAGTTTAGCTTACTTCACTTTTTTTTTTTTTTTTGGCATCCAATTGATGTGAATATAAGATCTTGCAATTCTTATTATTCAATTACAGAAATTAAGCATAATTTACTTACTTAGGGCTTTGAGGATTCTTGGTCTGATTTAACCTGAAGTTTCTAGTTTAATATCTAAAATATTGGTGAGATTGGCAGGAATAGGGCTGAAGAGGTAATAATGAATGGTGGTAGGAGGAGGATCTGTTTTATGTTTTAAAATTACAATTTTAATGTTATTAGTTGTTGGGAACATCATTAGTGATGTGGAATAAATTAGATGTATATAAAAGTACAGGTTGTAGAAGAGCTATGATAGCAATGAATGTAAGGATAATGAGGCTATTATTTTTTGTTATTTCTTGAATCATAGTCATTGTAAAAATCCTGTTAGCAGGGGCAGTTCTCCTAAGGATAGTAGAATAACAAGGATTATAGATGTTAGTGAACAATGTTTGAGTTGTTCAAGGTATAGGCATAACACCAGAG %&&'(%%&(('''),0667>CE:9<:;2176798&&&'(.1299<>ED@?@A>;;;<;;;;==??CA==<=@@B;;:2385::953332;>AA@D:::?@,,..005558763333559>:6:::<;;:;322;=?CFDEJGDCBB::64777989>@@?@FC@=>?@FDDBA>>?@JNIFEEC>CDCC=?888;CIBBBIFHF????JKOFD6=?;77-,,,.//0756612888===>?AB>334*)+*-/117677897767;>>>DEEL988;?>?A@<8877('''3)'&('&#$'+++,--/4>>>=@=8777:==>=>=;92108==>*)))3++,++++-;<?@ACC??@@CBA512/'''((##$%'%$$&$%&%$%%))'%$%$%&$&'&)&$###$"#" NM:i:17 ms:i:1060 AS:i:1056 nn:i:0 tp:A:P cm:i:80 s1:i:477 s2:i:0 de:f:0.0225 rl:i:15 None

I would note that the unsplit FASTQ produced no errors whatsoever after and looked perfect in IGV (other than all the duplex reads I'm trying to separate).

itslittman avatar Apr 28 '23 00:04 itslittman

Hi @itslittman, thanks for the question.

To me it looks like the trailing None is quite suspicious, it would be great if it's possible to see if that's part of the original (unsplit) file. I'm noting that 125659e2-2885-4568-b285-f3304db0097e doesn't have a trailing _1 or _2 in it, so it shouldn't have been split.

Cheers

ollenordesjo avatar Apr 28 '23 08:04 ollenordesjo

@ollenordesjo The first offending line doesn't have a None at the end and does have a '_1' in the readname - does that one have anything else that looks fishy or could the modified base tags be causing an issue? Should I run split with the 'debug_output' option?

Noah

itslittman avatar Apr 29 '23 04:04 itslittman

@ollenordesjo I pulled the line of the unsplit read from the mapped SAM file (the one I generated before realizing there were a lot of concatenated reads):

125659e2-2885-4568-b285-f3304db0097e 16 chr4 68049456 60 16S119M1I32M1I93M1I52M3I130M1I70M1D5M3D70M41S * 0 0 CACCATTGTGATTTTACTTCTGTTTCCTTAGGTATAGTTGAATATGGTTTTAAGTAGGTTGTTGTCTTAATCATAAAACTGGTTTGGTCCAACAATATACTTTGCAGAACACCATGTAAGTTTAGCTTACTTCACTTTTTTTTTTTTTTTTGGCATCCAATTGATGTGAATATAAGATCTTGCAATTCTTATTATTCAATTACAGAAATTAAGCATAATTTACTTACTTAGGGCTTTGAGGATTCTTGGTCTGATTTAACCTGAAGTTTCTAGTTTAATATCTAAAATATTGGTGAGATTGGCAGGAATAGGGCTGAAGAGGTAATAATGAATGGTGGTAGGAGGAGGATCTGTTTTATGTTTTAAAATTACAATTTTAATGTTATTAGTTGTTGGGAACATCATTAGTGATGTGGAATAAATTAGATGTATATAAAAGTACAGGTTGTAGAAGAGCTATGATAGCAATGAATGTAAGGATAATGAGGCTATTATTTTTTGTTATTTCTTGAATCATAGTCATTGTAAAAATCCTGTTAGCAGGGGCAGTTCTCCTAAGGATAGTAGAATAACAAGGATTATAGATGTTAGTGAACAATGTTTGAGTTGTTCAAGGTATAGGCATAACACCAGAG %&&'(%%&(('''),0667>CE:9<:;2176798&&&'(.1299<>ED@?@A>;;;<;;;;==??CA==<=@@B;;:2385::953332;>AA@D:::?@,,..005558763333559>:6:::<;;:;322;=?CFDEJGDCBB::64777989>@@?@FC@=>?@FDDBA>>?@JNIFEEC>CDCC=?888;CIBBBIFHF????JKOFD6=?;77-,,,.//0756612888===>?AB>334*)+*-/117677897767;>>>DEEL988;?>?A@<8877('''3)'&('&#$'+++,--/4>>>=@=8777:==>=>=;92108==>*)))3++,++++-;<?@ACC??@@CBA512/'''((##$%'%$$&$%&%$%%))'%$%$%&$&'&)&$###$"#" NM:i:17 ms:i:1060 AS:i:1056 nn:i:0 tp:A:P cm:i:80 s1:i:477 s2:i:0 de:f:0.0225 rl:i:15 qs:i:12 du:f:1.24775 ns:i:4991 ts:i:10 mx:i:2 ch:i:8 st:Z:2023-04-12T23:10:51.660+00:00 rn:i:12959 f5:Z:FAW60540_pass_a6ad0ce3_11d41050_402.pod5 sm:f:92.951 sd:f:27.937 sv:Z:quantile MM:Z:C+h?;C+m?; ML:B:C

itslittman avatar Apr 30 '23 03:04 itslittman

Hi @itslittman,

Yes, it does indeed look like some tags have been removed: qs:i:12 du:f:1.24775 ns:i:4991 ts:i:10 mx:i:2 ch:i:8 st:Z:2023-04-12T23:10:51.660+00:00 rn:i:12959 f5:Z:FAW60540_pass_a6ad0ce3_11d41050_402.pod5 sm:f:92.951 sd:f:27.937 sv:Z:quantile MM:Z:C+h?;C+m?; ML:B:C

Having a closer look, it looks like adding the span of bases from which the read was extracted may have caused a problem:

https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L177

We add the {start}->{end} information after the comments, so it could be the case that attempting to parse this as a tag doesn't work. I think this is the issue.

If you either remove this data from the files, and then proceed with your steps 4 and 5 from your original comment, I think you might get it working. In that case, I'll put a TODO to either move or remove this additional metadata in the fastq comments.

Cheers!

ollenordesjo avatar May 02 '23 09:05 ollenordesjo

@ollenordesjo how do I remove this data?

itslittman avatar May 02 '23 17:05 itslittman

@ollenordesjo also how are other people getting good results without doing this if the start/end comment is automatically written?

Noah

itslittman avatar May 08 '23 15:05 itslittman

Hi @itslittman, sorry for late reply.

The easiest way to remove it from files you already have is probably this (which will remove -> from the sam file)

sed 's/ [0-9]\+->[0-9]\+$//' yoursam.sam

I think it's likely that people have not hit this issue if they haven't passed the tags from the usam through to mapping.

Would it be possible to share any flags you used for minimap for the workflow below that you shared? I'm guessing you're using -y, but would be great to know the exact process so we're not missing any details when writing the tests

basecall with Dorado (400bps DNA HAC w/5mC and 5hmC)
convert unmapped SAM to FASTQ with samtools fastq -T '*'
split_on_adapter with --allow-multiple-splits option
map with minimap2
convert output to BAM

ollenordesjo avatar May 09 '23 08:05 ollenordesjo

@ollenordesjo okay thanks i'll try that. Also the only other flag I'm using with minimap2 is -Y, so that everything is soft clipped (no hard clipping). Makes it easier for me to BLAT stuff (I'm dealing with fusion genes in leukemia samples).

Noah

itslittman avatar May 11 '23 22:05 itslittman

@ollenordesjo the sed command did not work though - same error.

Edit: I ran split with the non-modified-basecalling files output during the original run and everything converted to BAM fine. So it seems like the -y flag is to blame. Yet removing the comments the way you suggested didn't work - could it be MM and ML tags themselves that are causing the problem somehow? Even though they don't cause a problem when reads are not split?

Noah

itslittman avatar May 11 '23 23:05 itslittman

@ollenordesjo could my Samtools version be an issue?

itslittman avatar Jun 19 '23 16:06 itslittman

Hi @itslittman, sorry for responding earlier. Not sure if the samtools version could be the issue. Which version have you tried? If you're able to upload the sam that is showing the issue I'm happy to see if I can spot the problem.

ollenordesjo avatar Jun 20 '23 08:06 ollenordesjo

@ollenordesjo It's all good!

I'm using Samtools version 1.16.1 and htslib 1.16, which are from 2022 so pretty recent. How would I give you access to the SAM? It's like 30GB so even if I wanted to post it right on here I don't think I could. Would a dropbox link work?

itslittman avatar Jun 20 '23 21:06 itslittman

Yep, samtools version shouldn't be the problem then. If you send an email to me on olle[email protected] I can send you a link where you can upload it

ollenordesjo avatar Jun 21 '23 07:06 ollenordesjo

@ollenordesjo it's saying undeliverable to that address

itslittman avatar Jul 02 '23 14:07 itslittman

Ah, so sorry! It's olle [dot] [email protected]

ollenordesjo avatar Jul 07 '23 08:07 ollenordesjo

Hi @itslittman, received the sam and can confirm that it's possible to fix the offending lines with the following:

sed -E 's/ [0-9]+->[0-9]+$//g' split.head.sam > split.fixed.sam

Can you let me know if this fixes the problem?

Thanks!

Rasinj avatar Jul 31 '23 08:07 Rasinj