pairtools
pairtools copied to clipboard
single-end mode is broken in v1.1.0
I recently updated my conda installation of pairtools to v1.1.0 and it appears that the --single-end mode no longer works for single-end reads in pairtools parse2:
Comparing two versions:
pairtools 1.0.3 py310hb45ccb3_0 bioconda
pairtools 1.1.0 py312hac03d35_1 bioconda
Same command:
pairtools parse2 --single-end --nproc-in 12 --nproc-out 32 --min-mapq 1 --flip --add-columns mapq --assembly B73 -c ../ref/B73.chrom_sizes.tsv --output-stats test.tsv -o test.pairs test.input.bam
v1.0.3 gives the expected result, but v1.1.0 reports everything as unmapped:
v.1.10 results are entirely NN:
read1 ! 0 ! 0 - - NN 0 0
I just encountered the same problem. Furthermore, the parse2 --expand / --no-expand mode in pairtools 1.0.3 does not seem to perform combinatorial expansion and only outputs the unmodified pairtype. Maybe this could be relevant when looking into pairtools parse2.
hi folks! Could either of you share a sample .sam file with a few reads? This could be extremely useful in reproducing the issue!
I am having the same issue! Did you ever figure out what the problem was? If not, I am happy to provide a sample with a subset of the alignment I am using.
@Salvacasani That will be excellent! Please provide the command that you run on that sample file and a description of your expectations vs what you observe.
hi folks! Could either of you share a sample .sam file with a few reads? This could be extremely useful in reproducing the issue!
Unfortunately I'm working with private data, so unable to share anything.
Here it is, I don't know how many of these should have ligated junctions, but there should be a good bunch. If there are no examples with separate mapping sites in this subset, let me know the best way to get them. I am processing the data in the previous version of pairtools. I will let you know if that works. subsample.zip
Hi! Have you been able to look into it? I tried to run it with the previous version of pairtools, but I get the same result.
Hi @Salvacasani ,
You can check out the latest pull request here: https://github.com/open2c/pairtools/pull/251. It seems that some API changes didn’t fully align with the old functionality, which caused an issue with read side detection in certain datasets.
By the way, in your sample data, most of the reads are NU (unmapped-mapped) pairs, so I wasn’t able to fully test the pair expansion. However, pair expansion should be working now as well.
@Phlya, since you had questions about the read sides in your recent dataset, could you test if any part of your data produces the same results after this fix?
In theory, this issue should have affected both paired-end and single-end data, but it’s only appearing with single-end data, which is a bit puzzling. There might be another underlying issue that remains unresolved.
Thank you for looking into it. I am still getting very strange results. This is the code that I am using pairtools parse2 --min-mapq 20 --single-end --max-inter-align-gap 30 --nproc-in 5 --nproc-out 5 --add-columns mapq,pos5,pos3 --chroms-path /seq/epiprod02/reference_genomes/hg38/juicertools/hg38.genome -o subsampled_sorted_chr8.pairs.gz subsampled_sorted_chr8.bam practically all the reads look like NN pairs, which is strange based on the bam file itself.
To use one specific example. I looked into the results of read 150627-BC36-0251488282
Which seems to have two mapping sites in the bam file: 150627-BC36-0251488282 0 chr8 199436 0 149M61S * 0 0 AGACATAATATTATAATTCAAATGCAACTCATGGCTTCTCTTTGTACTCTTTCTCTAGCTTTTGAATTATTTATTCTAATACCAGTTTTAATTCTGACACAAAATCATGGGAGTTCTAATCAAAATCCAACCTTTTATCATAAAAACTAGCTGTCACTATTGTTAGCGGTGATGGGGCCCTCGGGGACTTTGGAGGGAAGCTCAGGCTGG 48IIII77IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;IIIIIIIII9III1II1IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<IIIIII:I:IIIIIIIIIIIIIIIIIIIII;II;IIIIIIIIIIIIIIIIIIIIIIIFFI@@III**IIII/==/DIDII,II,IIFIF77;I>998IIIIII5I,, NM:i:0 MD:Z:149 AS:i:149 XS:i:149 SA:Z:chr8,635368,-,63M147S,43,5; MQ:i:43 ip:i:199584 mp:i:635430 ep:i:199436 rt:A:0 cb:Z:chr8_chr8_0_1_0_0_000199584_000635430 RG:Z:HIC13907 150627-BC36-0251488282 272 chr8 635368 43 63M147H * 0 0 CCAGCCTGAGCTTCCCTCCAAAGTCCCCGAGGGCCCCATCACCGCTAACAATAGTGACAGCTA ,,I5IIIIII899>I>;77FIFII,II,IIDID/==/IIII**III@@IFFIIIIIIIIIIII NM:i:5 MD:Z:29C0A2G1T7A19 AS:i:38 XS:i:0 SA:Z:chr8,199436,+,149M61S,0,0; MQ:i:0 ip:i:635430 mp:i:199584 ep:i:635368 rt:A:1 cb:Z:chr8_chr8_0_1_0_0_000199584_000635430 RG:Z:HIC13907
but in the pairs file it shows as unmapped:
150627-BC36-0251488282 ! 0 ! 0 - - NN 0 0 0 0 0 0 150627-BC36-0251488282 ! 0 ! 0 - - NN 0 0 0 0 0 0
Do you know why this may be?
Thank you
Hi @Salvacasani,
Hmmm, can you try with pairtools-v1.1.0-fix branch of pairtools?
Basically, you can install that version with:
git clone https://github.com/open2c/pairtools
cd pairtools
git checkout checkout pairtools-v1.1.0-fix
pip install -e ./
With that version in progress, I receive "correct" results with your command:
pairtools parse2 --min-mapq 20 --single-end --max-inter-align-gap 30 --nproc-in 5 --nproc-out 5 --add-columns mapq,pos5,pos3 -c hg38.chrom.sizes.reduced ./subsampled2_sorted_chr8.bam | grep "150627-BC36-0251488282"
Output two pairs, both have some kind of issues (multi-mapper and unmapped segment):
150627-BC36-0251488282 ! 0 ! 0 - - MN 150627-BC36-02514882820chr81994360149M61S*00AGACATAATATTATAATTCAAATGCAACTCATGGCTTCTCTTTGTACTCTTTCTCTAGCTTTTGAATTATTTATTCTAATACCAGTTTTAATTCTGACACAAAATCATGGGAGTTCTAATCAAAATCCAACCTTTTATCATAAAAACTAGCTGTCACTATTGTTAGCGGTGATGGGGCCCTCGGGGACTTTGGAGGGAAGCTCAGGCTGG48IIII77IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;IIIIIIIII9III1II1IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII<II<IIII:I:IIIIIIIIIIIIIIIIIIIII;II;IIIIIIIIIIIIIIIIIIIIIIIFFI@@III**IIII/==/DIDII,II,IIFIF77;>I>998IIIIII5I,,NM:i:0MD:Z:149AS:i:149XS:i:149SA:Z:chr8,635368,-,63M147S,43,5;MQ:i:43ip:i:199584mp:i:635430ep:i:199436rt:A:0cb:Z:chr8_chr8_0_1_0_0_000199584_000635430RG:Z:HIC13907Yt:Z:MN 0 0 0 0 0 0
150627-BC36-0251488282 ! 0 chr8 635368 - + NU 150627-BC36-0251488282272chr86353684363M147H*00CCAGCCTGAGCTTCCCTCCAAAGTCCCCGAGGGCCCCATCACCGCTAACAATAGTGACAGCTA,,I5IIIIII899>I>;77FIFII,II,IIDID/==/IIII**III@@IFFIIIIIIIIIIIINM:i:5MD:Z:29C0A2G1T7A19AS:i:38XS:i:0SA:Z:chr8,199436,+,149M61S,0,0;MQ:i:0ip:i:635430mp:i:199584ep:i:635368rt:A:1cb:Z:chr8_chr8_0_1_0_0_000199584_000635430RG:Z:HIC13907Yt:Z:NU 0 43 0 635368 0 635430
Is that what you would expect for that read?
Thanks I can reproduce this result. Since there is a multimaper (not sure why since I don't find it in the alignment) maybe this is not the best example.
If we look at this other example 180786-UGAv3-23-4079533886
This is the bam file:
180786-UGAv3-23-4079533886 0 chr8 310577 42 22S162M57S * 0 0 CTACACGACGCTCTTCCGATCTACTATACCTTCATTCATTAATGTGTCATTTCTTTCAGGAACTATTTTCTGAGTCTCAAACATATTTCATAGCACTCTCAAGCTTGAGGTTCTGCCTGAACATGCTCCTCCTCCTTTTCTTTTGTGACTCTTCATTTCGTATGGAGTTAAACCTGAGCTCCTAAGCTTTGTCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG 88IIIIIIIIIIIIIIIIIIII;II7IIIIIIIIIIIIIIIIIII:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICCIIIIII@DD@I9==9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7I7I"III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4 NM:i:0 MD:Z:162 AS:i:162 XS:i:142 SA:Z:chr8,312049,+,191S37M1D13M,48,2; XA:Z:chr1,-848200,57S162M22S,4; MQ:i:48 ip:i:310738 mp:i:312049 ep:i:310555 rt:A:0 cb:Z:chr8_chr8_0_1_0_16_000310738_000312049 RG:Z:HIC13904 180786-UGAv3-23-4079533886 256 chr8 312049 48 191H37M1D13M * 0 0 TCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG "III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4 NM:i:2 MD:Z:37^T4G8 AS:i:38 XS:i:21 SA:Z:chr8,310577,+,22S162M57S,42,0; MQ:i:42 ip:i:312049 mp:i:310738 ep:i:312099 rt:A:1 cb:Z:chr8_chr8_0_1_0_16_000310738_000312049 RG:Z:HIC13904
this is the pairtools result:
180786-UGAv3-23-4079533886 chr8 310577 ! 0 + - UN 180786-UGAv3-23-40795338860chr83105774222S162M57S00CTACACGACGCTCTTCCGATCTACTATACCTTCATTCATTAATGTGTCATTTCTTTCAGGAACTATTTTCTGAGTCTCAAACATATTTCATAGCACTCTCAAGCTTGAGGTTCTGCCTGAACATGCTCCTCCTCCTTTTCTTTTGTGACTCTTCATTTCGTATGGAGTTAAACCTGAGCTCCTAAGCTTTGTCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG88IIIIIIIIIIIIIIIIIIII;II7IIIIIIIIIIIIIIIIIII:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICCIIIIII@DD@I9==9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7I7I"III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4NM:i:0MD:Z:162AS:i:162XS:i:142SA:Z:chr8,312049,+,191S37M1D13M,48,2;XA:Z:chr1,-848200,57S162M22S,4;MQ:i:48ip:i:310738mp:i:312049ep:i:310555rt:A:0cb:Z:chr8_chr8_0_1_0_16_000310738_000312049RG:Z:HIC13904Yt:Z:UN 42 0 310577 0 310738 0 180786-UGAv3-23-4079533886 ! 0 chr8 312099 - - NU 180786-UGAv3-23-4079533886256chr831204948191H37M1D13M00TCTGTTTCCACTTTTGTCAGATTCGATGGCAACAAGATGAATGCGCCCTG"III?I?==II'88'&-IIII11II,I++1//I++86,$99I6#I&?&I4NM:i:2MD:Z:37^T4G8AS:i:38XS:i:21SA:Z:chr8,310577,+,22S162M57S,42,0;MQ:i:42ip:i:312049mp:i:310738ep:i:312099rt:A:1cb:Z:chr8_chr8_0_1_0_16_000310738_000312049RG:Z:HIC13904Yt:Z:NU 0 48 0 312099 0 312049
Maybe I misunderstand the alignment, but isn't it indicating that these two are chimeric alignments from the same read? If that is the case, shouldn't these be pairs and be in the same line of the pairs file as linked fragments?
Thank you!
Salva
@Salvacasani , that happens because your bam file is not sorted by read ID, and pairtools parse relies on parts of the same read arranged next to each other. Simple reordering of your .bam file with:
samtools sort -n subsampled2_sorted_chr8.bam > subsampled2_re-sorted_chr8.bam
produces these three nice pairs for the read 180786-UGAv3-23-4079533886:
parse2 --no-flip --drop-sam --drop-seq --single-end --max-inter-align-gap 20 --nproc-in 5 --nproc-out 5 --add-columns mapq -c hg38.chrom.sizes.reduced ./subsampled2_re-sorted_chr8.bam
180786-UGAv3-23-4079533886 ! 0 chr8 310738 - - NU 0 42
180786-UGAv3-23-4079533886 chr8 310577 chr8 312099 + - UU 42 48
I thought that we mention in the manual that the input alignments shall be sorted by the read ID, but I cannot find any mention of that. Attention @golobor , it might be worthy to add it.
Gotcha, Thank you so much for the help!!
I tried it with the new version pairtools-v1.1.0-fix and it still does not work for me.
This is the command I used:
pairtools parse2 --output-stats file.pairs.stats -c GRCh38.rg.fa.fai --single-end --readid-transform 'readID.split(":")[0]' --drop-sam --drop-seq --add-pair-index --add-columns mapq,pos5,pos3,cigar,read_len,matched_bp,algn_ref_span,algn_read_span,dist_to_5,dist_to_3,mismatches file.bam > file.pairs
Output example for pairtools-v1.1.0-fix
62716bd6-e872-4f1c-a787-00906707b280 chr3 38712540 ! 0 + - UN 1 R1 60 0 38712540 0 38713313 0 1S140M1I4M1I76M1D240M3D86M1I16M2I1M1I2M1I2M1D148M2D52M * 775 0 767 0 774 0 774 0 1 0 0 0
62716bd6-e872-4f1c-a787-00906707b280 chr3 38711912 ! 0 - - UN 1 R1 11 0 38711912 0 38711850 0 52M1D10M * 62 0 62 0 63 0 62 0 0 0 0 0
62716bd6-e872-4f1c-a787-00906707b280 chr3 38711849 ! 0 - - UN 1 R1 60 0 38711849 0 38711626 0 155M1I16M1D3M1I49M * 225 0 223 0 224 0 225 0 0 0 0 0
62716bd6-e872-4f1c-a787-00906707b280 chr3 38711625 ! 0 - - UN 1 R1 27 0 38711625 0 38711538 0 88M * 88 0 88 0 88 0 88 0 0 0 0 0
Output example for pairtools 1.0.3
62716bd6-e872-4f1c-a787-00906707b280 chr3 38711912 chr3 38711626 - + UU 1 R1 11 60 38711912 38711626 38711850 38711849 52M1D10M 155M1I16M1D3M1I49M 62 225 62 223 63 224 62 225 0 0 0 0
62716bd6-e872-4f1c-a787-00906707b280 chr3 38711849 chr3 38711538 - + UU 2 R1 60 27 38711849 38711538 38711626 38711625 155M1I16M1D3M1I49M 88M 225 88 223 88 224 88 225 88 0 0 0 0
62716bd6-e872-4f1c-a787-00906707b280 chr3 38711625 chr3 38712008 - + UU 3 R1 27 60 38711625 38712008 38711538 38712146 88M 86M1I53M 88 140 88 139 88 139 88 140 0 0 0 0
62716bd6-e872-4f1c-a787-00906707b280 chr3 38712146 chr3 38710966 - - UU 4 R1 60 60 38712146 38710966 38712008 38710569 86M1I53M 137M1I110M1I151M 140 400 139 398 139 398 140 400 0 0 0 0
I can as well provide a small subset bam file. Just tell me where I should send it.
@Henrikkoe , thanks! That might be due to readID transformation. I believe I have brought it now to the working state in the same branch: https://github.com/open2c/pairtools/pull/251
Do you mind pulling the most recent version of the branch and checking the output?
If the problem persists, you can run the following:
samtools view -h file.bam | head -n 1000 | samtools view -bS - > sample.bam
and post the file either through zip attachements to github messages or though OSF for further debug.
Thanks, this seems to be working now.
I have a follow up question to the --single-end mode but I can also shift this into a new issue. But this might be also related.
Could I use the --single-end command together with --expand? Because in combination the output is the same to --single-end only without any combinatorial expansion and using only--expand ends up in a weird combination and small number of interactions.
--single-end only or --single-end + --expand
b5bc7155-a245-4aeb-9c79-d231e824fcaa chr8 81338701 chr8 81387024 + + UU 20 R1 60 60 81338701 81387024 81339221 81387212 17S102M1D23M1D20M3D5M1D17M1I5M1D57M1D35M1I2M1D13M1I1M2I2M1D34M3D20M1D5M1D6M1D38M1D1M1D95M3D2M1D16M 14M1I28M1I47M1D2M1D96M43S 521 232 499 187 521 189 504 189 17 43 0 0
b5bc7155-a245-4aeb-9c79-d231e824fcaa chr8 81387212 chr8 81384871 - + UU 21 R1 60 60 81387212 81384871 81387024 81385252 14M1I28M1I47M1D2M1D96M43S 160M1D114M3D104M94S 232 472 187 378 189 382 189 378 43 94 0 0
b5bc7155-a245-4aeb-9c79-d231e824fcaa chr8 81385252 chr8 81623567 - - UU 22 R1 60 60 81385252 81623567 81384871 81623350 160M1D114M3D104M94S 348S63M1D154M1S 472 566 378 217 382 218 378 217 94 348 0 1
--expand only
b5bc7155-a245-4aeb-9c79-d231e824fcaa ! 0 chr8 81623350 - + NU 1 R1-2 0 60 0 81623350 0 81623567 * 348S63M1D154M1S 0 566 0 217 0 218 0 217 0 348 0 1
b5bc7155-a245-4aeb-9c79-d231e824fcaa ! 0 chr8 81369851 - + NU 1 E1_R1-2 0 60 0 81369851 0 81370157 * 80M1I2M3D127M1D12M2D49M1D10M3D17M22S 0 320 0 297 0 307 0 298 0 0 0 22
b5bc7155-a245-4aeb-9c79-d231e824fcaa ! 0 chr8 81369851 + + MU 2 R2 0 60 0 81369851 0 81370157 75M1I93M 80M1I2M3D127M1D12M2D49M1D10M3D17M22S 169 320 168 297 168 307 169 298 0 0 0 22
@Henrikkoe Good catch! This issue was inherited from earlier versions of pairtools, so thanks for pointing it out. Now --single-end + --expand should work as expected. Based on your feedback, I've also added tests for single-end and expanded regimes. Let us know if you notice any unexpected behavior!
Closed as resolved!