methylpy icon indicating copy to clipboard operation
methylpy copied to clipboard

Low percentage of uniquely mapping read pairs

Open GatorShan opened this issue 4 years ago • 7 comments

Dear Yupeng,

Methylpy output showed a low rate of uniquely mapping read pairs:

There are 258651069 total input read pairs
Wed Dec 25 08:40:05 2019

There are 15891937 uniquely mapping read pairs, 6.14416057179 percent remaining
Wed Dec 25 08:40:05 2019

There are 13360495 non-clonal read pairs, 5.16545129763 percent remaining
Wed Dec 25 08:40:05 2019

For the same dataset, however, the bowtie2 mapping report showed a pretty high mapping rate:

258648925 reads; of these:
  258648925 (100.00%) were paired; of these:
    109280474 (42.25%) aligned concordantly 0 times
    26355304 (10.19%) aligned concordantly exactly 1 time
    123013147 (47.56%) aligned concordantly >1 times
57.75% overall alignment rate
258648925 reads; of these:
  258648925 (100.00%) were paired; of these:
    109452788 (42.32%) aligned concordantly 0 times
    26285006 (10.16%) aligned concordantly exactly 1 time
    122911131 (47.52%) aligned concordantly >1 times
57.68% overall alignment rate

I am curious about why would this happen. Could you please take a look? Does this dataset look bad? Please let me know if you need any other information.

Thanks and happy holidays!!

Shan

GatorShan avatar Dec 25 '19 18:12 GatorShan

Hey Shan, It looks like that you are working on a highly repetitive genome and/or your read length is short. Yes, the bowtie2 mapping rate is high but most of the mapped reads can be aligned to multiple locations. Such reads are considered as multi-mapped reads and will be discarded by methylpy. That would lead to the low number of reads being kept.

Not sure what the best way is to solve this issue. What genome are you working on and what are the lengths of your reads?

Yupeng

yupenghe avatar Dec 27 '19 05:12 yupenghe

Hi Yupeng,

Thanks!

We are working a non-model plant species: genus Tragopogon from sunflower family. Our read length is 150 bp and we sequenced MethylC-seq libraries on NovaSeq platform. In terms our reference genome, it is a draft genome -- which covers only 40% of the whole genome and should not include many repetitive sequences.

Any clue? Any suggestions are welcome!

Best,

Shan

GatorShan avatar Dec 27 '19 14:12 GatorShan

Hi Shan, You can use --keep-temp to save the intermediate files. There are two SAM files storing the alignment results to the forward reference and reversed reference separately. It may be helpful to check each of the SAM file for the reads that are mapped to multiple locations (which will not be kept by methylpy in the final BAM file).

Yupeng

yupenghe avatar Jan 02 '20 00:01 yupenghe

Hi Yupeng,

Thanks for your suggestions! Will do it now.

Best,

Shan

On Jan 1, 2020, at 7:36 PM, Yupeng He <[email protected]mailto:[email protected]> wrote:

[External Email]

Hi Shan, You can use --keep-temp to save the intermediate files. There are two SAM files storing the alignment results to the forward reference and reversed reference separately. It may be helpful to check each of the SAM file for the reads that are mapped to multiple locations (which will not be kept by methylpy in the final BAM file).

Yupeng

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_yupenghe_methylpy_issues_46-3Femail-5Fsource-3Dnotifications-26email-5Ftoken-3DAC24ZRKFBEPC5BGFFZD77IDQ3UZJ5A5CNFSM4J7GAUIKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH5P3NQ-23issuecomment-2D570097078&d=DwMCaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=VG6HY17bK8z6uQL0mtbD3JQlnoYkr6ljt6nr5ho4eZg&m=oH0s1taaZAeC9BPL25G85BWovBhVBhm-lf3GeivBERM&s=F2VCtifbptqOPJ-EfJtONbWfqt9AUr8QK99irtbVu3k&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AC24ZRI6GKDWSWFMVCYR4RLQ3UZJ5ANCNFSM4J7GAUIA&d=DwMCaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=VG6HY17bK8z6uQL0mtbD3JQlnoYkr6ljt6nr5ho4eZg&m=oH0s1taaZAeC9BPL25G85BWovBhVBhm-lf3GeivBERM&s=RiBLUyzbAd4YHuP2w9INVHKNY9CKCigC27b-j9x86yk&e=.

GatorShan avatar Jan 02 '20 02:01 GatorShan

Hi Yupeng,

Sorry for the late reply. I have used --keep-temp, and got two SAM files below:

269G Jan  3 01:59 T.pratensis_3058-4-10_libA_forward_strand_hits.sam
269G Jan  4 03:28 T.pratensis_3058-4-10_libA_reverse_strand_hits.sam

Could you please let me know how could I check their qualities?

Thanks!

Shan

GatorShan avatar Jan 14 '20 23:01 GatorShan

A few points to check:

  • For reads mapped to forward reference (or reverse), where are they mapped to and how many locations can they be mapped to?
  • How many reads can be uniquely mapped if we focus on one reference (either forward or reverse)? How many of these uniquely mapped reads can also be uniquely mapped on the other reference?

yupenghe avatar Jan 19 '20 06:01 yupenghe

Thanks Yupeng for your great help!

Shan

GatorShan avatar Jan 19 '20 20:01 GatorShan