nanopolish
nanopolish copied to clipboard
call-methylation returns non-CG methylated site?
I ran the example in https://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html. However, I noticed that some reported coordinates were not pointed to C/G base and probably shift -1 base.
For example, in methylation_frequency.tsv
chr20 4988399 4988399 1 4 4 1.000 GAGTTCGAGAT
^^
but the mC in at 4988400bp
$ samtools faidx reference.fasta chr20:4988399-4988399
>chr20:4988399-4988399
T
$ samtools faidx reference.fasta chr20:4988395-4988405
>chr20:4988395-4988405
GAGTTCGAGAT
Another example,
$ grep '5005939' bisulfite* methylation_frequency.tsv
bisulfite.ENCFF835NTC.example.tsv:chr20 5005939 5005940 . 12 + 5005939 5005940 255,0,0 12 100
bisulfite_vs_nanopolish.tsv:chr20:5005939-5005939 20 1.0000 5 1.0000
methylation_frequency.tsv:chr20 5005939 5005939 1 5 5 1.000 GAGATCGCACC
$ samtools faidx reference.fasta chr20:5005939-5005939
>chr20:5005939-5005939
T
I roughly counted that there were ~16k sites had this sort of behavior. Did I interpret wrong?
Hi,
The output files from call-methylation are zero-based, not 1-based. This is for compatibility with bisulfite sequencing data, which is usually in the 0-based bed format. The example bisulfite data we provide in the tutorial (bisulfite.ENCFF835NTC.example.tsv) should align with our calls.
Jared
Thanks for the reply.
The first base should be [0,1]
in zero-based format, right? (ref)
As the second example, the same sites are
bisulfite.ENCFF835NTC.example.tsv
chr20 5005939 5005940 . 12 + 5005939 5005940 255,0,0 12 100
methylation_frequency.tsv
chr20 5005939 5005939 1 5 5 1.000 GAGATCGCACC
which is confusing.
Indeed, the end coordinates are encoded slightly differently. I will consider changing our end coordinate to match the BED output.
Thank you, this will be really helpful.