nanopolish icon indicating copy to clipboard operation
nanopolish copied to clipboard

call-methylation returns non-CG methylated site?

Open cgjosephlee opened this issue 5 years ago • 4 comments

I ran the example in https://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html. However, I noticed that some reported coordinates were not pointed to C/G base and probably shift -1 base.

For example, in methylation_frequency.tsv

chr20   4988399 4988399 1       4       4       1.000   GAGTTCGAGAT
                                                             ^^

but the mC in at 4988400bp

$ samtools faidx reference.fasta chr20:4988399-4988399
>chr20:4988399-4988399
T

$ samtools faidx reference.fasta chr20:4988395-4988405
>chr20:4988395-4988405
GAGTTCGAGAT

Another example,

$ grep '5005939' bisulfite* methylation_frequency.tsv
bisulfite.ENCFF835NTC.example.tsv:chr20 5005939 5005940 .       12      +       5005939 5005940 255,0,0 12      100
bisulfite_vs_nanopolish.tsv:chr20:5005939-5005939       20      1.0000  5       1.0000
methylation_frequency.tsv:chr20 5005939 5005939 1       5       5       1.000   GAGATCGCACC

$ samtools faidx reference.fasta chr20:5005939-5005939
>chr20:5005939-5005939
T

I roughly counted that there were ~16k sites had this sort of behavior. Did I interpret wrong?

cgjosephlee avatar Apr 18 '19 09:04 cgjosephlee

Hi,

The output files from call-methylation are zero-based, not 1-based. This is for compatibility with bisulfite sequencing data, which is usually in the 0-based bed format. The example bisulfite data we provide in the tutorial (bisulfite.ENCFF835NTC.example.tsv) should align with our calls.

Jared

jts avatar Apr 18 '19 13:04 jts

Thanks for the reply. The first base should be [0,1] in zero-based format, right? (ref) As the second example, the same sites are

bisulfite.ENCFF835NTC.example.tsv
chr20 5005939 5005940 .       12      +       5005939 5005940 255,0,0 12      100

methylation_frequency.tsv
chr20 5005939 5005939 1       5       5       1.000   GAGATCGCACC

which is confusing.

cgjosephlee avatar Apr 18 '19 14:04 cgjosephlee

Indeed, the end coordinates are encoded slightly differently. I will consider changing our end coordinate to match the BED output.

jts avatar Apr 18 '19 15:04 jts

Thank you, this will be really helpful.

cgjosephlee avatar Apr 18 '19 16:04 cgjosephlee