Criteria for deciding whether a position is methylated
Hi,
I was analysing the data obtained using nanapore. I am looking for methylations in specific areas of the genome (6mA). I have used modkit, here is my code:
modkit pileup /mnt/c/Nanopore_Minion1kc/Barcode01_sorted_v4.bam output_V4_011.bed --filter-threshold A:0.8 --mod-thresholds a:0.9 --ref /mnt/c/Nanopore_Minion1kc/XXX.fasta --log-filepath /mnt/c/Nanopore_Minion1kc/modki_pileup.log
But when it comes to interpreting the data I feel a bit lost. What is the criterion to determine that a position is methylated? For example if out of 1500 reads 700 are methylated in a position, would that ratio be enough to consider that position methylated? Or is there any other indication to consider it methylated? Sorry if the question is a bit simple.
Best regards and thanks
Hello @angelluigi,
Your command looks fine. The algorithm in pileup only performs counting, so when you see a results such as 700 of 1500 reads reporting 6mA, that really just indicates that the empirical rate of 6mA at a site is 46.7%. But what that means depends on your experiment. From the table you've attached it looks like you have very high $N_{\text{valid coverage}}$ (see the note about this at the end of the message), so I would expect that the empirical estimate to be quite close to the true rate of 6mA in the sample. To continue with your example, intuitively I would not consider a position as methylated if just about half of the reads report methylation. However, if there were two haplotypes and one was methylated and one was not it would explain why you see about half reporting modification - so you have to look and see what's going on. What kind of sample is this? What is the mechanism of methylation in the sample?
A couple of things I would look for to sanity check my results:
- You should have an expectation of the coverage in your data. I would remove bedmethyl records with low $N_{\text{valid coverage}}$ and/or positions that are not
Ain the reference genome. - Follow up on sites reporting high 6mA rates by looking at them in a visualization tool such as IGV. If a site is erroneously being called as modified due to a sequencing artifact your eye can usually see it.
- If this is a bacterial sample, or a sample where you expect sequence-specific methylation, you could try
modkit find-motifs.
Quick note about your table As you can see in the schema, column 9 is the "color" and is always "255,0,0". It looks like you've parsed that column incorrectly. I would fix this first.
Happy to help if you give me a little more information about your experiment. Good luck.
Hi @ArtRand
Thank you very much for the answer. Let me tell you a bit more about the experiment. Our samples are different phages, control and mutants. Our hypothesis is that they have different methylation patterns. However, my feeling is that they have similar methylation patterns, according to my interpretation of the data.
As you suggest I think it is a good idea to filter out those with low coverage, in fact reading an article it says the following: "We then filtered the data to only include CpG sites that had a minimum coverage of 20x and maximum coverage of 200x."
While this is another type of modification, could a similar approach be taken? On the other hand, could an excess of coverage have any repercussions?
Thank you very much
Hello @angelluigi,
In general, there is no disadvantage to increasing coverage. However, beyond 200x is probably not necessary unless you think there are rare sub-populations in your data (again it comes back to what your biological question is). The potential exception is when calculating methylation entropy, for a discussion regarding that you may be interested in this thread.
@angelluigi you might be able to use a tool like methylartist https://github.com/adamewing/methylartist to plot methylation of each phage at a zoomed out level. This might be helpful to check your hypotheses.
Also modkit dmr might give you differentially methylated regions.