nanopolish icon indicating copy to clipboard operation
nanopolish copied to clipboard

Very high methylation frequencies from PromethION wgs run

Open macieksk opened this issue 4 years ago • 24 comments

Hi, yesterday we've run nanopolish call-methylation according to Live methylation calling instruction.

NCALLS=12
time parallel -u 'nanopolish call-methylation -v \
        --progress \
        -g "'"$REF"'" \
        -t "'"$THREADS"'" \
        --watch-write-bam \
        --watch-process-index="{}" --watch-process-total="'"$NCALLS"'" \
        --watch "'"$(realpath "$SEQRUNDIR")"'" \
        ' ::: $(seq 0 "$((NCALLS-1))") &

We've run this on a human WGS PromethION data with N50~=15900, median read length ~14500, there was ~29Gbases of data in fastq_pass folder.

    "flow_cell_product_code": "FLO-PRO002",
    "guppy_version": "3.2.8+bd67289",

However, the methylation results contain unusually high methylation frequencies:

> all.methylation.freq<-read.table("aggregated_methylation_frequency_all.tsv", as.is = TRUE,header=TRUE)
> summary(all.methylation.freq$methylated_frequency)                                                                                              
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.6670  0.8890  0.7925  1.0000  1.0000 

Comparison with ENCODE bisulfite sites, after the nanopolish tutorial, also show that our data is "overmethylated": image (edit: for this plot we've taken only spots with coverage >=5x, though the plot on all data looked very similar)

Could this be connected to lowering "the default methylation calling log-likelihood ratio threshold from 2.5 to 2.0"? Does nanopolish suppose to work well with PromethION data?

macieksk avatar Mar 10 '20 16:03 macieksk

Hi @macieksk,

That is strange! It should work on PromethION data without a problem. Can you confirm this is R9.4.1?

You can test whether it is a LLR issue by using -c 2.5 to calculate_methylation_frequency.py but I doubt it would have such a large effect.

Jared

jts avatar Mar 10 '20 17:03 jts

Hi @jts,

Yes, I confirmed this is R9.4.1 flowcell. Also indeed, it doesn't seem that increasing the threshold would change the results a lot:

> all.methylation.calls<-read.table("aggregated_methylation_calls_all.tsv.gz", as.is = TRUE,header=TRUE)
> mean(all.methylation.calls$log_lik_ratio>2.0)                                                                                                   
[1] 0.5301776
> mean(all.methylation.calls$log_lik_ratio>2.5)                                                                                                   
[1] 0.4818015

Hmm, it seems that we cannot currently exclude that this result is correct, specific for this patient. We will run another test on another sample.

macieksk avatar Mar 10 '20 17:03 macieksk

Ok, thanks for letting me know. IIRC NA12878 has somewhat low methylation compared to normal human samples.

Jared

jts avatar Mar 10 '20 17:03 jts

Hi,

Was this resolved?

Jared

jts avatar Mar 25 '20 19:03 jts

Hi,

I run the same analysis (nanopolish=0.12.0) on two more samples: one sequenced on MinION MIN-106 SQK-LSK108 (a patient with inborn disease), second one sequenced on PromethION 1.5 year ago, however recently rebasecalled_guppy3.2.10+aabd4ec_20200313 (a healthy patient), and unfortunately the results look the same wrong (the latter sample): image with similar (just a little better than previous) statistics:

> all.methylation.calls<-read.table(pipe("pzstd -cd aggregated_methylation_calls_all.tsv.zst"), as.is = TRUE,header=TRUE)                         
> mean(all.methylation.calls$log_lik_ratio>2.0)
[1] 0.4757151
> mean(all.methylation.calls$log_lik_ratio>2.5) 
[1] 0.4176991

Is there anything else I can check to understand what is going on?

Best, Maciek

macieksk avatar Apr 02 '20 13:04 macieksk

Do you have any samples that are behaving as expected, or are they all like this?

jts avatar Apr 02 '20 13:04 jts

Currently only these 3, we do not sequence human genomes that often, other samples were obtained in similar conditions.

macieksk avatar Apr 02 '20 14:04 macieksk

Are you using the ENCODE data from the nanopolish methylation tutorial?

jts avatar Apr 02 '20 15:04 jts

I use these scripts and data:

  • ENCODE bisulfite methylation data and comparison scripts obtained from https://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html wget http://s3.climb.ac.uk/nanopolish_tutorial/methylation_example.tar.gz
$ ls -al methylation_example*                         
-rw-rw----  1 prom genxone 2143680012 May 24  2018 methylation_example.tar.gz

methylation_example:
total 363476
drwxrws---+  3 prom genxone      4096 Mar 12 11:15 .
drwxrws---+ 10 prom genxone    241664 Apr  2 17:37 ..
-rw-rw----   1 prom genxone 298449481 May 24  2018 albacore_output.fastq
-rw-rw----   1 prom genxone   4585993 Dec  6  2017 bisulfite.ENCFF835NTC.example.tsv
-rwxrwx---   1 prom genxone      2956 Mar 10 14:58 compare_methylation.py
drwxrws---+  2 prom genxone   3379200 Dec  5  2017 fast5_files
-rw-rw----   1 prom genxone  65518244 Nov 24  2017 reference.fasta

The recently introduced change in compare_methylation.py is a required change in the column name:

$ diff methylation_example_org/compare_methylation.py methylation_example/compare_methylation.py 
40c40
<         if int(record["num_cpgs_in_group"]) > 1:
---
>         if int(record["num_motifs_in_group"]) > 1:

macieksk avatar Apr 02 '20 15:04 macieksk

Ok. That encode data is for a very small region of the genome - maybe try to get the full bisulfite run to see if this isn't an issue local to that region?

Jared

jts avatar Apr 02 '20 15:04 jts

Does this sample should be OK? https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2138820 These are methylations from embryonic stem cells, hmm...

Just to make sure: should I download *_CpG_GRCh38.bed.gz file, not *_CHG_GRCh38.bed.gz, or *_CHH_GRCh38.bed.gz or all of them?

macieksk avatar Apr 02 '20 16:04 macieksk

Hm, I' dont know how well ESCs will compare to a norrmal human genome (from blood?). You want the CpG file though.

I'd also suggest this one, as it is the complete set for the sample we use in the tutorial:

https://www.encodeproject.org/files/ENCFF835NTC/

jts avatar Apr 02 '20 16:04 jts

Can this be ok, that median and mean of log_lik_ratio from this sample are such: 1.8514790996785 1.9768107392749 ?

macieksk avatar Apr 02 '20 16:04 macieksk

Hard to say, can you plot a histogram of the LLRs?

jts avatar Apr 02 '20 16:04 jts

Or maybe a histogram of the methylated frequency output

jts avatar Apr 02 '20 16:04 jts

image

macieksk avatar Apr 02 '20 16:04 macieksk

@mike-molnar can you look at one of our internal PromethION runs for a normal human blood sample and see if you see the same level of methylation?

jts avatar Apr 02 '20 16:04 jts

This is the comparison of a healthy human PromethION sample with https://www.encodeproject.org/files/ENCFF835NTC/ all sites data. Still looks overmethylated. image

macieksk avatar Apr 02 '20 17:04 macieksk

plot_methylation_frequency.pdf

We have a similar level of methylation for our normal sample too.

mike-molnar avatar Apr 02 '20 17:04 mike-molnar

Hi @mike-molnar, this looks indeed similar to our sample. Do you have any plots comparing your normal PromethION sample methylation frequency with ENCODE data, such as above, or just with the smaller subset example from https://nanopolish.readthedocs.io/en/latest/quickstart_call_methylation.html ?

macieksk avatar Apr 03 '20 20:04 macieksk

From my experience, methylation profiles can vary significantly between different samples, and even different types of cells from the same sample. To get a meaningful comparison you should only compare nanopolish methylation frequencies against bisulfite methylation frequencies from the same sample and cell type. Are any of the previous comparisons you showed from the same sample and cell type?

mike-molnar avatar Apr 05 '20 00:04 mike-molnar

@mike-molnar All samples I have analysed with Nanopollish call-methylation are human WGS samples from blood. Two of them are sequenced on PromethION, one of them is sequenced on GridION (FLO-MIN106, SQK-LSK108). All of these results have very similar characteristic when comparing to https://www.encodeproject.org/files/ENCFF835NTC/ data. Here is the comparison for the sample sequenced on FLO-MIN106: image

I checked where does the example data comes from, on its dataset details it states: Biosample Type: cell line; Replication type: isogenic.

Do you think this is the reason for the difference? That our samples comes from blood, and the example one is from a cell line?

macieksk avatar Apr 06 '20 15:04 macieksk

Maybe try comparing two of your blood samples against each other?

jts avatar Apr 06 '20 15:04 jts

I guess they will be more similar, will do that in a moment. I started this issue because from the comparison with example I was worried that something is not right with detected methylation levels. They're a bit on a high side. I'm trying to find pure blood samples in ENCODE with no success, there are samples like lymphocyte cell lines, or various tissue samples...

Here is the comparison between our two PromethION WGS human blood samples: image The differences on levels 0 and 1 are probably mostly due to uneven coverage, because there is less of these when I filter for coverage>2.

macieksk avatar Apr 06 '20 16:04 macieksk