dorado icon indicating copy to clipboard operation
dorado copied to clipboard

methylation in PCR samples (not methylated)

Open BioRB opened this issue 2 years ago • 17 comments

Hello, we are running dorado on 2 samples. One native DNA vs the PCR amplified one that "in theory" should not hold the methylation. Why do we get a methylation signal in the PCR sample? The methylation signal is even higher in the PCR compared to the native DNA sample. We have got the same result also with other tools thus my suspect is that there is an issue with the wet part of the experiment. Then I would like to hear your thoughts before drawing my conclusions. Thanks. best, RB

BioRB avatar Oct 03 '23 08:10 BioRB

Hello @BioRB ,

Which methylation model are you using (e.g. 5mCG, 5hmCG, 6mA)?

ArtRand avatar Oct 03 '23 19:10 ArtRand

dear developers, First I used Dorado + modbam2bed and I got those not realistic results. Now I tried with Dorado + modkit and I have better results. The problem is that my results using Dorado + modbam2bad overlap with the results obtained using nanopolish or mcaller (used for different methylations). Thus now I don't know what I have to trust, even if dorado+modkit produces more "realistic " results. I would like to believe that the latest tools (dorado+modkit) are better and give more reliable results .... but in general, we are a bit skeptical because of the huge differences in output using all these different tools. I used an old chemistry flow cell (9.4). I would like your point on it. thanks for your kind help! best rb

PS I used different models for dorado without any improvements

BioRB avatar Oct 04 '23 12:10 BioRB

I have an update. I did a comparison between dorado + modkit and nanopolish for 5mC and I got comparable results. the biggest difference I think is in the threshold to consider to filter the most relevant data. From the really first output from the 2 pipes we have different numbers of sites identified even if the sites overlap (more or less). Thus I can conclude that these new tools ar ok. The only thing I would suggest is to avoid modbam2bed and going directly with modkit. Thanks for the support. best RB

BioRB avatar Oct 05 '23 09:10 BioRB

After a new attempt, I can confirm that Dorado ( like the tools ) produces methylation signals in PCR-amplified samples. The amount of methylation detected is not compatible with background noise that can be taken into account usually. I want your comment on it. Moreover, could you please provide me a dataset of NOT methylated samples to be tested, to reproduce this bias?

BioRB avatar Oct 12 '23 13:10 BioRB

Hello @BioRB.

Did you use the --modified-base option of Dorado? Or just a simplex basecalling ?

Toinax avatar Nov 01 '23 00:11 Toinax

I also have the same problem with dorado+modkit. I get modification calles in PCR products like in PCR-free samples. I used the 10.4.1 chemistry and basecalled the dat in simplex mode without modified-base option.

melachl avatar Jan 24 '24 11:01 melachl

Hello @BioRB and @melachl,

Sorry for being slow to circle back. Could you tell me (again) which model(s) you're using and on what samples? We've tested the models on PCR samples internally and find the false positive rate to be quite low, but not zero. Happy to help.

A

ArtRand avatar Feb 07 '24 00:02 ArtRand

Thank you for your reply! I used dorado basecaller hac,6mA pod5s/ > calls.bam (so this model was used: [email protected]). I looked at mitochondrial DNA from isolated mitochondria of patient skin fibroblasts. I compared sequencing data of sequenced mitochondrial DNA (without PCR amplification) linearized by an Cas9 approach with amplified mitochondrial DNA by long range PCR. The called modifications were nearly identical in both samples.

melachl avatar Feb 07 '24 06:02 melachl

@melachl,

If you run modkit summary $modbam --no-sampling --region chrM on your modBAM, what level of 6mA is predicted in the two samples? My guess is the level of 6mA is within the error of the model. Also the [email protected]_6mA@v2 6mA model (coupled with the SUP basecaller) has the lowest false positive rate. Maybe take the reads aligned to the mitochondrial genome and re-basecall them with this model?

ArtRand avatar Feb 07 '24 22:02 ArtRand

Sorry for the delay. Thank you for your suggestions. I ran modkit summary $modbam --no-sampling --region chrM and as you can see below (2 different samples for native mtDNA and PCR amplified DNA) in the case of methylation on C their is a nice difference between PCR product and native mtDNA but in the case of methylation on A their is no difference. Based on that I do not think that it is within the error of the model.

Native mitochondrial DNA:

modkit summary sorted_mod_SZG122_MitoIso.bam --no-sampling –region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: C: 0.50390625 A: 0.6503906 bases C,A total_reads_used 91657 count_reads_C 88984 count_reads_A 91586 pass_threshold_C 0.50390625 pass_threshold_A 0.6503906 base code pass_count pass_frac all_count all_frac C h 682549 0.08602001 953151 0.10815052 C - 5750072 0.7246678 6081571 0.6900534 C m 1502149 0.18931223 1778467 0.20179608 A a 6083958 0.08939469 9350814 0.123836435 A - 61973309 0.9106053 66158576 0.87616354

Native mitochondrial DNA:

modkit summary sorted_SZG146_Barcode17.bam --no-sampling –region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: A: 0.6503906 C: 0.7832031 bases A,C total_reads_used 65644 count_reads_A 65644 count_reads_C 65341 pass_threshold_C 0.7832031 pass_threshold_A 0.6503906 base code pass_count pass_frac all_count all_frac C m 1679601 0.39676115 1913738 0.40697646 C h 57661 0.01362088 174421 0.037092455 C - 2496018 0.58961797 2614172 0.5559311 A - 49391795 0.90978444 52765065 0.8764003 A a 4897766 0.090215616 7441514 0.12359968

PCR amplified mitochondrial DNA:

modkit summary sorted_FAX76760_NB17_Blut_SZG5091.bam --no-sampling --region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: A: 0.6738281 C: 0.7265625 bases A,C total_reads_used 21985 count_reads_C 21980 count_reads_A 21985 pass_threshold_A 0.6738281 pass_threshold_C 0.7265625 region chrM:0-16569 base code pass_count pass_frac all_count all_frac C m 13152 0.005394336 35567 0.0131578315 C - 2402879 0.9855487 2595382 0.9601484 C h 22082 0.009057004 72156 0.026693746 A a 1953936 0.073953964 3172285 0.10814216 A - 24467040 0.926046 26162112 0.89185786

PCR amplified mitochondrial DNA:

modkit summary sorted_FAX76760_NB19_SZG5231.bam --no-sampling --region chrM not subsampling, using all reads calculating threshold at 10% percentile calculated thresholds: C: 0.7265625 A: 0.6699219 bases C,A total_reads_used 8512 count_reads_A 8512 count_reads_C 8508 pass_threshold_C 0.7265625 pass_threshold_A 0.6699219 region chrM:0-16569 base code pass_count pass_frac all_count all_frac C - 1188639 0.98487115 1284217 0.95880663 C h 11384 0.009432446 36383 0.027163837 C m 6875 0.005696422 18791 0.01402951 A a 1016427 0.0768435 1629063 0.110955946 A - 12210807 0.9231565 13053007 0.88904405

melachl avatar Feb 29 '24 12:02 melachl

Hello @melachl,

Thanks for these numbers. I'm going to run some tests internally and compare. Just to confirm, you're using the SUP basecaller model and [email protected]_6mA@v2 6mA model, correct?

ArtRand avatar Mar 01 '24 17:03 ArtRand

No I used hac ([email protected]) because for sup I only got a few reads per barcode and I don't know why. But in the case for methylation at C there is a clear difference between PCR amplicons and native DNA when using hac so I thought the accuracy should also be enough for 6mA.

melachl avatar Mar 04 '24 08:03 melachl

Hello @melachl,

Sorry for the delay. I ran a test on some PCR and native DNA. I used summary to get an idea of the changes in 6mA in PCR versus native reads:

The command template is:

$ modkit summary ${merged_sorted_bam} --no-sampling --only-mapped
# bases             C,A 
# total_reads_used  793 
# count_reads_A     793 
# count_reads_C     792 
# pass_threshold_A  0.7871094 
# pass_threshold_C  0.76171875 
 base  code  pass_count  pass_frac     all_count  all_frac 
 A     -     494323      0.9833575     534270     0.95895624 
 A     a     8366        0.016642496   22867      0.041043766 
 C     -     45847       0.99598104    50235      0.9852511 
 C     h     101         0.002194126   500        0.009806422 
 C     m     84          0.0018248175  252        0.0049424362 
# bases             C,A 
# total_reads_used  362 
# count_reads_A     362 
# count_reads_C     361 
# pass_threshold_C  0.6386719 
# pass_threshold_A  0.6347656 
 base  code  pass_count  pass_frac    all_count  all_frac 
 C     h     773         0.026936613  1538       0.04827067 
 C     -     25443       0.88660836   27093      0.85032326 
 C     m     2481        0.08645503   3231       0.10140606 
 A     a     41499       0.08555031   63401      0.117845945 
 A     -     443584      0.9144497    474598     0.88215405 

Using the HAC basecalls and the HAC v2 6mA model:

# bases             C,A 
# total_reads_used  788 
# count_reads_A     788 
# count_reads_C     785 
# pass_threshold_A  0.7285156 
# pass_threshold_C  0.921875 
 base  code  pass_count  pass_frac     all_count  all_frac 
 C     -     45531       0.9977211     49702      0.98083794 
 C     h     68          0.0014900843  594        0.011722219 
 C     m     36          0.0007888682  377        0.0074398597 
 A     a     17793       0.03574478    36288      0.065662764 
 A     -     479986      0.9642552     516354     0.93433726
# bases             A,C 
# total_reads_used  349 
# count_reads_C     346 
# count_reads_A     349 
# pass_threshold_C  0.6779299 
# pass_threshold_A  0.6191406 
 base  code  pass_count  pass_frac    all_count  all_frac 
 A     a     51301       0.11001338   72607      0.14067034 
 A     -     415015      0.88998663   443543     0.85932964 
 C     m     2414        0.071928725  3387       0.090826206 
 C     h     1071        0.03191204   2110       0.05658202 
 C     -     30076       0.89615923   31794      0.85259175

tl;dr

  • Use the 6mA@v2 6mA model.
  • I'm seeing about an 8% change in PCR vs native samples in the highest performance (SUP) combination.
  • There is a ~8% FPR with SUP basecalls and 11% FPR with HAC, we expect about 4% FPR with SUP so I'm investigating this further.
  • You should see a larger difference between PCR and native mtDNA than you're seeing. I'm also concerned that you're not able to use the SUP basecaller model, so you might want to dig into what is going on there.

ArtRand avatar Mar 08 '24 00:03 ArtRand

Okey! Thank you very much!

I will try that out!

It is a little bit strange for the basecalling with sup because it is multiplexed that with a few houndred reads per barcode but after basecalling with sup there are only 7 reads per barcode left....

melachl avatar Mar 08 '24 12:03 melachl

Hi @melachl can you share details of how you demultiplexed the data?

tijyojwad avatar Mar 08 '24 23:03 tijyojwad

I did it using this command: $ dorado demux --kit-name --output-dir

melachl avatar Mar 11 '24 06:03 melachl

thanks!

can you try to run the basecaller command with --no-trim and then demux?

alternatively, you can pass the --kit-name to the dorado basecaller cmd to classify in line with basecalling and then use dorado demux --no-classify --ouptut-dir to split into subdirectories?

tijyojwad avatar Mar 11 '24 14:03 tijyojwad