tombo icon indicating copy to clipboard operation
tombo copied to clipboard

Tombo detect modifications de novo with corrected reads and reproducibility

Open FraPria opened this issue 3 years ago • 2 comments

Hello,

I’m using Tombo to detect modifications ‘de novo’ on direct RNA data. I’m using the following commands (as described in MINES pipeline since the idea is to use MINES afterwards https://github.com/YeoLab/MINES )

tombo preprocess annotate_raw_with_fastqs --fast5-basedir $FAST5_single --fastq-filenames $FASTQ --overwrite --processes $NCPUS
tombo resquiggle $FAST5_single $genome --overwrite --num-most-common-errors 5 --processes $NCPUS
tombo detect_modifications de_novo --fast5-basedirs $FAST5_single --statistics-file-basename $stats --processes $NCPUS
tombo text_output browser_files --fast5-basedirs $FAST5_single --statistics-filename $stats --browser-file-basename $out --file-types coverage fraction

I have a couple of questions:

  1. I’m using proovread to correct nanopore reads with illumina reads. This means that the fastq bases are edited and no more equal to the guppy basecalled fastq. I was wondering whether this could impact any of the tombo processing steps (preprocess/resquiggle/detect_modification)

  2. I’ve run twice (in series, not in parallel) the forehead said commands on the same fastq and fast5. But the results are slighly different (different sizes and number of rows in coverage and fraction modified files). Is it possible to set a seed in order to make reproducible results? Here there are the resquiggle summaries for the 2 run on the same input:

RUN 1

     7.6% (   7597 reads) : Alignment not produced                                                          
     6.4% (   6449 reads) : Fastq slot not present in --basecall-group                                      
     6.0% (   5996 reads) : Poor raw to expected signal matching (revert with `tombo filter clear_filters`) 
     0.5% (    508 reads) : Reference mapping contains non-canonical bases (transcriptome reference cannot contain U bases)
     0.4% (    399 reads) : Read event to sequence alignment extends beyond bandwidth                       
     0.0% (      2 reads) : Not enough raw signal around potential genomic deletion(s)    

RUN 2

     7.6% (   7597 reads) : Alignment not produced                                                          
     6.4% (   6449 reads) : Fastq slot not present in --basecall-group                                      
     6.0% (   5988 reads) : Poor raw to expected signal matching (revert with `tombo filter clear_filters`) 
     0.5% (    508 reads) : Reference mapping contains non-canonical bases (transcriptome reference cannot contain U bases)
     0.4% (    408 reads) : Read event to sequence alignment extends beyond bandwidth                       
     0.0% (      1 reads) : Not enough raw signal around potential genomic deletion(s)   

Thanks a lot

FraPria avatar Dec 11 '20 09:12 FraPria

Hi @FraPria . I'm no expert at Tombo but I'll try to help.

Question 1: It sounds like you're familiar with the difference between basecalling and resquiggling. When a fast5 is basecalled, you get fastq files as output. When you run tombo preprocess annotate_raw_with_fastqs, what you're doing is copying the those fastq files into the fast5 files. When you're done with that, you can do whatever you want to those ".fastq" files that originated during basecalling, and it won't make a difference. I am unfamiliar with proovread.

Question 2: I'm unaware of any options to set random seeds. (I'm not even sure if it's stochastic.) With some Tombo commands, like Tombo resquiggle, you can run tombo resquiggle --print-advanced-arguments to see options that don't show up when you run tombo resquiggle --help. There's nothing there about seeds, though.

SycamoreLeaf avatar Dec 17 '20 19:12 SycamoreLeaf

There are no steps internal to Tombo which are stochastic/random, so there are no seed settings. My best guess would be that the minimap interface is return multi-mapping read mappings in different orders. Tombo only takes the first mapping and discards the rest, so this could lead to this result. This also fits with the observation that the number of mapped reads is the same. All other differences could arise from different mapping locations for the same reads. I don't unfortunately have a way to resolve this issue directly.

marcus1487 avatar Dec 17 '20 22:12 marcus1487