xpore icon indicating copy to clipboard operation
xpore copied to clipboard

xpore processing of Epinano Yeast fails

Open cathoderaymission opened this issue 2 years ago • 10 comments

I downloaded and am using the latest xpore 2.0 release off github.

  1. Basecalled all 3 replicates of each of the WT and KO Yeast SK1 from the epinano paper.

  2. Used the data preparation in the documentation, however I did not align to the transcriptome, I aligned to the Yeast SK1 genome which was used in the original paper, and for all the other testing I've done works fine.

  3. Ran nanopolish index and eventalign.

  4. I ran xpore dataprep like this, repeating for every sample: xpore dataprep --eventalign nanopolish/WT1-eventalign.txt --out_dir xpore_wt1 note: There is no GTF file anywhere for Yeast SK1, only GFF which is not supported, so I could not provide any of the other options. Converting files from GFF3 to GTF is problematic.

  5. I ran this xpore diffmod as: xpore diffmod --config yeast-checkpoint.yaml

Using the following config file:

    KO:
        rep1: ./xpore_ko1
        rep2: ./xpore_ko2
        rep3: ./xpore_ko3
    WT:
        rep1: ./xpore_wt1
        rep2: ./xpore_wt2
        rep3: ./xpore_wt3
    
out: ./out-yeast

The resulting output looks like the following with a little over 19,000 entries:

chr12,44866,ATTCT,-0.061132867455457124,0.6470317910992184,-0.4578895553616586,,,0.22520594482917297,0.2863388122846301,,,,,15.0,33.0,,,81.17148481193192,83.69607757395315,2.021410114674792,12.578018904768552,0.8209402255297766,0.3205763517774658,higher
chr12,44922,CGTGA,-0.0037339510572334395,0.9648676459550333,-0.04404611387649633,,,0.0766995031589563,0.08043345421618973,,,,,14.999999999999995,30.0,,,91.58069751978103,74.47065218580812,13.285658785431545,237.06049496403168,0.8132354815877638,0.04528508876218666,lower
chr12,44826,TTCTA,-0.007413209155195888,0.9585701351525215,-0.05194799028563114,,,0.29773667806174525,0.30514988721694114,,,,,15.000000000000002,32.99999999999999,,,79.42891914527769,82.72170169193576,2.0448814576165955,8.464173973308151,0.9918744938306072,0.11398503460662153,higher
chr12,44900,TTACA,0.19734040796833324,0.054835170173881724,1.9201799746588704,,,0.1973716560153303,3.124804699706269e-05,,,,,15.0,32.0,,,81.07283291441756,78.01502841592193,1.8461830212386872,6.249656159679125,0.5700243253289478,0.3744968020236349,lower

However this second column value does not ever get above a value of 80,000 regardless of the reported chromosome, and each chromosome in this genome is very large.

The chromosome and position do point to the reported motif, eg chr12:44864-44869 is indeed ATTCT (from entry 1 in the above table sample), but again the diffmod table never goes past roughly position 80,000 on any chromosome.

Also, in addition to this, would it be possible to add options for xpore to:

  1. Process GFF files. GTF files are old and GFF3 has been the standard for a while.
  2. Report on individual read ids which match some user defined methylation position found so we can determine reads which are methylated?
  3. Can you provide the full GTF/genome/transcriptome files you used for processing the HEK293 data in your paper (not the pyensembl data)?

Thanks!

cathoderaymission avatar Aug 03 '21 09:08 cathoderaymission

Hi @cathoderaymission thanks for the detailed description, we will reply shortly with more details. In the meantime I wanted to ask if you can raise separate issues for (1), (2), and (3)? We can then reply or keep them active for future releases. Thank you!

jonathangoeke avatar Aug 12 '21 01:08 jonathangoeke

Hi @cathoderaymission regarding your main issue here, can you run xPore with the reads that are aligned to the transcriptome reference instead of the genome reference and check if these results have valid coordinates? Nanopolish will not work well with RNA-Seq reads aligned to the genome

jonathangoeke avatar Aug 13 '21 06:08 jonathangoeke

Hi @jonathangoeke, I could do that but it would take 3-4 days to reprocess all the data, so are you sure this is necessary?

I'll note that these are not spliced alignments using "-k14", they are directly aligned to the genome using "minimap2 -ax map-ont"

I'm guessing the 80k limit on the chromosomes has something to do with these warnings with the dataprep command: xpore-2.0-py3.6.egg/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance

I looked through your code to see if there was a simple answer such as a python "range" which caused this limit, but I couldn't find anything.

cathoderaymission avatar Aug 13 '21 08:08 cathoderaymission

Hi @cathoderaymission I would suggest to run xPore with the recommended options first, and f that still returns results which do not cover the expected coordinates, you could post a minimal reproducible example so that we can understand where the error comes from. Thanks

jonathangoeke avatar Aug 16 '21 02:08 jonathangoeke

Hi @cathoderaymission, I tried xPore on Epinano Yeast dataset without genome-to-transcription option and the max position I got on chr12 is 165479. The minimap option I used was "minimap2 -ax splice -u f -k 8 -t 12 -o wt1.sam Yeast_sk1.fa wt1.fastq", the "wt1" stands for wide type replicate 1. I split minimap2 output file based on strand and filtered out unmapped, secondary and supplementary alignments and did each chromosome and strand individually.

Raarbiarsan1899 avatar Aug 16 '21 21:08 Raarbiarsan1899

Hi @jonathangoeke, I have a question. I noticed in https://github.com/GoekeLab/xpore/blob/master/xpore/scripts/dataprep.py line 117:

cond_successfully_eventaligned = eventalign_result['reference_kmer'] == eventalign_result['model_kmer']
    if cond_successfully_eventaligned.sum() != 0:

events are filtered out when reference_kmer is not equal to model_kmer, will this bias the output? because modified base may have signal closer to model kmer than reference kmer. Thank you!

Raarbiarsan1899 avatar Aug 16 '21 23:08 Raarbiarsan1899

@jonathangoeke

  1. Basecalled epinano-yeast KO1-3 and WT1-3 using guppy 4.5.4
  2. minimap2 -ax map-ont --secondary=no -t 24 Yeast_sk1.fa each sample. Yeast_sk1.fa is from https://github.com/enovoa/EpiNano/blob/master/Reference_sequences/Yeast_sk1.fa
  3. samtools sort sample.sam > sample.bam then samtools index sample.bam
  4. nanopolish index -d /path/to/sample/fast5files KO1-basecalled.fastq where all the fastq files from guppy for each sample were combined into a seperate fastq (eg: KO1-basecalled.fastq)
  5.       nanopolish eventalign --reads KO1-basecalled.fastq \                          
        --bam KO1-filtered.bam \
        --genome Yeast_sk1.fa \
        --signal-index \
        --scale-events \
        --summary nanopolish/KO1-summary.txt \                                 
        --threads 32 > nanopolish/KO1-eventalign.txt ```
    

for each seperate sample, KO1-3 and WT1-3 6. xpore diffmod --config yeast-checkpoint.yaml --n_processes 32

This produced the above data from my original submissions.

cathoderaymission avatar Aug 18 '21 11:08 cathoderaymission

Hi @jonathangoeke, I have a question. I noticed in https://github.com/GoekeLab/xpore/blob/master/xpore/scripts/dataprep.py line 117:

cond_successfully_eventaligned = eventalign_result['reference_kmer'] == eventalign_result['model_kmer']
    if cond_successfully_eventaligned.sum() != 0:

events are filtered out when reference_kmer is not equal to model_kmer, will this bias the output? because modified base may have signal closer to model kmer than reference kmer. Thank you!

Hi @Raarbiarsan1899 it might be good to submit a separate issue for this point (unless that is related to the original comment?) Thanks!

jonathangoeke avatar Aug 19 '21 06:08 jonathangoeke

Hi @cathoderaymission have you tried aligning the data to the yeast transcriptome? Ensembl probides such files for example here: http://ftp.ensembl.org/pub/release-104/fasta/saccharomyces_cerevisiae/?

jonathangoeke avatar Aug 19 '21 06:08 jonathangoeke

@jonathangoeke, the epinano paper uses the SK1 strain which is why they provided the fasta file and directly align to it without using spliced alignments. Unfortunately there is no transcriptome, however the closest GFF file I could find is from below and this could be used to make a transcriptome with "gffread" from cufflinks.

https://figshare.com/articles/dataset/Genome_SC_SK1/1265022/1

However there is no GTF file and converting GFF to GTF is problematic.

cathoderaymission avatar Aug 19 '21 06:08 cathoderaymission