modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Skipped: AUX data not found for methylation

Open vzg100 opened this issue 1 year ago • 9 comments

I'm trying to extract methylation data from bacterial genomes using Dorado 0.5.0 and modkit 2.4 (using the bioconda hosted package).

I've aligned my reads using Dorado align and sorted/indexed my BAM file (shown below). However, when I run modkit I encounter the following error for every read I map: `Skipped: AUX data not found

failed to get modbase info for record b21ac088-3198-4294-a334-8a76070f5c97`.

I'm not really sure how to debug this error - is it just a parameter I'm missing or is there something else?

Alignment script:

dorado aligner assembly.fasta reads.fastq.gz  > sample.bam
samtools sort sample.bam -o sample_sorted.bam
samtools index sample_sorted.bam -o  sample_sorted.bai

Basecalling Script:

It is worth noting the basecalling script basecalls each pod5 seperately and then merges and demultiplexes them. I'm on a cluster so it's easy to run in parallel across multiple GPUs.

dorado basecaller [email protected] sample.pod5 --kit-name kit --modified-bases "5mCG_5hmCG" > sample.bam

Conversion Script:

samtools fastq -T "*" sample.bam > sample.fastq

vzg100 avatar Feb 15 '24 20:02 vzg100

Hello @vzg100,

I was able to replicate the problem. It appears that when you run dorado aligner with fastq input it does not keep the MM/ML/MN tags. Instead of converting to fastq (conversion script), try using the output from dorado directly. You can even skip the alignment step:

dorado basecaller \
  [email protected] sample.pod5 \
  # Use a valid kit name, of course.
  --kit-name kit \
  # pass a reference to basecaller and align at the same time
  --ref assembly.fasta \
  --modified-bases "5mCG_5hmCG" \
  > sample_mapped.bam

# ... samtools sort, index, etc.
modkit .. 

ArtRand avatar Feb 15 '24 22:02 ArtRand

Hi @ArtRand,

Thank you for the quick and thoughtful reply.

My bad if I failed to RTFM, but how do you separate reads by barcode using this approach?

vzg100 avatar Feb 16 '24 03:02 vzg100

Hello @vzg100,

Dorado will handle that as well (barcode docs). Make sure to use the latest version >=0.5.2, before that there was a bug with trimming the modified base tags.

ArtRand avatar Feb 16 '24 14:02 ArtRand

Hi @ArtRand,

Again, thank you for the fast response! Based on my understanding of the dorado docs I don't think that workflow would work particularly well for bacterial assemblies. For each barcode we have a different assembly. So we would need to be able to filter reads by barcode before mapping.

Unless there is a way to filter by barcode in the basecaller?

vzg100 avatar Feb 16 '24 19:02 vzg100

Hello @vzg100,

Maybe I'm missing something obvious, but could you partition your reads based on the BC tag?

$ samtools view -bh --tag BC:<barcode> ${bam} -o ${barcode_bam}

All workflows are different - so I understand if you already have something that's working and don't want to change. You could use modkit repair (docs) to replace the MM/ML/MN tags in your alignments.

ArtRand avatar Feb 17 '24 15:02 ArtRand

Hi @ArtRand,

Thank you for bringing up the BC tag for samtools - I wasn't aware of that!

I'll go ahead and try that out

vzg100 avatar Feb 19 '24 15:02 vzg100

@vzg100 any luck?

ArtRand avatar Feb 21 '24 14:02 ArtRand

Hi @ArtRand,

Thank you for your patience and suggestions - rebasecalling took minute

Unfortunately running the code blow didn't work:

dorado basecaller [email protected] pod5_dir \
    --kit-name ${kit} \
    --reference ${assembly} \
    --modified-bases "5mCG_5hmCG" > sample.bam
samtools view -bh --tag BC:01 sample.bam -o sample_filtered.bam
samtools sort sample_filtered.bam -o sample_sorted_filtered.bam
samtools index sample_sorted_filtered.bam -o  sample_sorted_filtered.bai

What ended up working was demultiplexing with dorado demux and then running the (sorted and indexed) demuxed bam file through modkit. Though this approach involves basecalling twice - first to make the assembly and then second to align the modified bases.

vzg100 avatar Feb 22 '24 19:02 vzg100

Hello @vzg100,

Could you elaborate on what exactly didn't work? Where are you loosing the base modification calls? You should definitely not need to basecall your data twice. At the very least if there is a step where you loose the MM/ML/MN tags you can use modkit repair to project the base modifications back onto the reads.

ArtRand avatar Feb 23 '24 23:02 ArtRand