modkit icon indicating copy to clipboard operation
modkit copied to clipboard

failed to get modbase info for record ..., Skipped: AUX data not found

Open ArnavBharti opened this issue 1 year ago • 22 comments

I am getting this error while running modkit pileup: modkit pileup <bam> <bed>

failed to get modbase info for record <record>, Skipped: AUX data not found

Procedure

  1. nanopolish index -d
  2. minimap2 -a -x map-ont
  3. samtools sort
  4. samtools index

I also ran modbam2bed. A file was formed by it gave methylation nan or 0

What could be the reason so?

ArnavBharti avatar Apr 10 '24 17:04 ArnavBharti

Hello @ArnavBharti,

You must be loosing the MM/ML/MN tags somewhere along the way. Could you check that the records have these tags after each step? If you can tell me the file formats of the outputs and inputs to the steps that could help me debug the problem.

ArtRand avatar Apr 11 '24 13:04 ArtRand

  1. Initially FAST5
  2. After basecalling, FASTQ
  3. Nanopolish index: (input) FAST5, FASTQ; (output) index files .fai, etc.
  4. minimap: (input) FASTA, FASTQ; (output) SAM
  5. samtools: (input) sam (output) bam

ArnavBharti avatar Apr 12 '24 11:04 ArnavBharti

Hello @ArnavBharti,

Could you check that the reads have the MM/ML/MN tags at the end of each step? I have a feeling that your basecalled FASTQ does not have them. For your reference the tags I'm talking about are described in the SAM tags specification in section 1.7.

ArtRand avatar Apr 12 '24 16:04 ArtRand

Hi, could you tell if you check for these tags via text search or is there a tool. Because there exists 'MM' in the FASTq file upon a simple text search but I feel like I am probably missing something.

ArnavBharti avatar Apr 14 '24 13:04 ArnavBharti

Hi, I was also facing the same issue, is there a way to check the tags in the fastq file.

PRIYANKA-22091995 avatar Apr 15 '24 07:04 PRIYANKA-22091995

Hello @ArtRand

It would be nice if you could give insights on the same.

Hi, could you tell if you check for these tags via text search or is there a tool. Because there exists 'MM' in the FASTq file upon a simple text search but I feel like I am probably missing something.

PRIYANKA-22091995 avatar Apr 16 '24 08:04 PRIYANKA-22091995

The SAM/BAM tags as part of the FASTQ file format is not officially supported. The "support" for this is simply via the -y argument to minimap2 which "Copy input FASTA/Q comments to output." (see manual here). Thus the only way to check for the tags in a FASTQ would be text based. I would suggest like the following awk line to print out any rows of the file with the MM tag: awk 'NR %4 == 1 && $0 ~ /MM:Z:/'

If you must use minimap2 ensure that you have the -y option specified to ensure the tags are indeed copied to the output mappings. This is the most likely step where the tags are dropped as long as modified base calling was on at basecalling time (though I think this might require unmapped BAM output).

marcus1487 avatar Apr 16 '24 12:04 marcus1487

Hi,

  1. I used y flag with minimap2 minimap2 -aY -x map-ont and it still gave same error
  2. the awk awk 'NR %4 == 1 && $0 ~ /MM:Z:/' gave no output

As per 2, Is there a problem with the basecalled files?

I'll try witth dorado aligner

ArnavBharti avatar Apr 17 '24 14:04 ArnavBharti

-Y and -y are different options in minimap. In this case you want the lower case one not the upper case one as shown in your command.

marcus1487 avatar Apr 17 '24 14:04 marcus1487

Even with -y same issue

ArnavBharti avatar Apr 18 '24 03:04 ArnavBharti

Given that the awk line gives no results the issue is that you do not have modified base calls in your basecalls. Modified base calls are not available for FASTQ output from Dorado. If you output the default unmapped BAM the modified base calls will be there. You can also have Dorado perform the mapping while basecalling to avoid any of the minimap issues.

I'm not sure what your downstream goals are with running nanopolish etc, but you can convert your unmapped BAM file to FASTQ with the samtools fastq -T "*" command. Let me know if you have any further issues.

marcus1487 avatar Apr 18 '24 16:04 marcus1487

Hello @marcus1487

The basecalling was done using trained RNN model i.e. res_dna_r941_min_modbases-all-context_v001 by using guppy v3.5.1.

PRIYANKA-22091995 avatar Apr 19 '24 09:04 PRIYANKA-22091995

Hello @marcus1487 @ArtRand Art Using guppy v3.5.1, I attempted to do basecalling using the trained RNN model, res_dna_r941_min_modbases-all-context_v001.Following the completion of basecalling, which produced the fastq files, minimap was used to create the bam file. I then ran modkit pileup, however it failed with the error "failed to get modbase info for record \record>." omitted: AUX data not located

Note: Since the data came from the R9 flowcell, guppy was run. Therefore, in order to do the m6A modified basecalling, I utilised the following configuration file: res_dna_r941_min_modbases-all-context_v001

Your opinions on the same would be greatly appreciated.

Thanks

PRIYANKA-22091995 avatar Apr 20 '24 13:04 PRIYANKA-22091995

I don't believe modified base output in FASTQ format is supported. You'll have to specify BAM output format in order to proceed from guppy basecalling.

marcus1487 avatar Apr 20 '24 14:04 marcus1487

I don't believe modified base output in FASTQ format is supported. You'll have to specify BAM output format in order to proceed from guppy basecalling. Hello @marcus1487 In case of this there is no option to specify bam output, by default it will only give fastq. Note-But by normal text search i have observed MM tags in the fastq file.

PRIYANKA-22091995 avatar Apr 20 '24 17:04 PRIYANKA-22091995

Hello @marcus1487 @ArtRand

It would be nice if you could help me with the above issue.

Thanks

PRIYANKA-22091995 avatar Apr 23 '24 08:04 PRIYANKA-22091995

Hello @PRIYANKA-22091995 and @ArnavBharti,

You need to run the basecalling such that you get and unaligned BAM file as output (not FASTQ output). I believe that the --bam_out flag will do this. From there I would align the reads with dorado aligner 0.5.3. Could you tell me when you're able to successfully generate a basecall BAM file with the MM/ML tags? You can check the tags with modkit summary.

ArtRand avatar Apr 23 '24 14:04 ArtRand

@ArnavBharti @PRIYANKA-22091995,

Any update on this?

ArtRand avatar Apr 27 '24 00:04 ArtRand

@ArnavBharti @PRIYANKA-22091995,

Any update on this?

there is no option to put --bam out, and even after putting --bam out, it has only given fastq output. It would be nice to know any suggestion/input for the same.

PRIYANKA-22091995 avatar Apr 30 '24 06:04 PRIYANKA-22091995

Hello @ArtRand The following command was used for modified basecalling for R9 flowcell data: -i <FAST5> -s guppy_output_ed_cc_cq_m6a/ -c ~/workspace/rerio/basecall_models/res_dna_r941_min_modbases-all-context_v001.cfg --recursive --calib_reference ./PlasmoDB-61_Genome.fasta -x cuda:0 -m ~/workspace/rerio/basecall_models/res_dna_r941_min_modbases-all-context_v001.jsn --barcode_kits "EXP-NBD104"

PRIYANKA-22091995 avatar Apr 30 '24 06:04 PRIYANKA-22091995

@ArnavBharti @PRIYANKA-22091995,

Any chance you could update guppy to at least 6.3.2? I downloaded v3.5.1 and don't see the options you'll need. If you're using R9 data, you could probably use dorado also.

ArtRand avatar Apr 30 '24 13:04 ArtRand

@ArnavBharti @PRIYANKA-22091995,

Any chance you could update guppy to at least 6.3.2? I downloaded v3.5.1 and don't see the options you'll need. If you're using R9 data, you could probably use dorado also.

Earlier was trying through guppy v0.6.2, but it failed with a error [guppy/error] The pipeline has shut down prematurely due to an error condition. So, then i moved to the previous version to do modified basecalling for m6A, but again it does not provide option with bam. to do modified basecalling with R9 data, dorado does not have option for m6A.

PRIYANKA-22091995 avatar Apr 30 '24 15:04 PRIYANKA-22091995