megalodon icon indicating copy to clipboard operation
megalodon copied to clipboard

Instruction about mod_mapping format

Open vahidAK opened this issue 2 years ago • 5 comments

Hi @marcus1487 ,

It seems that the bam file from mod_mapping output does not contain the actual sequence of the read, instead, the actual sequence of the read is replaced by the part of the reference genome that the read mapped into it?

Is it correct?

Thanks, Vahid

vahidAK avatar Feb 07 '22 20:02 vahidAK

Yes. This is correct.

The mod_mappings format is also referred to as "reference-anchored modified base calling". The alternative is the "basecall-anchored modified basecalling" and is available with the mod_basecalls output type. Reference-anchored modified basecalling assumes that the reference mapping sequence is correct for a read and uses this sequence as input to the modified base calling algorithms instead of the basecalled sequence.

For recent data and models (10.4 pore+sup model / Q20+) this makes very little difference since the sequences are generally largely the same, but for fast models and older data providing the reference sequence could greatly improve the accuracy of the modified base calls.

The mod_basecalls output matches the files currently produced by Bonito and Guppy.

marcus1487 avatar Feb 07 '22 20:02 marcus1487

Thanks @marcus1487 ,

So, is it possible to get a bam output from megalodon with the base modification tag and the actual basecalled sequence of the read? I mean sth like mod_mapping but having the actual sequence of the read and not the ref.

vahidAK avatar Feb 07 '22 20:02 vahidAK

Yes. This is the mod_basecalls format, but note that this is a un-mapped BAM file.

marcus1487 avatar Feb 07 '22 20:02 marcus1487

Thanks @marcus1487

Would it be possible to get a version of the mod_basecalls BAM file that has been mapped?

Ultimately what we want is to have a single BAM containing haplotypes (to be added after megalodon runs), methylation calls and sequence alignments. This would also be a single file that could be loaded into IGV for viewing of these data types. Right now we're having to keep multiple copies of BAM files and methylation call information, which is cumbersome and quite inefficient in terms of space usage.

Right now minimap2 won't accept a BAM file as input, so transferring over the methylation tags to a mapped BAM would be somewhat cumbersome to do. It definitely seems like something that would be better supported directly in Megalodon.

oneillkza avatar Feb 07 '22 21:02 oneillkza

This output is supported in Bonito and Guppy (Remora Guppy support is under active development and will be included in a future Guppy release). In terms of including this in megalodon, I'm a bit worried that this output would confuse the megalodon outputs further (having unmapped_mod_basecalls, mapped_mod_basecalls, and mod_mappings). I think the native support for mapping this format would be better suited within minimap2, but look forward to the minimap2 developers on this point. I've just opened a ticket for this output in minimap2 (https://github.com/lh3/minimap2/issues/870).

marcus1487 avatar Feb 07 '22 22:02 marcus1487