megalodon
Barcode support / demultiplexing
Hi there, is there a way to demultiplex simultaneously while basecalling for mod bases using Megalodon? :-)
Megalodon does not currently support demultiplexing. There is no time frame for this support at the moment.
Megalodon can take a list of readIDs to analyze, right? So is a possible (somewhat silly) workaround for this to run guppy basecaller and then barcoder separately from megalodon, extract the readIDs from each barcode, and then run megalodon separately for each list?
Having the ability to demultiplex in megalodon would really help with throughput!
Yes. This would certainly be one workaround.
Another workaround (without the overhead of running the basecalling twice) would be to run Megalodon with only per-read outputs including basecalls (e.g. --outputs basecalls per_read_mods). Then the output basecalls.fastq could be run through a demultiplexing program. These lists of read ids could then be passed to the megalodon_extras aggregate run command to produce desired results for each barcode.
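For anyone trying this, here is a minimal Python sketch of turning a demultiplexer's summary table into one read ID list per barcode. It assumes a tab-separated table with a header row; the column names read_id and barcode_arrangement and both function names are assumptions for illustration, so adjust them to whatever your demultiplexing program actually writes.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_read_ids_by_barcode(summary_lines, id_col="read_id",
                              barcode_col="barcode_arrangement"):
    """Return {barcode: [read_id, ...]} from a tab-separated summary table.

    The column names are assumptions; change them to match your demultiplexer.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(summary_lines, delimiter="\t"):
        groups[row[barcode_col]].append(row[id_col])
    return dict(groups)


def write_read_id_lists(groups, out_dir):
    """Write one <barcode>.txt file per barcode, one read ID per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for barcode, read_ids in groups.items():
        path = out_dir / f"{barcode}.txt"
        path.write_text("\n".join(read_ids) + "\n")
        paths[barcode] = path
    return paths
```

An open file handle can be passed as summary_lines, since csv.DictReader accepts any iterable of lines. Each resulting .txt file is then a plain list of read ids for one barcode.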
Standard bioinformatic analysis could also be used to demultiplex other Megalodon outputs in this workaround. For example mappings and mod_mappings produce a BAM output. This could thus be split by read id into a new BAM file for each barcode.
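As a rough illustration of splitting by read id, the sketch below partitions SAM text records by barcode. It assumes the BAM has first been converted to SAM text (for example with samtools view -h), that the first SAM field (QNAME) is the read id, and that a read id to barcode mapping is already available; the function name and the "unclassified" label are hypothetical.

```python
def split_sam_by_barcode(sam_lines, read_to_barcode, unassigned="unclassified"):
    """Partition SAM text lines into {barcode: [header lines + records]}.

    Header lines (starting with '@') are copied to every barcode so that
    each per-barcode output remains a valid SAM file.
    """
    header = []
    records = {}
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)
            continue
        # QNAME is the first tab-separated field of a SAM record.
        read_id = line.split("\t", 1)[0]
        barcode = read_to_barcode.get(read_id, unassigned)
        records.setdefault(barcode, []).append(line)
    return {barcode: header + recs for barcode, recs in records.items()}
```

Each per-barcode list of lines can be written out and converted back to BAM with samtools. For large files, a streaming version that writes to per-barcode file handles would avoid holding all records in memory.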
The issue with fully integrated barcoding support is that many output streams would have to be implemented for every output type (even though most would likely never be used; e.g. signal_mappings). This also adds complexity to an already pretty complex system, likely introducing bugs and more maintenance. Thus fully integrated demultiplexing is not likely to be implemented soon.
I appreciate the complexity, so thank you for the additional possible workarounds!
Thinking about this a bit further, it seems a solution to this might be to add barcode assignment to applicable outputs. New issues can be raised as further barcoding outputs might be requested. This would bypass the issue of opening many output streams while providing the barcoding output with hopefully minimal fuss for downstream processing.
As an initial implementation I would propose adding barcoding results to the sequencing_summary.txt and mapping_summary.txt output files and adding a read group to mapping outputs. The read group annotated SAM/BAM/CRAM output would allow splitting into barcode files via samtools split. While mods and variants outputs would not be directly supported (as this would require multiple output streams), using the barcode assignments from the mapping_summary.txt output would feed directly into the megalodon_extras aggregate run command (via the read ids option) given the per_read_mods or per_read_variants outputs. This proposal would leave out some of the other outputs from barcoding that seem less applicable (signal_mappings, per_read_refs).
Does this seem like a sufficient resolution to this issue? Still no timeline for implementation/release, just want to figure out the work involved here.
Sounds reasonable to me for those that want aggregated reads separated by barcode.
I'm also very interested in getting the per_read_modified_base_calls.db and/or per_read_modified_base_calls.txt separated by barcode as well. The SAM/BAM/CRAM split-by-barcode you're proposing would allow the info from Mm and Ml tags to be separated along with their reads, so demultiplexing those values would be possible with your approach if I'm understanding correctly. But is there a way to demultiplex for the per_read_modified_base_calls files?
Yes, the proposal would allow the mod_mappings and mod_basecalls outputs (with Mm and Ml tags) to be separated by barcode.
It might make sense to add a barcode field to the mod and variant database reads table (though this might be integrating barcoding a bit too deeply). In either case a command megalodon_extras modified_bases split_by_barcode or megalodon_extras modified_bases split_by_read_ids could be added (similar to megalodon_extras modified_bases split_by_motif).
Hello,
Does Megalodon now support barcoding? Besides what you already proposed, would it be possible to feed Megalodon with demultiplexed fast5 files? I was thinking for example of getting the fast5 files for each barcode with another tool such as fast5_demultiplexer (https://github.com/duceppemo/fast5_demultiplexer), and using these fast5 files as input for Megalodon. Thanks!
If you have a list of read IDs (as a .txt file) that correspond to a given barcode, you can feed it into megalodon using the --read-ids-filename $READ_IDS flag and megalodon will only analyze those reads. You still have to call megalodon separately for each barcode.
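A small Python sketch of the per-barcode loop, in case it helps. It only builds the command lines and does not run them. The --read-ids-filename and --outputs flags are the ones mentioned in this thread; --output-directory, the function name, and the one-file-per-barcode naming (barcode01.txt, ...) are assumptions, and a real run also needs the model and basecaller options for your setup, which are left out here.

```python
from pathlib import Path


def build_megalodon_commands(fast5_dir, read_id_files, out_root,
                             outputs=("basecalls", "mod_mappings")):
    """Return one megalodon argument list per barcode read ID file.

    Assumes each file is named <barcode>.txt. The --output-directory flag
    is an assumption; model/basecaller options are deliberately omitted.
    """
    commands = []
    for read_ids in sorted(Path(p) for p in read_id_files):
        barcode = read_ids.stem  # e.g. "barcode01" from "barcode01.txt"
        commands.append([
            "megalodon", str(fast5_dir),
            "--outputs", *outputs,
            "--read-ids-filename", str(read_ids),
            "--output-directory", str(Path(out_root) / barcode),
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the remaining options for your installation have been added.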
There is also more info in #126 (which I have not tried, but looks to be successful).
Thank you for the quick answer! I might try that then. Otherwise, do you see any problem with running Megalodon on the demultiplexed fast5 files? This sounds easier to me.
I'm not totally sure whether megalodon works for single fast5 files (which looks like the output from the demultiplexer you linked above). I would look at the megalodon documentation. I think in this situation, you would still need to either feed megalodon each list of fast5 files separately or use the suggestion by Marcus above and at #126 to demultiplex at the aggregation step.
@amauryavril it does look like single fast5 files are supported (https://nanoporetech.github.io/megalodon/common_arguments.html?highlight=single%20fast5#required-argument)
Thank you for looking into it! I will try both methods and see what I can get.