modkit icon indicating copy to clipboard operation
modkit copied to clipboard

100% skipped reads

Open Hedi65 opened this issue 3 months ago • 12 comments

Dear developer

All of my reads were skipped after using modkit. Here are my steps.

  1. The data were re-basecalled (using Dorado) in SUP mode $ data/dorado-0.9.1-linux-x64/bin/dorado basecaller --device cuda:all --no-trim --modified-bases-models /mnt/DataStore/Tools/dorado-0.9.1_profiles/V5/[email protected]_4mC_5mC@v3 /mnt/DataStore/Tools/dorado-0.9.1_profiles/V5/[email protected] /mnt/DataStore/all_3_first_run > 4mC_5mC_SUP

  2. The reads below Q 10 were discarded, and high-quality reads barcoded using Dorado demux.

  3. Reads from each barcode were mapped against the reference genome using minimap2.

  4. Bam files were sorted and indexed using samtools.

  5. modkit was used on the indexed BAM file: $modkit pilup B01.sorted.bam B01.bed --log-filepath pileup.log

calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently attempting to sample 10042 reads Done, processed 0 rows. Processed ~0 reads and skipped ~269075 reads.

And

$ samtools view -F 4 B01.sorted.bam | wc -l 439548

And

$ samtools view B01.sorted.bam | grep -m 1 -E '\sMM:Z:' || echo "no MM tags found" no MM tags found

By looking at the log file, I think the issue is MM tags.

[modkit-core/src/reads_sampler/mod.rs::126][2025-09-07 01:46:49][DEBUG] sampled 0 records [modkit-core/src/interval_chunks.rs::531][2025-09-07 01:46:49][DEBUG] there are 1958 contig(s) to work on (1958 parts) [modkit-core/src/read_cache.rs::329][2025-09-07 01:46:49][DEBUG] 9d21257a-e035-436c-965d-e23f77e0838e: MM-tag-missing [modkit-core/src/read_cache.rs::329][2025-09-07 01:46:49][DEBUG] f7f4e4b5-8917-4391-b9f9-9b253463250f: MM-tag-missing [modkit-core/src/read_cache.rs::329][2025-09-07 01:46:49][DEBUG] 6a79da85-de55-4b01-996b-4eb8b1115697: MM-tag-missing

I suppose the issue is related to my basecalling, which I followed here https://github.com/nanoporetech/dorado/issues/1302#issue-2955430954

I got the same issue when I used the Dorado aligner too (instead of Minimap2).

Do you have any idea how I can overcome this issue?

Thanks for your help

Hedi65 avatar Sep 07 '25 00:09 Hedi65

Update

I noticed that ($ samtools view B01.sorted.bam | grep -m 1 -E '\sMM:Z:' || echo "no MM tags found"), my raw reads before de-multiplexing using Durado, do contain modified bases. However, the demux function will drop those modified bases. I used the Rapid Barcoding Kit for my sequencing, and de-multiplexing is quite important for me. Does it mean that de-multiplexing is not compatible with modified base calling?

Alternatively, can I run modkit pilup on my raw data before de-multiplexing, and then from the bed file, only pick reads whose quality is over 10 (Q is reported in the BAM file from Dorado)? If so, then I can probably collect reads for each barcode from the bed file; I just need to match the read ID with the one that was demultiplexed during the process.

Appreciate your comment.

Hedi65 avatar Sep 07 '25 01:09 Hedi65

Hello @Hedi65, sorry about the delay.

However, the demux function will drop those modified bases. I used the Rapid Barcoding Kit for my sequencing, and de-multiplexing is quite important for me. Does it mean that de-multiplexing is not compatible with modified base calling?

Barcoding is certainly compatible with modified base calling.

I don't think that dorado demux will discard base modification tags. Here is a way to check what's going on: after each step, run modkit modbam check-tags $bam --head 200 this command will report out how many reads have errors and exit non-0 if so. If the demuxed BAMs are missing tags (when the input BAM had them), please bring that up with the dorado team since these tags should be preserved. You could use modkit repair to put the tags back (docs).

If I had to guess, I bet you're loosing the tags at the minimap step, could you try using dorado aligner instead?

Alternatively, can I run modkit pilup on my raw data before de-multiplexing, and then from the bed file, only pick reads whose quality is over 10 (Q is reported in the BAM file from Dorado)?

Pileup doesn't report read-level information so you can't demux the bedMethyl like this. I'd strongly recommend figuring out which step is loosing the tags. If it's the demux step - that should be fixed in dorado and can be patched over with modkit repair. If you're loosing the tags after alignment with minimap, use Dorado aligner instead.

ArtRand avatar Sep 10 '25 23:09 ArtRand

Thanks for the reply

Here is the outcome after demux

/mnt/DataStore/Processed_data/User/Bir/dist_modkit_v0.5.0_5120ef7/modkit modbam check-tags /mnt/DataStore/Processed_data/User/Bir/Barcoded_4mc_5mc_bam_not_filtered/unknown_run_id_SQK-RBK114-24_barcode01.bam --head 200

checking tags on first 200 reads input modBAM contains 200 (100.00%) failed records num PASS records: 0 (0.00%) num records: 200 errors: +----------------+-------+--------+ | error | count | pct | +----------------+-------+--------+ | MM-tag-missing | 200 | 100.00 | | total | 200 | 100 | +----------------+-------+--------+

modified bases: +--------+--------------+----------+------+ | strand | primary_base | mod_code | mode | +--------+--------------+----------+------+ +--------+--------------+----------+------+

Here is the outcome before demux

/mnt/DataStore/Processed_data/User/Bir/dist_modkit_v0.5.0_5120ef7/modkit modbam check-tags /mnt/DataStore/Processed_data/User/Bir/BAM_fastq_from_Dorado/4mC_5mC_SUP.bam --head 200

checking tags on first 200 reads no errors num PASS records: 200 (100.00%) num records: 200 valid record tag headers: +------------+-------+ | tag_header | count | +------------+-------+ | C+21839. | 200 | | C+m. | 200 | +------------+-------+

modified bases: +--------+--------------+----------+------+ | strand | primary_base | mod_code | mode | +--------+--------------+----------+------+ | + | C | 21839 | . | | + | C | m | . | +--------+--------------+----------+------+

I just followed what Dorado demux suggested; I even enabled the no-trim option as well as no filtration (quality or length) at the demux level.

I passed the basecalled BAM file before demux to the dorado aligner, and then I used modkit Pileup . It worked, and a 130 GB bed file was generated. Yet at this moment, we can't see which reads belong to which barcode.

Hedi65 avatar Sep 12 '25 13:09 Hedi65

Hi @Hedi65,

I'm not able to reproduce mod tags being stripped by dorado demux. Using the test data from the dorado repo, and using various versions from 0.9.1 to 1.1.1:

dorado basecaller --no-trim hac,5mC_5hmC \
    tests/data/barcode_demux/read_group_test/SQK-RBK114_BC01_BC04_unclassified.pod5 > calls.bam
dorado demux --kit-name SQK-RBK114-96 -o demuxed calls.bam
samtools view demux-091/0d85015e-6a4e-400c-a80f-c187c65a6d03_SQK-RBK114-96_barcode01.bam | grep -m 1 -E '\sMM:Z:' | cut -f 1

gives

1311ecda-0649-46ad-988d-307a5f4e3bd6

i.e. it's found a read with the MM tags present.

Can you please provide your exact commands for each step?

malton-ont avatar Sep 12 '25 14:09 malton-ont

Hello @Hedi65, sorry about the delay.

However, the demux function will drop those modified bases. I used the Rapid Barcoding Kit for my sequencing, and de-multiplexing is quite important for me. Does it mean that de-multiplexing is not compatible with modified base calling?

Barcoding is certainly compatible with modified base calling.

I don't think that dorado demux will discard base modification tags. Here is a way to check what's going on: after each step, run modkit modbam check-tags $bam --head 200 this command will report out how many reads have errors and exit non-0 if so. If the demuxed BAMs are missing tags (when the input BAM had them), please bring that up with the dorado team since these tags should be preserved. You could use modkit repair to put the tags back (docs).

If I had to guess, I bet you're loosing the tags at the minimap step, could you try using dorado aligner instead?

Alternatively, can I run modkit pilup on my raw data before de-multiplexing, and then from the bed file, only pick reads whose quality is over 10 (Q is reported in the BAM file from Dorado)?

Pileup doesn't report read-level information so you can't demux the bedMethyl like this. I'd strongly recommend figuring out which step is loosing the tags. If it's the demux step - that should be fixed in dorado and can be patched over with modkit repair. If you're loosing the tags after alignment with minimap, use Dorado aligner instead.

Thanks for your reply. My data is coming from 3 different runs; something happened during the merging of the BAM files from other runs. After going through all the steps, I can confirm that modified bases are also detected after demux.

Hedi65 avatar Sep 20 '25 23:09 Hedi65

Hi @Hedi65,

I'm not able to reproduce mod tags being stripped by dorado demux. Using the test data from the dorado repo, and using various versions from 0.9.1 to 1.1.1:

dorado basecaller --no-trim hac,5mC_5hmC \
    tests/data/barcode_demux/read_group_test/SQK-RBK114_BC01_BC04_unclassified.pod5 > calls.bam
dorado demux --kit-name SQK-RBK114-96 -o demuxed calls.bam
samtools view demux-091/0d85015e-6a4e-400c-a80f-c187c65a6d03_SQK-RBK114-96_barcode01.bam | grep -m 1 -E '\sMM:Z:' | cut -f 1

gives

1311ecda-0649-46ad-988d-307a5f4e3bd6

i.e. it's found a read with the MM tags present.

Can you please provide your exact commands for each step?

basecalling

/dorado-0.9.1-linux-x64/bin/dorado basecaller --device cuda:all --no-trim --min-qscore 10 --modified-bases-models /path/to/modified/basecalling/profile /path/to/basecalling/profile /mnt/DataStore/Raw_data/pod5_all_3_first_run > 4mC_5mC_SUP

barcoding

/dorado-0.9.1-linux-x64/bin/dorado demux --kit-name SQK-RBK114-24 --output-dir /mnt/DataStore/Processed_data/4mC_5mC_SUP_barcoded /mnt/DataStore/Processed_data/4mC_5mC_SUP.bam --no-trim

Hedi65 avatar Sep 20 '25 23:09 Hedi65

Thanks for your reply. My data is coming from 3 different runs; something happened during the merging of the BAM files from other runs. After going through all the steps, I can confirm that modified bases are also detected after demux.

Can you confirm that the tags are present before and after each dorado command - please provide every command you are running, and check the state of the tags at each step. If the tags are present after demux then the most likely thing I can see is that the tags are stripped due to hard-clipping during alignment. In this case, you can turn on soft-clipping with the --mm2-opts -Y parameter to dorado aligner (or just -Y to minimap2).

malton-ont avatar Sep 22 '25 07:09 malton-ont

Thanks for your reply. My data is coming from 3 different runs; something happened during the merging of the BAM files from other runs. After going through all the steps, I can confirm that modified bases are also detected after demux.

Can you confirm that the tags are present before and after each dorado command - please provide every command you are running, and check the state of the tags at each step. If the tags are present after demux then the most likely thing I can see is that the tags are stripped due to hard-clipping during alignment. In this case, you can turn on soft-clipping with the --mm2-opts -Y parameter to dorado aligner (or just -Y to minimap2).

Thanks for your reply. The basecalling and barcoding commands are mentioned above. Bam files corresponding to the same esample across different runs were merged with samtools

$ samtools merge -@ 8 "$out" $files

And each merged BAM file was mapped against the bovine reference genome using the Dorado aligner.

$ /dorado aligner /mnt/DataStore/Bovine.fa /mnt/DataStore/demux_4mC_5mC_Q10_1/merged_barcode01.bam --emit-summary -t 20 -o /mnt/DataStore/Processed_data/4mc_5mc_mapping_against_ref_B01

checking tags right after basecalling

$ /mnt/DataStore/A_dist_modkit_v0.5.1_8fa79e3/modkit modbam check-tags /mnt/DataStore/Basecalled/4mC_5mC_SUP_Q10 --head 200

checking tags on first 200 reads no errors num PASS records: 200 (100.00%) num records: 200 valid record tag headers: +------------+-------+ | tag_header | count | +------------+-------+ | C+21839. | 200 | | C+m. | 200 | +------------+-------+

checking tags after merging the barcoded BAM files

$ /mnt/DataStore/A_dist_modkit_v0.5.1_8fa79e3/modkit modbam check-tags /mnt/DataStore/demux_4mC_5mC_Q10_1/merged_barcode01.bam --head 200

checking tags on first 200 reads no errors num PASS records: 200 (100.00%) num records: 200 valid record tag headers: +------------+-------+ | tag_header | count | +------------+-------+ | C+21839. | 200 | | C+m. | 200 | +------------+-------+

**checking tags after mapping the merged BAM file against the ref genome **

$ /mnt/DataStore/A_dist_modkit_v0.5.1_8fa79e3/modkit modbam check-tags /mnt/DataStore/4mc_5mc_mapping_against_ref_B01/merged_barcode01.bam --head 200

checking tags on first 200 reads no errors num PASS records: 200 (100.00%) num records: 200 valid record tag headers: +------------+-------+ | tag_header | count | +------------+-------+ | C+21839. | 200 | | C+m. | 200 | +------------+-------+

modified bases: +--------+--------------+----------+------+ | strand | primary_base | mod_code | mode | +--------+--------------+----------+------+ | + | C | 21839 | . | | + | C | m | . | +--------+--------------+----------+------+

Hedi65 avatar Sep 22 '25 08:09 Hedi65

Hi @Hedi65,

So, to confirm:

  1. dorado basecaller 🆗
  2. dorado demux 🆗
  3. samtools merge 🆗
  4. dorado aligner ❌ ?

Your current dorado aligner command will strip modbase tags as it is hard-clipping. This command needs the --mm2-opts -Y flags to turn on soft-clipping as mentioned above:

/dorado aligner \
    /mnt/DataStore/Bovine.fa \
    /mnt/DataStore/demux_4mC_5mC_Q10_1/merged_barcode01.bam
    --emit-summary \
    -t 20 \
    --mm2-opts -Y \
    -o /mnt/DataStore/Processed_data/4mc_5mc_mapping_against_ref_B01

malton-ont avatar Sep 22 '25 08:09 malton-ont

Hi @Hedi65,

So, to confirm:

  1. dorado basecaller 🆗
  2. dorado demux 🆗
  3. samtools merge 🆗
  4. dorado aligner ❌ ?

Your current dorado aligner command will strip modbase tags as it is hard-clipping. This command needs the --mm2-opts -Y flags to turn on soft-clipping as mentioned above:

/dorado aligner \
    /mnt/DataStore/Bovine.fa \
    /mnt/DataStore/demux_4mC_5mC_Q10_1/merged_barcode01.bam
    --emit-summary \
    -t 20 \
    --mm2-opts -Y \
    -o /mnt/DataStore/Processed_data/4mc_5mc_mapping_against_ref_B01

thanks again

After enabling the --mm2-opts -Y flag, I can confirm that tags are present, although they are also present without the flag.

$ /mnt/DataStore/A_dist_modkit_v0.5.1_8fa79e3/modkit modbam check-tags /mnt/DataStore/Processed_data/4mc_5mc_mapping_against_ref_B01/merged_barcode01.bam --head 200

checking tags on first 200 reads no errors num PASS records: 200 (100.00%) num records: 200 valid record tag headers: +------------+-------+ | tag_header | count | +------------+-------+ | C+21839. | 200 | | C+m. | 200 | +------------+-------+

modified bases: +--------+--------------+----------+------+ | strand | primary_base | mod_code | mode | +--------+--------------+----------+------+ | + | C | 21839 | . | | + | C | m | . | +--------+--------------+----------+------+

Hedi65 avatar Sep 22 '25 09:09 Hedi65

Hi @Hedi65,

I don't think I'm clear on what the problem is then - your original issue said that you were seeing MM-tag-missing messages, but we've now confirmed that the tags are in fact present.

If the dorado commands are correctly generating and preserving the tags then there is no dorado issue and I'm going to hand this back to @ArtRand to determine what the problem is on the modkit side.

malton-ont avatar Sep 22 '25 09:09 malton-ont

Thank you for being so helpful; I appreciate it.

I'm pretty sure that the problem originated from merging the BAM files. One of the students used cat to merge the BAM instead of samtools.

Hedi65 avatar Sep 22 '25 09:09 Hedi65