modkit icon indicating copy to clipboard operation
modkit copied to clipboard

modkit dmr pair of pileup bed file subset to 4mC

Open SimonChen1997 opened this issue 1 year ago • 6 comments

Hi,

I ran the [email protected]_4mC_5mC@v2 modified model from Dorado. And I use modkit to generate the bed file including 4mC and 5mC positions using this command:

modkit pileup $minimap2_primary/ecoli_a_${SLURM_ARRAY_TASK_ID}_primary.sorted.bam $modkit_pileup_primary/ecoli_a_${SLURM_ARRAY_TASK_ID}_pileup_C0.bed --motif C 0 --ref $ref

and then I use awk to subset the bed file to only contain 4mC positions:

awk -F"\t" '$4=="21839" {print $0}' $file > $subset_m4C/${file}_5mC.bed

Finally, I use the dmr function to see the differential methylated position (4mC):

modkit dmr pair \
-a $pileup_m4C/ecoli_a_2_4mC.bed.gz \
-a $pileup_m4C/ecoli_a_3_4mC.bed.gz \
-b $pileup_m4C/ecoli_b_2_4mC.bed.gz \
-b $pileup_m4C/ecoli_b_3_4mC.bed.gz \
--ref $ref \
--out-path $modkit_dmr_file_m4C \
--header \
--base C \
-t 4

However, it showed:

finished, processed 0 sites successfully, 0 failed

and the log file showed:

[src/dmr/subcommands.rs::328][2024-10-16 09:55:03][DEBUG] using primary sequence base C
[src/dmr/subcommands.rs::406][2024-10-16 09:55:03][INFO] reading reference FASTA at "/scratch/project/genoepic_rumen/e_coli_reference/ncbi_ref/GCF_000008865.2_ASM886v2_genomic.fna"
[src/dmr/subcommands.rs::421][2024-10-16 09:55:03][INFO] running single-site analysis
[src/dmr/single_site.rs::76][2024-10-16 09:55:03][INFO] using default prior, Beta(α: 0.55, β: 0.55)
[src/dmr/single_site.rs::751][2024-10-16 09:55:03][INFO] estimating max coverages from data
[src/dmr/single_site.rs::795][2024-10-16 09:55:05][INFO] sampled 1073328 a records and 1075105 b records, calculating max coverages for 95th percentile
[src/dmr/single_site.rs::820][2024-10-16 09:55:05][INFO] calculated max coverage for a: 55 and b: 56
[src/dmr/single_site.rs::140][2024-10-16 09:55:05][INFO] running with replicates and matched samples
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (21) is not equal to the sum of canonical and modified counts (20), chrom: NC_002695.2 starting at 1105536
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (21) is not equal to the sum of canonical and modified counts (19), chrom: NC_002695.2 starting at 806329
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (20) is not equal to the sum of canonical and modified counts (17), chrom: NC_002695.2 starting at 251648
[src/dmr/single_site.rs::211][2024-10-16 09:55:07][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (29) is not equal to the sum of canonical and modified counts (1), chrom: NC_002695.2 starting at 976284
[src/dmr/single_site.rs::211][2024-10-16 09:55:08][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (33) is not equal to the sum of canonical and modified counts (32), chrom: NC_002695.2 starting at 514816
[src/dmr/single_site.rs::211][2024-10-16 09:55:08][DEBUG] batch failed, failed to aggregate counts, invalid data, valid coverage (53) is not equal to the sum of canonical and modified counts (52), chrom: NC_002695.2 starting at 186917

Is there any better way to subset the pileup bed file to 4mC and do dmr?

Cheers, Ziming

SimonChen1997 avatar Oct 15 '24 23:10 SimonChen1997

I also looked into the pileup bed file before subseting using this command:

awk -F"\t" '$10 != ($12 + $13) {print $0}' ecoli_a_2_pileup_C0.bed | head

while it showed:

NC_002695.2	1	2	m	29	-	1	2	255,0,0	29	0.00	0	28	1	0	0	0	0
NC_002695.2	23	24	m	31	-	23	24	255,0,0	31	0.00	0	29	2	0	0	0	0
NC_002695.2	25	26	m	32	+	25	26	255,0,0	32	0.00	0	31	1	0	0	0	0
NC_002695.2	31	32	m	31	-	31	32	255,0,0	31	0.00	0	30	1	0	0	0	0
NC_002695.2	33	34	m	31	+	33	34	255,0,0	31	3.23	1	28	2	0	1	0	0
NC_002695.2	33	34	21839	31	+	33	34	255,0,0	31	6.45	2	28	1	0	1	0	0
NC_002695.2	39	40	m	29	-	39	40	255,0,0	29	0.00	0	26	3	0	2	0	0
NC_002695.2	53	54	m	28	-	53	54	255,0,0	28	0.00	0	27	1	0	3	0	0
NC_002695.2	72	73	m	31	+	72	73	255,0,0	31	0.00	0	28	3	0	2	0	0
NC_002695.2	79	80	m	29	-	79	80	255,0,0	29	0.00	0	28	1	1	1	0	0

However, when I used these complete pileup bed files (no subseting), the dmr function worked well without errors.

SimonChen1997 avatar Oct 16 '24 01:10 SimonChen1997

Hello @SimonChen1997,

You cannot filter the bedMethyl as you have done prior to using modkit dmr. You need to have the records for each base modification (4mC and 5mC in this case) in the bedMethyl for the algorithm to work, this is what those log lines are trying to tell you (albeit cryptically).

ArtRand avatar Oct 16 '24 04:10 ArtRand

You cannot filter the bedMethyl as you have done prior to using modkit dmr. You need to have the records for each base modification (4mC and 5mC in this case) in the bedMethyl for the algorithm to work, this is what those log lines are trying to tell you (albeit cryptically).

Hi ArtRand,

Thanks for your reply. But is there any way to do only 4mC DMR?

SimonChen1997 avatar Oct 16 '24 04:10 SimonChen1997

I would recommend using the --ignore m option to pileup in order to remove the 5mC calls from the pileup.

marcus1487 avatar Oct 16 '24 06:10 marcus1487

I would recommend using the --ignore m option to pileup in order to remove the 5mC calls from the pileup.

thanks for reply. --ignore m worked. However, I had another issue. I rebasecalled the m6A, m4C, and m5C at the same time, and then I run these codes:

modkit adjust-mods $minimap2/ecoli_${SLURM_ARRAY_TASK_ID}.sorted.bam stdout --ignore m | modkit adjust-mods stdin $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam --ignore 21839

samtools view -bhS $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam | \
samtools sort -T $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted -o $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam

samtools index $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam

modkit pileup $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.sorted.bam $modkit_pileup/ecoli_${SLURM_ARRAY_TASK_ID}_pileup_m6a.bed --motif A 0 --ref $ref

awk -F "\t" '{print $4}' ecoli_${SLURM_ARRAY_TASK_ID}_pileup_m6a.bed | sort | uniq

it showed these two values in all files:

a
C

what does the C mean here?

the bed file is like this:

NC_002695.2     124     125     a       21      -       124     125     255,0,0 21      0.00    0       21      0       0       0       0       0
NC_002695.2     125     126     C       1       +       125     126     255,0,0 1       0.00    0       1       0       0       0       22      0
NC_002695.2     125     126     a       22      +       125     126     255,0,0 22      0.00    0       22      0       0       0       1       0

SimonChen1997 avatar Oct 24 '24 10:10 SimonChen1997

Hello @SimonChen1997,

Sorry for taking so long to respond to this thread.

The short answer is the C means "any cytosine modification". The reason this shows up is at position 125 you have a single read with an A>C mismatch. The reason this shows up after you've --ignored the m5C and m4C calls is a little more nuanced, it helps to break it down step by step:

modkit adjust-mods $minimap2/ecoli_${SLURM_ARRAY_TASK_ID}.sorted.bam stdout --ignore m

This step removes the m5C probabilities (using the algorithm here) leaving you with unmodified and 4mC probabilities.

modkit adjust-mods stdin $minimap2_m6a/ecoli_${SLURM_ARRAY_TASK_ID}_m6a.bam --ignore 21839

The second step removes the m4C probabilities, leaving you with only unmodified probabilities. All of these probabilities will be 1.0, since all of the probability mass has been moved into the unmodified class.

If you were to run modkit summary on this modBAM you will probably see something like this:

# bases             A,C
# total_reads_used  10042
# count_reads_C     9982
# count_reads_A     10000
# pass_threshold_A  0.6855469
# pass_threshold_C  1
 base  code  pass_count  pass_frac    all_count  all_frac
 C     -     7402043     1            7402043    1
 C     C     0           0            0          0
 A     -     9368729     0.91072136   10038271   0.8788549
 A     a     918423      0.089278646  1383718    0.1211451

What this is saying is you have "any-cytosine modification" information, however all of the calls (pass_frac or all_frac) are unmodified (-).

I think what you want is to fully remove the cytosine modification information from the reads and/or only keep the bedMethyl records for modifications that apply to the reference base (i.e. ignore mismatches). The next version of Modkit will have more "tag manipulation" functionality so you could remove the cytosine modification information if you want. I generally recommend against this, however, since I like to think of modBAMs as an immutable data source and I'd rather filter or select the data I want from them during a transformation instead of copying data around. If you want a bedMethyl that only has records corresponding to reference adenine bases (based on your use of --motif A 0) then I would filter the output of pileup with awk ($4=='a'). Also if you're going to use these data in dmr, that command is smart enough to only use base modifications that modify the primary sequence base that the use specifies - so you don't need to do this step ahead of time.

Also, another option for your dmr work above is to parse the output and look for regions where only the m4C levels change.

ArtRand avatar Oct 31 '24 00:10 ArtRand