modkit icon indicating copy to clipboard operation
modkit copied to clipboard

I have some questions regarding the bed methyl file, would you help?

Open ralanany opened this issue 10 months ago • 1 comments

Q1: Would you explain why there is difference?

I have bed methyl file for cytosine modification, what I have understood for each modified site, it measures the frequency of hydroxy methyl or methyl cytosine modification.

bed methyl file

1- chr1 10468 10469 h 6 . 10468 10469 255,0,0 6 0.00 0 1 5 0 1 0 0 2- chr1 10468 10469 m 6 . 10468 10469 255,0,0 6 83.33 5 1 0 0 1 0 0 3- chr1 10470 10471 h 6 . 10470 10471 255,0,0 6 0.00 0 1 5 0 1 0 0 4- chr1 10470 10471 m 6 . 10470 10471 255,0,0 6 83.33 5 1 0 0 1 0 0 5- chr1 10637 10638 h 7 . 10637 10638 255,0,0 7 28.57 2 0 5 0 0 0 0 6- chr1 10637 10638 m 7 . 10637 10638 255,0,0 7 71.43 5 0 2 0 0 0 0 7- chr1 10640 10641 h 6 . 10640 10641 255,0,0 6 16.67 1 0 5 0 1 0 0 8- chr1 10640 10641 m 6 . 10640 10641 255,0,0 6 83.33 5 0 1 0 1 0 0 9- chr1 10562 10563 h 6 . 10562 10563 255,0,0 6 0.00 0 5 1 0 0 1 0 10- chr1 10562 10563 m 6 . 10562 10563 255,0,0 6 16.67 1 5 0 0 0 1 0

when I used wc -l, The total no was 41503950

  • Then I separated the file into 2 other files based on the modification code (m/h). I found that the file with m modifications has 20751975, while file with h modifications has 19175142.

Would you explain why there is difference, although each site should mention both mods.

Q2: In some sites, the percent for hydroxy and m modifications is not equal to 0 and 100, as mentioned in the above lines (5-10). How can I decide whether this position have hydroxy or methyl modification?, How can I find the most significant modified positions as from arrays?

Q3: Regarding the summary modbam: Does the modification calls table represent the whole sample (whole modbam file) or only certain region (region chr 20) as mentioned in the totals table (the first part of the summary)?

summary output

parsing region chr20 # only present if --region option is provided sampling 10042 reads from BAM # modulated with --num-reads calculating threshold at 10% percentile # modulated with --filter-percentile calculated thresholds: C: 0.7167969 # calculated per-canonical base, on the fly

bases C total_reads_used 9989 count_reads_C 9989 pass_threshold_C 0.7167969 region chr20:0-64444167 base code pass_count pass_frac all_count all_frac C m 1192533 0.58716166 1305956 0.5790408 C h 119937 0.0590528 195335 0.086608544 C - 718543 0.3537855 754087 0.33435062

Thanks inadvance

ralanany avatar Jan 14 '25 08:01 ralanany

Thanks for your questions! Here are some clarifications:

  1. Discrepancy in row counts: Can you confirm how you counted the number of h rows? The total number of rows in your file doesn’t match the combined counts for m and h, which suggests something might be off in the filtering or counting step. Each site should indeed have both m and h lines, so the counts should be equal.
  2. Percentages for m and h: Unlike canonical bases, modified bases like 5mC and 5hmC typically exist as percentages across reads. It’s not unusual to see sites with non-zero values for both m and h. The significance of a site being modified depends on your system and the biological question. • For example, if you’re looking for highly modified sites, you might filter by a specific threshold (e.g., sites where the percentage for m or h exceeds a given value). • Modkit offers tools for advanced analysis, such as modkit dmr for comparing samples or modkit motif for motif discovery. If you can share more about your downstream goals, I’d be happy to guide you further.
  3. Summary Modbam: If you used the --region argument, the results in the summary file are restricted to that region (e.g., chr20). Without --region, the results would represent the entire sample.

Let me know if you need additional help!

marcus1487 avatar Jan 16 '25 20:01 marcus1487