Convert bam to pseudo bisulfite output
Is there a way to convert a modified mapped or unmapped bam containing the MM/ML tags into a pseudo bisulfite output? In other words, convert every unmodified C to T and output a new bam? This would greatly benefit downstream analysis with existing tools that rely on bisulfite reads going into Bismark or abismal for example.
Any method to produce such a file would be very useful!
Hello @faulk-lab,
That's not likely a feature we're going to add I'm afraid. Especially considering that 3-letter alignment (with tools like Bismark) is something we're trying to avoid. May I ask what downstream transformations/analysis you're trying to perform that modkit will not already do? Maybe I can add those to modkit instead.
The reasoning is understandable. The measures I'm most interested in are entropy and partially methylated domains (PMDs). There is currently no method or tool to calculate methylation entropy using native long-read data, but several tools can do this with bisulfite data. Specifically dnmtools is one.
Entropy is particularly important in aging so I'd really like to be able to measure it. You should be able to add a function to modkit for it fairly simply I hope. Let me know how I could help!
Hey @faulk-lab,
An entropy calculation certainly something I can add. Entropy is especially interesting when you consider that Nanopore reads will have 5hmC as well as 5mC. I'm also actively working on adding segmentation for things like DMRs, so it would be reasonable to extend it to PMDs, hyper/hypo-methylated regions, etc. Would you be potentially interested in testing a prototype (once I have it)?
For sure! I'd definitely be interested in beta testing. I have lots of relevant bam files that would be a great set of controls for it.
Sounds good, I'll hand over a build when it's ready, thanks!
Hello @faulk-lab,
I have a branch with an entropy calculation to try. A quick disclaimer, I would consider this a "alpha"-level feature and certainly not production-ready. There is some missing functionality, for example duplex reads don't work yet, but the command generally-speaking performs the computation I expect. I'm very keen to get your input.
The command works generally like pileup in that it takes a sorted-indexed mod-BAM and outputs a table. The output format is a bedGraph with the schema:
| column | name | description | type |
|---|---|---|---|
| 1 | chrom | contig name | string |
| 2 | start | start of interval | int |
| 3 | end | end of interval | int |
| 4 | entropy | methylation entropy | float |
| 5 | num_reads | number of reads used | int |
$ modkit entropy
Use a mod-BAM to calculate methylation entropy over genomic windows
Usage: modkit entropy [OPTIONS] --ref <REFERENCE_FASTA> <IN_BAM>
Arguments:
<IN_BAM> Input mod-BAM
Options:
-o, --out-bed <OUT_BED> Output BED // should be bedGraph, default to stdout
-n, --num-positions <NUM_POSITIONS> Number of modified positions to consider at a time [default: 4]
-w, --window-size <WINDOW_SIZE> Maximum length interval that "num_positions" modified bases can occur in. The maximum window size decides how dense the positions are packed. For example, consider that the num_positions is equal to 4, the motif is CpG, and the window_size is equal to 8, this configuration
would require that the modified positions are immediately adjacent to each other, "CGCGCGCG". On the other hand, if the window_size was set to 12, then multiple sequences with various patterns of other bases can be used CGACGATCGGCG [default: 50]
--no-filtering Do not perform any filtering, include all mod base calls in output
--filter-threshold <FILTER_THRESHOLD> Specify the filter threshold globally or for the canonical calls. When specified, base modification call probabilities will be required to be greater than or equal to this number. If `--mod-thresholds` is also specified, _this_ value will be used for canonical calls
--mod-thresholds <MOD_THRESHOLDS> Specify a passing threshold to use for a base modification, independent of the threshold for the primary sequence base or the default. For example, to set the pass threshold for 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual
and used for canonical cytosine and other modifications unless the `--filter-threshold` option is also passed. See the online documentation for more details
-t, --threads <THREADS> Number of threads to use [default: 4]
--ref <REFERENCE_FASTA> Reference sequence in FASTA format
--motif <MOTIF> <MOTIF> Motif to use for entropy calculation, default will be CpG. The motif must be reverse-complement palindromic
--min-coverage <MIN_VALID_COVERAGE> Minimum coverage required at each position in the window. Windows without at least this many valid reads will be skipped, but positions within the window with enough coverage can be used by neighboring windows [default: 3]
--log-filepath <LOG_FILEPATH> Send debug logs to this file, setting this file is recommended
--force Force overwrite output
-h, --help Print help information (use `--help` for more detail)
In my opinion the num-positions and window-size are the most confusing part. Hopefully the help string is clear - but don't hesitate to ask questions.
The entropy calculation is:
H = (-1 / num_positions) * sum([prob(pattern) * log2(prob(pattern)) for pattern in patterns])
where a pattern is the epiallele of num-position CpGs (or whatever motif you specify). For example, consider a window that has 4 CpGs represented by h for 5hmC m for 5mC and C for canonical.
patterns = [
CCCC,
CCCC,
CCCC,
CCCC,
CCCC,
CCCC,
]
H = 0 (actually will come out to be -0.0)
patterns = [
CCCC,
CCCC,
CCCC,
mmmm,
mmmm,
mmmm,
]
H = 0.25
patterns = [
CCCC,
CCCC,
CCCC,
CCmm,
hhhh,
mmhh,
]
H = 0.4481203
You can change the number of positions (4 in the above example) and the window size. If you set the number of positions to 1 you'll get the entropy per CpG.
I really appreciate the feedback and I'll be running some experiments myself as well.
Very excited to try this! Off to a strong start, it appears to be working well, but then I run into a crash. (base) minknow@betsy:~/Desktop/entropy-test$ ../modkit_devel/modkit entropy --ref ~/Desktop/genomes/mouse/mm10.fa ../mouse_bams/mouse_modmapped-mm10/mouse.mm10.modmapped.bam -t 32 -o mouse.mm10.entropy.bed
window size is set to 50, motif length is 2 starting with contig chr1 at 0-based position 826 attempting to sample 1042 reads 112765 rows written 99304 windows with zero coverage 15704 entropy windows failed 0 batches failed
thread '' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 note: run with RUST_BACKTRACE=1environment variable to display a backtrace thread '' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 thread ' ' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 thread ' ' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 Rayon: detected unexpected panic; aborting Aborted (core dumped)
@faulk-lab,
Ok. I pushed a change that will log when this happens instead of firing an assert. I'm going to try and find an example that exposes the problem on my side. If I can't I might reach out and ask for the mod-BAM you're using that causes the problem. Thanks again for testing this out!
This version completes successfully. The num-positions and window-size parameters make perfect sense to me as described. However I don't know what sensible defaults should be. I see that DNMTools uses 4 CpGs as a sliding window so I presume that would be a reasonable num-positions default. I'll explore window size variations and see what kind of reasonable biological interpretation falls out. I'm planning on intersecting results with CpG island and shore beds.
I'm happy to share the mod-bam I'm using if you need it for testing.
One feature request would be a flag to automatically exclude "-0" windows from the output.
Great. I'm doing some sanity-check experiments as well.
4 CpGs seems to be the "standard" for BS-Seq tools, but from what I've seen (and I might be wrong) there is no limitation on how close together these CpGs actually are. So you could have 4 CpGs in 8 bp or 10 kbp, this is why I added the --window-size parameter. Maybe with short reads, the window size is effectively set by the read length so it's not a problem?
I think you want an option to drop reads with >X% filtered calls in the window.
Here's a build with --drop-zeros and --max-filtered-positions. I found that requiring that a read has at least 3 passing calls (--max-filtered-positions 1, --num-positions 4) improves the resolution of the output.
modkit_dev0c11e15_centos7_x86_64.tar.gz
Ok, I'm trying this one on a couple of datasets and I'll run it with your specified parameters. I'll let you know how it comes out.
I'm still seeing "-0" even when using the --drop-zeros flag, is that intended behavior?
Nope, that's a bug 🐛 . I'll fix it.
Should be fixed on the branch. Here's a build. modkit_dev7acefc4_centos7_x86_64.tar.gz
Hello @faulk-lab, find anything interesting? I'm prioritizing which features to clean up and get into the next release. I'll also settle for "the entropy calculations make sense, but I haven't found anything particularly interesting yet" :).
Feature is working fine, just not a lot interesting scientifically. I've run a few datasets that I expect should have some differences in entropy, yet they all come out extremely similar. Not a tool problem, perhaps a parameter problem. My current strategy is to segment the genome various ways to look for regions of greatest expected entropy divergence, e.g., CpG shores, promoters, maybe transposons, I'm not really sure where's the best place to look. My suspicion is that 4 sites is too low. I was out for a week, but now I'm getting back to it.
The entropy calculations make sense, but I haven't found anything interesting yet. I'm still looking though!!!
No problem at all. I like the idea of increasing the --num-positions parameter, using 4 feels like a "short-read" legacy artifact. Glad it's working, I'll push to get it into the next release in a more feature-complete form (e.g. separate on haplotypes or BAM tags, etc.).
I'm not sure the --num-positions always makes sense in a genomic context at all. For instance, right now I just grabbed the flanking 2000 bp of every CpG island in the genome, e.g., shores, and I'd like to know the entropy for those regions, regardless of their number of CpG sites, just based on the regional size. So I'm setting --window-size to 2000, but what do I set --num-positions to? It seems artificial to enforce it.
The solution might be to provide modkit with a .bed of target regions along with the bam, for calculating entropy in those regions as a whole. I think that's more what my mind is envisioning.
I can add the calculation over input regions, it's easy to implement. However, one wrinkle is what do you do if none of your reads span the entire region? Do you skip the calculation? I can think of a bunch of fancier things to do, but it will require some iteration. We've also tossed around the idea of implementing a moving average, which might be the simplest approach.
I went back to the original Xie et al. 2011 paper where it was originally defined. They use 4 sites as an example to show the calculation, but everywhere else in the paper, they calculate entropy based on regions, sometimes variable in size, and don't fix the number of CpG sites in either a window or a read. They purely focus on target region entropy and depth.
It's stated several times that the reads covering their region are contiguous which implies that reads not spanning the full region are not used in the calculation, however I'm not sure how that works for regions longer than their read length. It's honestly a little confusing on what they did in those cases.
For modkit, I would expect a calculation based only on reads spanning the whole region, and if there are none (or none above a set minimum of 4-8ish), then don't report an entropy at all. Then that would mean entropy calculations are dependent on read length, and large regions like "introns" wouldn't get a calculation unless you fell back to window sizes and bed intersects. Just spitballing here.
PS. Of course Andy Feinberg's group calculates it quite differently as well. There is not a straightforward agreed upon metric. doi: [10.1093/nar/gkad050]
Interesting, for the windowing I took a queue from dnmtools actually. I think regions makes sense, of course as you mention, you'll have to keep the length of the region in context.
Let me add a --regions option that'll be similar to modkit dmr. I have some ideas for handling regions wider than the reads, I'll do some experiments and share once I've got them implemented. Thank you as always for the feedback.
A
Hey @faulk-lab,
I've implemented the ability to calculate entropy over user-defined regions. You can find the code on the ar/entropy branch and a build attached here.
There are a few updates and changes.
- You can now get entropy calculations for the (+) and (-) strand. To combine the strands use
--combine-strands(likepileup). There is also a shortcut--cpgthat will effectively add--motif CG 0and--combine-strands. Use--cpgto get the same settings as were available before. - Use
--regionsand a tab-separated BED3 or BED4 file of regions to aggregate the entropy windows over those regions. After some internal discussion, we decided to first try a "descriptive statistics" approach. The entropy calculation is the same as before (--num-positionsmodified positions within--window-sizegenomic base pairs), but at the end the mean, median, max, and min entropy of the windows is calculated and reported. I've attached the schema of the--regionsoutput below. - You can use
--motifor--baseto get entropy on other sequences, this addition is mostly to aid some of my experiments, most folks like yourself will probably want--motif CG 0or just--cpg. - Use
--headerand get a header line on the output (a small thing, but handy).
I also found and fixed a few bugs that were discarding reads, so some of your previous calculations might change.
Schema of output table with --regions:
| column no. | name | description | type |
|---|---|---|---|
| 1 | chrom | contig name | str |
| 2 | start | 0-based start position | int |
| 3 | end | 0-based end position | int |
| 4 | region_name | name from the input file, or |
str |
| 5 | mean_entropy | simple average of the entropy of all the windows in the region | float |
| 6 | strand | strand, + or -, combine strands should be + |
str |
| 7 | median_entropy | median of entropy values for the windows in a region | float |
| 8 | min_entropy | minimum window entropy in the region | float |
| 9 | max_entropy | maximum window entropy in the region | float |
| 10 | mean_num_reads | simple average of the number of reads in the windows | float |
| 11 | min_num_reads | minimum of the number of reads in the windows | float |
| 12 | max_num_reads | maximum of the number of reads in the windows | float |
| 13 | successful_window_count | number of windows in the region that entropy were successfully calculated | int |
| 14 | failed_window_count | number of windows in the region that entropy were not successfully calculated | int |
Here's an example of the output:
chr20 9838623 9839212 r127 0.68565124 + 0.5304569 0.88880104 0.32367444 48.2 45 51 45 0
chr20 10034962 10035265 r128 0.20862168 + 0.3445545 0.5381736 0 41.34375 37 43 32 0
chr20 10172120 10172544 r129 0.4214942 + 0.54357606 0.77866566 0 34.78125 33 37 32 0
chr20 10217487 10218335 r130 0.7906345 + 0.8130215 0.9690159 0.47927356 36.464287 32 39 56 0
chr20 10433628 10434344 r131 0.21610866 + 0 0.46761268 0 56.411766 52 59 68 0
chr20 10671925 10674962 r132 0.22791503 + 0.14538011 0.6451373 0 50.457874 43 56 273 0
Qualitatively, you can see the different amounts of randomness:
r130
r131
I fixed a few bugs and --regions now outputs the "window" bedgraph as well as the region summaries.
modkit_devb8f1f9ae_centos7_x86_64.tar.gz
This regional method seems to be exactly what I'm looking for, except for an error in some bed files.
I got the new version working with my two input bam files from chimp 30X, however it core dumps with some input bed files. 1) using the cpg islands track from UCSC panTro6, works fine. 2) I expanded and took -only- the flanking 2kb regions up and downstream into a second bed file with bedtools. That bed file starts to run, then core dumps after 6100 rows.
@faulk-lab Can you send me the error you get and a sample of the BED file?
Sure, here's one error and a sample of the bed file. I can upload other files if you tell me where.
(base) minknow@betsy:~/Desktop/entropy-test-chimp$ ./modkit-devel/modkit entropy --cpg --header -t 32 --ref ~/Desktop/genomes/panTro6.fa --regions panTro6.up1k.bed P15990chimpYF.4khz.modmapped.bam -o P15990chimpYF.up1k.bed
> using CpG motif and combining strands
> window size is set to 50, motif (RegexMotif { forward_pattern: OverlappingRegex { inner: Regex("CG") }, reverse_pattern: OverlappingRegex { inner: Regex("CG") }, motif_info: MotifInfo { forward_offset: 0, reverse_offset: 1, length: 2, is_palendrome: true }, raw_motif: "CG" }) length is 2
> combining (+)-strand and (-)-strand modification calls
> starting with region XM_016953709.1_up_1000_chr1_8359764_f at 0-based position 8359796 on contig chr1
> attempting to sample 1042 reads
> 0 rows written
> 0 regions with zero coverage
> 48041 rows written
> 294 regions with zero coverage
> 5334 regions failed
> 0 batches failed thread '<unnamed>' panicked at src/entropy/mod.rs:1074:13:
> Aborted (core dumped)
Bed file came directly from UCSC Table Browser
(base) minknow@betsy:~/Desktop/entropy-test-chimp$ head panTro6.up1k.bed
chr1 8359763 8360763 XM_016953709.1_up_1000_chr1_8359764_f 0 +
chr1 8359763 8360763 XM_016953713.1_up_1000_chr1_8359764_f 0 +
chr1 8359763 8360763 XM_016953695.2_up_1000_chr1_8359764_f 0 +
chr1 8359763 8360763 XM_016953700.2_up_1000_chr1_8359764_f 0 +
chr1 8359763 8360763 XM_016953669.2_up_1000_chr1_8359764_f 0 +
chr1 8359763 8360763 XR_001717885.2_up_1000_chr1_8359764_f 0 +
chr1 8359775 8360775 XM_016953688.2_up_1000_chr1_8359776_f 0 +
chr1 50306746 50307746 XR_002937998.1_up_1000_chr1_50306747_f 0 +
chr1 50306746 50307746 XR_001706588.2_up_1000_chr1_50306747_f 0 +
chr1 58735347 58736347 XR_001706896.2_up_1000_chr1_58735348_r 0 -