modkit icon indicating copy to clipboard operation
modkit copied to clipboard

Convert bam to pseudo bisulfite output

Open faulk-lab opened this issue 1 year ago • 37 comments

Is there a way to convert a modified mapped or unmapped bam containing the MM/ML tags into a pseudo bisulfite output? In other words, convert every unmodified C to T and output a new bam? This would greatly benefit downstream analysis with existing tools that rely on bisulfite reads going into Bismark or abismal for example.

Any method to produce such a file would be very useful!

faulk-lab avatar Mar 01 '24 01:03 faulk-lab

Hello @faulk-lab,

That's not likely a feature we're going to add I'm afraid. Especially considering that 3-letter alignment (with tools like Bismark) is something we're trying to avoid. May I ask what downstream transformations/analysis you're trying to perform that modkit will not already do? Maybe I can add those to modkit instead.

ArtRand avatar Mar 01 '24 14:03 ArtRand

The reasoning is understandable. The measures I'm most interested in are entropy and partially methylated domains (PMDs). There is currently no method or tool to calculate methylation entropy using native long-read data, but several tools can do this with bisulfite data. Specifically dnmtools is one.

Entropy is particularly important in aging so I'd really like to be able to measure it. You should be able to add a function to modkit for it fairly simply I hope. Let me know how I could help!

faulk-lab avatar Mar 01 '24 14:03 faulk-lab

Hey @faulk-lab,

An entropy calculation certainly something I can add. Entropy is especially interesting when you consider that Nanopore reads will have 5hmC as well as 5mC. I'm also actively working on adding segmentation for things like DMRs, so it would be reasonable to extend it to PMDs, hyper/hypo-methylated regions, etc. Would you be potentially interested in testing a prototype (once I have it)?

ArtRand avatar Mar 01 '24 15:03 ArtRand

For sure! I'd definitely be interested in beta testing. I have lots of relevant bam files that would be a great set of controls for it.

faulk-lab avatar Mar 01 '24 15:03 faulk-lab

Sounds good, I'll hand over a build when it's ready, thanks!

ArtRand avatar Mar 01 '24 17:03 ArtRand

Hello @faulk-lab,

I have a branch with an entropy calculation to try. A quick disclaimer, I would consider this a "alpha"-level feature and certainly not production-ready. There is some missing functionality, for example duplex reads don't work yet, but the command generally-speaking performs the computation I expect. I'm very keen to get your input.

The command works generally like pileup in that it takes a sorted-indexed mod-BAM and outputs a table. The output format is a bedGraph with the schema:

column name description type
1 chrom contig name string
2 start start of interval int
3 end end of interval int
4 entropy methylation entropy float
5 num_reads number of reads used int
$ modkit entropy
Use a mod-BAM to calculate methylation entropy over genomic windows

Usage: modkit entropy [OPTIONS] --ref <REFERENCE_FASTA> <IN_BAM>

Arguments:
  <IN_BAM>  Input mod-BAM

Options:
  -o, --out-bed <OUT_BED>                    Output BED // should be bedGraph, default to stdout
  -n, --num-positions <NUM_POSITIONS>        Number of modified positions to consider at a time [default: 4]
  -w, --window-size <WINDOW_SIZE>            Maximum length interval that "num_positions" modified bases can occur in. The maximum window size decides how dense the positions are packed. For example, consider that the num_positions is equal to 4, the motif is CpG, and the window_size is equal to 8, this configuration
                                             would require that the modified positions are immediately adjacent to each other, "CGCGCGCG". On the other hand, if the window_size was set to 12, then multiple sequences with various patterns of other bases can be used CGACGATCGGCG [default: 50]
      --no-filtering                         Do not perform any filtering, include all mod base calls in output
      --filter-threshold <FILTER_THRESHOLD>  Specify the filter threshold globally or for the canonical calls. When specified, base modification call probabilities will be required to be greater than or equal to this number. If `--mod-thresholds` is also specified, _this_ value will be used for canonical calls
      --mod-thresholds <MOD_THRESHOLDS>      Specify a passing threshold to use for a base modification, independent of the threshold for the primary sequence base or the default. For example, to set the pass threshold for 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual
                                             and used for canonical cytosine and other modifications unless the `--filter-threshold` option is also passed. See the online documentation for more details
  -t, --threads <THREADS>                    Number of threads to use [default: 4]
      --ref <REFERENCE_FASTA>                Reference sequence in FASTA format
      --motif <MOTIF> <MOTIF>                Motif to use for entropy calculation, default will be CpG. The motif must be reverse-complement palindromic
      --min-coverage <MIN_VALID_COVERAGE>    Minimum coverage required at each position in the window. Windows without at least this many valid reads will be skipped, but positions within the window with enough coverage can be used by neighboring windows [default: 3]
      --log-filepath <LOG_FILEPATH>          Send debug logs to this file, setting this file is recommended
      --force                                Force overwrite output
  -h, --help                                 Print help information (use `--help` for more detail)

In my opinion the num-positions and window-size are the most confusing part. Hopefully the help string is clear - but don't hesitate to ask questions.

The entropy calculation is:

H = (-1 / num_positions) * sum([prob(pattern) * log2(prob(pattern)) for pattern in patterns])

where a pattern is the epiallele of num-position CpGs (or whatever motif you specify). For example, consider a window that has 4 CpGs represented by h for 5hmC m for 5mC and C for canonical.

patterns = [
  CCCC,
  CCCC,
  CCCC,
  CCCC,
  CCCC,
  CCCC,
]
H = 0 (actually will come out to be -0.0)

patterns = [
  CCCC,
  CCCC,
  CCCC,
  mmmm,
  mmmm,
  mmmm,
]
H = 0.25

patterns = [
  CCCC,
  CCCC,
  CCCC,
  CCmm,
  hhhh,
  mmhh,
]
H = 0.4481203

You can change the number of positions (4 in the above example) and the window size. If you set the number of positions to 1 you'll get the entropy per CpG.

I really appreciate the feedback and I'll be running some experiments myself as well.

modkit_devde38fa00_centos7_x86_64.tar.gz

branch

ArtRand avatar Mar 14 '24 00:03 ArtRand

Very excited to try this! Off to a strong start, it appears to be working well, but then I run into a crash. (base) minknow@betsy:~/Desktop/entropy-test$ ../modkit_devel/modkit entropy --ref ~/Desktop/genomes/mouse/mm10.fa ../mouse_bams/mouse_modmapped-mm10/mouse.mm10.modmapped.bam -t 32 -o mouse.mm10.entropy.bed

window size is set to 50, motif length is 2 starting with contig chr1 at 0-based position 826 attempting to sample 1042 reads 112765 rows written 99304 windows with zero coverage 15704 entropy windows failed 0 batches failed
thread '' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 note: run with RUST_BACKTRACE=1 environment variable to display a backtrace thread '' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 thread '' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 thread '' panicked at src/entropy/methylation_entropy.rs:106:5: assertion failed: (total - sequences.len() as f32) < 1e-3 Rayon: detected unexpected panic; aborting Aborted (core dumped)

faulk-lab avatar Mar 14 '24 04:03 faulk-lab

@faulk-lab,

Ok. I pushed a change that will log when this happens instead of firing an assert. I'm going to try and find an example that exposes the problem on my side. If I can't I might reach out and ask for the mod-BAM you're using that causes the problem. Thanks again for testing this out!

modkit_dev8dd548f_centos7_x86_64.tar.gz

ArtRand avatar Mar 14 '24 14:03 ArtRand

This version completes successfully. The num-positions and window-size parameters make perfect sense to me as described. However I don't know what sensible defaults should be. I see that DNMTools uses 4 CpGs as a sliding window so I presume that would be a reasonable num-positions default. I'll explore window size variations and see what kind of reasonable biological interpretation falls out. I'm planning on intersecting results with CpG island and shore beds.

I'm happy to share the mod-bam I'm using if you need it for testing.

faulk-lab avatar Mar 14 '24 16:03 faulk-lab

One feature request would be a flag to automatically exclude "-0" windows from the output.

faulk-lab avatar Mar 14 '24 16:03 faulk-lab

Great. I'm doing some sanity-check experiments as well.

4 CpGs seems to be the "standard" for BS-Seq tools, but from what I've seen (and I might be wrong) there is no limitation on how close together these CpGs actually are. So you could have 4 CpGs in 8 bp or 10 kbp, this is why I added the --window-size parameter. Maybe with short reads, the window size is effectively set by the read length so it's not a problem?

ArtRand avatar Mar 14 '24 17:03 ArtRand

I think you want an option to drop reads with >X% filtered calls in the window.

ArtRand avatar Mar 15 '24 04:03 ArtRand

Here's a build with --drop-zeros and --max-filtered-positions. I found that requiring that a read has at least 3 passing calls (--max-filtered-positions 1, --num-positions 4) improves the resolution of the output. modkit_dev0c11e15_centos7_x86_64.tar.gz

ArtRand avatar Mar 15 '24 16:03 ArtRand

Ok, I'm trying this one on a couple of datasets and I'll run it with your specified parameters. I'll let you know how it comes out.

faulk-lab avatar Mar 15 '24 16:03 faulk-lab

I'm still seeing "-0" even when using the --drop-zeros flag, is that intended behavior?

faulk-lab avatar Mar 15 '24 21:03 faulk-lab

Nope, that's a bug 🐛 . I'll fix it.

ArtRand avatar Mar 15 '24 22:03 ArtRand

Should be fixed on the branch. Here's a build. modkit_dev7acefc4_centos7_x86_64.tar.gz

ArtRand avatar Mar 15 '24 23:03 ArtRand

Hello @faulk-lab, find anything interesting? I'm prioritizing which features to clean up and get into the next release. I'll also settle for "the entropy calculations make sense, but I haven't found anything particularly interesting yet" :).

ArtRand avatar Mar 26 '24 23:03 ArtRand

Feature is working fine, just not a lot interesting scientifically. I've run a few datasets that I expect should have some differences in entropy, yet they all come out extremely similar. Not a tool problem, perhaps a parameter problem. My current strategy is to segment the genome various ways to look for regions of greatest expected entropy divergence, e.g., CpG shores, promoters, maybe transposons, I'm not really sure where's the best place to look. My suspicion is that 4 sites is too low. I was out for a week, but now I'm getting back to it.

The entropy calculations make sense, but I haven't found anything interesting yet. I'm still looking though!!!

faulk-lab avatar Mar 26 '24 23:03 faulk-lab

No problem at all. I like the idea of increasing the --num-positions parameter, using 4 feels like a "short-read" legacy artifact. Glad it's working, I'll push to get it into the next release in a more feature-complete form (e.g. separate on haplotypes or BAM tags, etc.).

ArtRand avatar Mar 27 '24 00:03 ArtRand

I'm not sure the --num-positions always makes sense in a genomic context at all. For instance, right now I just grabbed the flanking 2000 bp of every CpG island in the genome, e.g., shores, and I'd like to know the entropy for those regions, regardless of their number of CpG sites, just based on the regional size. So I'm setting --window-size to 2000, but what do I set --num-positions to? It seems artificial to enforce it.

faulk-lab avatar Mar 27 '24 00:03 faulk-lab

The solution might be to provide modkit with a .bed of target regions along with the bam, for calculating entropy in those regions as a whole. I think that's more what my mind is envisioning.

faulk-lab avatar Mar 27 '24 00:03 faulk-lab

I can add the calculation over input regions, it's easy to implement. However, one wrinkle is what do you do if none of your reads span the entire region? Do you skip the calculation? I can think of a bunch of fancier things to do, but it will require some iteration. We've also tossed around the idea of implementing a moving average, which might be the simplest approach.

ArtRand avatar Mar 27 '24 00:03 ArtRand

I went back to the original Xie et al. 2011 paper where it was originally defined. They use 4 sites as an example to show the calculation, but everywhere else in the paper, they calculate entropy based on regions, sometimes variable in size, and don't fix the number of CpG sites in either a window or a read. They purely focus on target region entropy and depth.

It's stated several times that the reads covering their region are contiguous which implies that reads not spanning the full region are not used in the calculation, however I'm not sure how that works for regions longer than their read length. It's honestly a little confusing on what they did in those cases.

For modkit, I would expect a calculation based only on reads spanning the whole region, and if there are none (or none above a set minimum of 4-8ish), then don't report an entropy at all. Then that would mean entropy calculations are dependent on read length, and large regions like "introns" wouldn't get a calculation unless you fell back to window sizes and bed intersects. Just spitballing here.

PS. Of course Andy Feinberg's group calculates it quite differently as well. There is not a straightforward agreed upon metric. doi: [10.1093/nar/gkad050]

faulk-lab avatar Mar 27 '24 23:03 faulk-lab

Interesting, for the windowing I took a queue from dnmtools actually. I think regions makes sense, of course as you mention, you'll have to keep the length of the region in context.

Let me add a --regions option that'll be similar to modkit dmr. I have some ideas for handling regions wider than the reads, I'll do some experiments and share once I've got them implemented. Thank you as always for the feedback.

A

ArtRand avatar Mar 28 '24 00:03 ArtRand

Hey @faulk-lab,

I've implemented the ability to calculate entropy over user-defined regions. You can find the code on the ar/entropy branch and a build attached here.

There are a few updates and changes.

  1. You can now get entropy calculations for the (+) and (-) strand. To combine the strands use --combine-strands (like pileup). There is also a shortcut --cpg that will effectively add --motif CG 0 and --combine-strands. Use --cpg to get the same settings as were available before.
  2. Use --regions and a tab-separated BED3 or BED4 file of regions to aggregate the entropy windows over those regions. After some internal discussion, we decided to first try a "descriptive statistics" approach. The entropy calculation is the same as before (--num-positions modified positions within --window-size genomic base pairs), but at the end the mean, median, max, and min entropy of the windows is calculated and reported. I've attached the schema of the --regions output below.
  3. You can use --motif or --base to get entropy on other sequences, this addition is mostly to aid some of my experiments, most folks like yourself will probably want --motif CG 0 or just --cpg.
  4. Use --header and get a header line on the output (a small thing, but handy).

I also found and fixed a few bugs that were discarding reads, so some of your previous calculations might change.

Schema of output table with --regions:

column no. name description type
1 chrom contig name str
2 start 0-based start position int
3 end 0-based end position int
4 region_name name from the input file, or :- str
5 mean_entropy simple average of the entropy of all the windows in the region float
6 strand strand, + or -, combine strands should be + str
7 median_entropy median of entropy values for the windows in a region float
8 min_entropy minimum window entropy in the region float
9 max_entropy maximum window entropy in the region float
10 mean_num_reads simple average of the number of reads in the windows float
11 min_num_reads minimum of the number of reads in the windows float
12 max_num_reads maximum of the number of reads in the windows float
13 successful_window_count number of windows in the region that entropy were successfully calculated int
14 failed_window_count number of windows in the region that entropy were not successfully calculated int

Here's an example of the output:

chr20  9838623   9839212   r127  0.68565124  +  0.5304569   0.88880104  0.32367444  48.2       45  51  45   0
chr20  10034962  10035265  r128  0.20862168  +  0.3445545   0.5381736   0           41.34375   37  43  32   0
chr20  10172120  10172544  r129  0.4214942   +  0.54357606  0.77866566  0           34.78125   33  37  32   0
chr20  10217487  10218335  r130  0.7906345   +  0.8130215   0.9690159   0.47927356  36.464287  32  39  56   0
chr20  10433628  10434344  r131  0.21610866  +  0           0.46761268  0           56.411766  52  59  68   0
chr20  10671925  10674962  r132  0.22791503  +  0.14538011  0.6451373   0           50.457874  43  56  273  0

Qualitatively, you can see the different amounts of randomness:

r130

high_entropy_cpg_island

r131

low_entropy_cpg_island

modkit_devc9cee075_centos7_x86_64.tar.gz

ArtRand avatar Apr 09 '24 15:04 ArtRand

I fixed a few bugs and --regions now outputs the "window" bedgraph as well as the region summaries. modkit_devb8f1f9ae_centos7_x86_64.tar.gz

ArtRand avatar Apr 11 '24 18:04 ArtRand

This regional method seems to be exactly what I'm looking for, except for an error in some bed files.

I got the new version working with my two input bam files from chimp 30X, however it core dumps with some input bed files. 1) using the cpg islands track from UCSC panTro6, works fine. 2) I expanded and took -only- the flanking 2kb regions up and downstream into a second bed file with bedtools. That bed file starts to run, then core dumps after 6100 rows.

faulk-lab avatar Apr 14 '24 22:04 faulk-lab

@faulk-lab Can you send me the error you get and a sample of the BED file?

ArtRand avatar Apr 15 '24 13:04 ArtRand

Sure, here's one error and a sample of the bed file. I can upload other files if you tell me where.

(base) minknow@betsy:~/Desktop/entropy-test-chimp$ ./modkit-devel/modkit entropy --cpg --header -t 32 --ref ~/Desktop/genomes/panTro6.fa --regions panTro6.up1k.bed P15990chimpYF.4khz.modmapped.bam -o P15990chimpYF.up1k.bed
> using CpG motif and combining strands
> window size is set to 50, motif (RegexMotif { forward_pattern: OverlappingRegex { inner: Regex("CG") }, reverse_pattern: OverlappingRegex { inner: Regex("CG") }, motif_info: MotifInfo { forward_offset: 0, reverse_offset: 1, length: 2, is_palendrome: true }, raw_motif: "CG" }) length is 2
> combining (+)-strand and (-)-strand modification calls
> starting with region XM_016953709.1_up_1000_chr1_8359764_f at 0-based position 8359796 on contig chr1
> attempting to sample 1042 reads
> 0 rows written
> 0 regions with zero coverage
> 48041 rows written
> 294 regions with zero coverage
> 5334 regions failed
> 0 batches failed                                                                                                                                                                thread '<unnamed>' panicked at src/entropy/mod.rs:1074:13:
> Aborted (core dumped)

Bed file came directly from UCSC Table Browser

(base) minknow@betsy:~/Desktop/entropy-test-chimp$ head panTro6.up1k.bed
chr1    8359763 8360763 XM_016953709.1_up_1000_chr1_8359764_f   0       +
chr1    8359763 8360763 XM_016953713.1_up_1000_chr1_8359764_f   0       +
chr1    8359763 8360763 XM_016953695.2_up_1000_chr1_8359764_f   0       +
chr1    8359763 8360763 XM_016953700.2_up_1000_chr1_8359764_f   0       +
chr1    8359763 8360763 XM_016953669.2_up_1000_chr1_8359764_f   0       +
chr1    8359763 8360763 XR_001717885.2_up_1000_chr1_8359764_f   0       +
chr1    8359775 8360775 XM_016953688.2_up_1000_chr1_8359776_f   0       +
chr1    50306746        50307746        XR_002937998.1_up_1000_chr1_50306747_f  0       +
chr1    50306746        50307746        XR_001706588.2_up_1000_chr1_50306747_f  0       +
chr1    58735347        58736347        XR_001706896.2_up_1000_chr1_58735348_r  0       -

faulk-lab avatar Apr 15 '24 16:04 faulk-lab