methrix for ONT support
Hi @tkik
We should think of supporting ONT methylation files. An additional argument to pipeline to import modkit output would be an easier solution.
Clarissa is also quite interested in this.
Are you by any chance working with ONT or similar long-read data? We should talk about it.
Hi @PoisonAlien,
Good idea! I am working with ONT data a lot, I often read in modkit output. I just have to add it to methrix. Let's talk once, I have some ideas :) please write me an email so we can set up a meeting.
May I take the liberty of asking about the follow-up results? Because I am also trying to visualize the results of modkit or ONT again;
Hi @xinghui-guo ,
We are working on it. It should be available soon.
Hi Reka,
I have an implementation of native support for ModKit bedMethyl files through a new read_modkit() function on the modkit branch.
Initially, I tried implementing similar to how we process bedgraphs files. It works, but it chokes when the file size becomes larger. Especially with bedMethyl files containing 5mC, 5hmC, 6mA, 5fC, 5caC, etc. in a single file.
I gave up on native R implementation and switched to C code. Since bedMethyl files can be tabix-indexed, the code makes use of this. It will create one if it does not exist. Only requirement is that input files must be bgzip.
In my testing, it processes multiple 12GB bedMethyl files under a minute and consumes very little memory. So it's fast. You can specify what kind of modification to import with target_mod argument. Supports "m" (5mC), "h" (5hmC), "a" (6mA), "c" (5fC), "g" (5caC). Default "m".
When you provide the ref fasta file, it additionally extracts ref base and the surrounding context, adds it to the rowData()slot.
I remember you mentioned that you have a working code. Could you please try this and compare it with your approach?
# install
remotes::install_github(repo = "CompEpigen/methrix@modkit")
> m = read_modkit(files = c(BDG1, BDG2), target_mod = "m", ref_fasta = FA)
Reading ModKit bedMethyl files...
Files: 2
Target modification: 5-methylcytosine (5mC)
Samples: 2
Reference FASTA: hg38.p14.fa
Will extract reference base and sequence context
Auto-discovering chromosomes from 2 files...
Scanning /omics/odcf/analysis/OE0290_projects/pediatric_tumor/2025-06-23_nb_newborn/05-tmp/mkit_merge/Dimelo_1_20241001_PAW42303_ungrouped.wf_mods.bedmethyl.gz...
Scanning /omics/odcf/analysis/OE0290_projects/pediatric_tumor/2025-06-23_nb_newborn/05-tmp/mkit_merge/Dimelo_2_20241014_PAW37404_ungrouped.wf_mods.bedmethyl.gz...
Discovered 24 chromosomes with data
Generated 24 bins of size 1000000 covering data regions
Processing 24 bins sequentially...
Reference FASTA provided - extracting sequence context...
C processing completed in: 2.426025 secs
Sites discovered: 181,178
Reference context extracted for all sites
Added reference base and context information to rowData
Calculating basic statistics...
Total processing time: 3.560264 secs
Created methrix object: 181178 sites × 2 samples
Context distribution:
CG: 181,178
> m
An object of class methrix
n_CpGs: 181,178
n_samples: 2
is_h5: FALSE
Reference: hg38.p14.fa
> rowData(m)
DataFrame with 181178 rows and 5 columns
chr start strand ref_base context
<character> <integer> <character> <character> <character>
1 chr1 10468 . C CG
2 chr1 10470 . C CG
3 chr1 10488 . C CG
4 chr1 10492 . C CG
5 chr1 10496 . C CG
... ... ... ... ... ...
181174 chrY 2999329 . C CG
181175 chrY 2999368 . C CG
181176 chrY 2999371 . C CG
181177 chrY 2999393 . C CG
181178 chrY 2999592 . C CG
That's great, looking forward to you releasing the latest version results! @PoisonAlien
Hi Anand,
It looks interesting! We were focusing only on 5mC or 5hmC, so your approach seems superior to that. We will give it a try.
Do you think it would make sense to allow a methrix object to contain multiple modifications? For example all the 5C ones? We can easily implement the possibility of using additional assays.
Hi Reka,
Yes, I think it should be doable. Modkit's bedMethyl tools have nice options as well (merging similar mods).
Hi Reka,
Did you get a chance to test it? If this works for you, we can push it to Bioconductor for the upcoming October release.
Hi, I am testing it now. I am using DNA methylation from cfDNA, so not all the sites are covered. However, I am wondering why I end up with such a low number of sites. Trying with 3 files ranging from 2.5-14 million CG m sites.
meth <- read_modkit(bed_files[2:4], combine_strands = T, n_cores = 10, ref_fasta = "Q:/genome.fa", quality_filter = 0, min_coverage = 1)
Reading ModKit bedMethyl files...
Files: 3
Target modification: 5-methylcytosine (5mC)
Samples: 3
Reference FASTA: genome.fa
Will extract reference base and sequence context
Parallel cores: 10
Auto-discovering chromosomes from 3 files...
Scanning W:/PEAR-OP/LCS/BED/20250716_1130_MN48054_FBA70268_464b4301/LCS_1542418.bed.bgz...
Scanning W:/PEAR-OP/LCS/BED/20250723_0949_MN48054_FBA76850_bdb6eb3f/LCS_159286.bed.bgz...
Scanning W:/PEAR-OP/LCS/BED/20250731_1129_MN48054_FAY20326_0ed7aa9e/LCS_360272.bed.bgz...
Discovered 25 chromosomes with data
Generated 26 bins of size 1000000 covering data regions
Processing 26 bins sequentially...
Reference FASTA provided - extracting sequence context...
C processing completed in: 11.543 secs
Sites discovered: 405,071
Reference context extracted for all sites
Added reference base and context information to rowData
Calculating basic statistics...
Total processing time: 11.69237 secs
Created methrix object: 405071 sites × 3 samples
Context distribution:
CG: 405,071
Am I missing something?
Hi, Thanks for testing. Could you please install again from modkit branch and try again? There were bunch of bugs with the tabix file streaming that caused the issue. I have fixed them most. There are additional features to read in only specific chromosomes or genomic ranges. Let me know if it works as expected.
# Basic 5mC analysis (chromosome sizes file required)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes", target_mod = "m")
# With context extraction
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
ref_fasta = "hg38.fa", target_mod = "m")
# Large dataset with HDF5 backend (memory efficient)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
h5 = TRUE, h5_dir = "./h5_data")
# With custom interval size for memory control
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes", interval_size = 5000000)
# Process only chr1 and chr2
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
chromosomes = c("chr1", "chr2"))
# Process specific genomic regions (BED file)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
regions = "promoters.bed")
# Process custom regions (data.frame)
my_regions <- data.frame(
chr = c("chr1", "chr1", "chr2"),
start = c(1000000, 5000000, 2000000),
end = c(2000000, 6000000, 3000000)
)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
regions = my_regions)