methrix methrix for ONT support

Hi @tkik

We should think of supporting ONT methylation files. An additional argument to pipeline to import modkit output would be an easier solution. Clarissa is also quite interested in this.

Are you by any chance working with ONT or similar long-read data? We should talk about it.

Apr 03 '25 07:04 PoisonAlien

Hi @PoisonAlien,

Good idea! I am working with ONT data a lot, I often read in modkit output. I just have to add it to methrix. Let's talk once, I have some ideas :) please write me an email so we can set up a meeting.

Apr 07 '25 14:04 tkik

May I take the liberty of asking about the follow-up results? Because I am also trying to visualize the results of modkit or ONT again;

Jun 21 '25 03:06 xinghui-guo

Hi @xinghui-guo ,

We are working on it. It should be available soon.

Jun 24 '25 13:06 PoisonAlien

Hi Reka,

I have an implementation of native support for ModKit bedMethyl files through a new read_modkit() function on the modkit branch.

Initially, I tried implementing similar to how we process bedgraphs files. It works, but it chokes when the file size becomes larger. Especially with bedMethyl files containing 5mC, 5hmC, 6mA, 5fC, 5caC, etc. in a single file.

I gave up on native R implementation and switched to C code. Since bedMethyl files can be tabix-indexed, the code makes use of this. It will create one if it does not exist. Only requirement is that input files must be bgzip.

In my testing, it processes multiple 12GB bedMethyl files under a minute and consumes very little memory. So it's fast. You can specify what kind of modification to import with target_mod argument. Supports "m" (5mC), "h" (5hmC), "a" (6mA), "c" (5fC), "g" (5caC). Default "m".

When you provide the ref fasta file, it additionally extracts ref base and the surrounding context, adds it to the rowData()slot.

I remember you mentioned that you have a working code. Could you please try this and compare it with your approach?

# install
remotes::install_github(repo = "CompEpigen/methrix@modkit")

> m = read_modkit(files = c(BDG1, BDG2), target_mod = "m", ref_fasta = FA)
Reading ModKit bedMethyl files...
Files: 2
Target modification: 5-methylcytosine (5mC)
Samples: 2
Reference FASTA: hg38.p14.fa
Will extract reference base and sequence context
Auto-discovering chromosomes from 2 files...
  Scanning /omics/odcf/analysis/OE0290_projects/pediatric_tumor/2025-06-23_nb_newborn/05-tmp/mkit_merge/Dimelo_1_20241001_PAW42303_ungrouped.wf_mods.bedmethyl.gz...
  Scanning /omics/odcf/analysis/OE0290_projects/pediatric_tumor/2025-06-23_nb_newborn/05-tmp/mkit_merge/Dimelo_2_20241014_PAW37404_ungrouped.wf_mods.bedmethyl.gz...
Discovered 24 chromosomes with data
Generated 24 bins of size 1000000 covering data regions
Processing 24 bins sequentially...
Reference FASTA provided - extracting sequence context...
C processing completed in: 2.426025 secs
Sites discovered: 181,178
Reference context extracted for all sites
Added reference base and context information to rowData
Calculating basic statistics...
Total processing time: 3.560264 secs
Created methrix object: 181178 sites × 2 samples
Context distribution:
  CG: 181,178

> m
An object of class methrix
   n_CpGs: 181,178
n_samples: 2
    is_h5: FALSE
Reference: hg38.p14.fa


> rowData(m)
DataFrame with 181178 rows and 5 columns
               chr     start      strand    ref_base     context
       <character> <integer> <character> <character> <character>
1             chr1     10468           .           C          CG
2             chr1     10470           .           C          CG
3             chr1     10488           .           C          CG
4             chr1     10492           .           C          CG
5             chr1     10496           .           C          CG
...            ...       ...         ...         ...         ...
181174        chrY   2999329           .           C          CG
181175        chrY   2999368           .           C          CG
181176        chrY   2999371           .           C          CG
181177        chrY   2999393           .           C          CG
181178        chrY   2999592           .           C          CG

Jun 27 '25 12:06 PoisonAlien

That's great, looking forward to you releasing the latest version results! @PoisonAlien

Jun 27 '25 13:06 xinghui-guo

Hi Anand,

It looks interesting! We were focusing only on 5mC or 5hmC, so your approach seems superior to that. We will give it a try. Do you think it would make sense to allow a methrix object to contain multiple modifications? For example all the 5C ones? We can easily implement the possibility of using additional assays.

Jun 27 '25 23:06 tkik

Hi Reka,

Yes, I think it should be doable. Modkit's bedMethyl tools have nice options as well (merging similar mods).

Jun 30 '25 07:06 PoisonAlien

Hi Reka,

Did you get a chance to test it? If this works for you, we can push it to Bioconductor for the upcoming October release.

Sep 16 '25 08:09 PoisonAlien

Hi, I am testing it now. I am using DNA methylation from cfDNA, so not all the sites are covered. However, I am wondering why I end up with such a low number of sites. Trying with 3 files ranging from 2.5-14 million CG m sites.

meth <- read_modkit(bed_files[2:4], combine_strands = T, n_cores = 10, ref_fasta = "Q:/genome.fa", quality_filter = 0, min_coverage = 1)
Reading ModKit bedMethyl files...
Files: 3
Target modification: 5-methylcytosine (5mC)
Samples: 3
Reference FASTA: genome.fa
Will extract reference base and sequence context
Parallel cores: 10
Auto-discovering chromosomes from 3 files...
  Scanning W:/PEAR-OP/LCS/BED/20250716_1130_MN48054_FBA70268_464b4301/LCS_1542418.bed.bgz...
  Scanning W:/PEAR-OP/LCS/BED/20250723_0949_MN48054_FBA76850_bdb6eb3f/LCS_159286.bed.bgz...
  Scanning W:/PEAR-OP/LCS/BED/20250731_1129_MN48054_FAY20326_0ed7aa9e/LCS_360272.bed.bgz...
Discovered 25 chromosomes with data
Generated 26 bins of size 1000000 covering data regions
Processing 26 bins sequentially...
Reference FASTA provided - extracting sequence context...
C processing completed in: 11.543 secs
Sites discovered: 405,071
Reference context extracted for all sites
Added reference base and context information to rowData
Calculating basic statistics...
Total processing time: 11.69237 secs
Created methrix object: 405071 sites × 3 samples
Context distribution:
  CG: 405,071

Am I missing something?

Sep 25 '25 14:09 tkik

Hi, Thanks for testing. Could you please install again from modkit branch and try again? There were bunch of bugs with the tabix file streaming that caused the issue. I have fixed them most. There are additional features to read in only specific chromosomes or genomic ranges. Let me know if it works as expected.

# Basic 5mC analysis (chromosome sizes file required)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes", target_mod = "m")

# With context extraction
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
                       ref_fasta = "hg38.fa", target_mod = "m")

# Large dataset with HDF5 backend (memory efficient)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
                       h5 = TRUE, h5_dir = "./h5_data")

# With custom interval size for memory control
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes", interval_size = 5000000)

# Process only chr1 and chr2
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
                       chromosomes = c("chr1", "chr2"))

# Process specific genomic regions (BED file)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
                       regions = "promoters.bed")

# Process custom regions (data.frame)
my_regions <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  start = c(1000000, 5000000, 2000000),
  end = c(2000000, 6000000, 3000000)
)
meth <- read_modkit(modkit_files, chrom_sizes = "hg38.chrom.sizes",
                       regions = my_regions)

Oct 01 '25 12:10 PoisonAlien