modkit localise arguments discrepancy question
Hello!
I'm unsure who to reach out to about this, but I'm trying to run the modkit localise (v.0.4.2 because the institution has not upgraded to 0.4.4 yet, but I submitted a request today for this) and I seem to be hitting a wall with this no matter what I input.
The github document does not feel completely clear to me regarding the required arguments, and when I have run modkit localise --help in the linux CLI the only argument it tells me is required is this:
Arguments: <IN_BEDMETHYL> Input bedMethyl table. Should be bgzip-compressed and have an associated Tabix index. The tabix index will be assumed to be $this_file.tbi
My questions are, perhaps:
- Why can't I use a bam file that has been aligned, sorted, and indexed prior to running the localise command?
- Do I need to put in the .gz file or the .tbi file as well into the localise command?
Regarding the subcommands:
- In the --regions subcommand, do I need to create a bed file for only the gene or chromosome that I am analyzing?
Hello @jmlavigne1,
I'm unsure who to reach out to about this,
You've come to the right place!
- Why can't I use a bam file that has been aligned, sorted, and indexed prior to running the localise command?
In theory, Modkit could generate the pileup in situ, and I've considered adding this option, but right now localize requires a compressed bedMethyl table. Generally speaking, there are two "flavors" of Modkit commands, ones that take modBAMs (pileup, sample-probs, entropy) and ones that use bedMethyl tables (localize, dmr).
- Do I need to put in the .gz file or the .tbi file as well into the localise command?
Just the .gz bedMethyl table, from the help:
Arguments:
<IN_BEDMETHYL> Input bedMethyl table. Should be bgzip-compressed and have an
associated Tabix index. The tabix index will be assumed to be
$this_file.tbi
Here's an example from the documentation:
# you can use `localize` also
$ modkit localise ${bedmethyl} --regions ${ctcf} --genome-sizes ${sizes}
Note that ${sizes} can be the .fai FASTA index.
Here is the code that I am trying, but I keep hitting a wall:
modkit localise 240222_barcode10_as_12MAR2025.bam.gz --regions 4:19031768-19033832 --genome-sizes Danio_rerio.GRCz11.dna.primary_assembly.tsv -o 240222_barcode10_LOCALISE_21MAR2025.tsv
And here is the error message that I keep receiving:
> loading sequence lengths from "Danio_rerio.GRCz11.dna.primary_assembly.tsv" /var/spool/slurmd/job62660994/slurm_script: line 16: 750619 Segmentation fault (core dumped) modkit localise 240222_barcode10_as_12MAR2025.bam.gz --regions 4:19031768-19033832 --genome-sizes Danio_rerio.GRCz11.dna.primary_assembly.tsv -o 240222_barcode10_LOCALISE_21MAR2025.tsv
I am not really certain how to interpret this error message and I have tried to adjust the code above without much luck.
Additionally, I have:
- checked to make sure that the tsv in the genome-sizes flag is 2 columns. I initially tried using a .fai fasta index in the --genome-size and it did not appear to work.
- made sure that the regions SN is correct (4:19031768-19033832 with no spaces and there is no "chrom" before 4).
- I have increased the CPUs
- and I have now changed the 'localise' to 'localize' to see if there is any difference.
Hello @jmlavigne1,
There are two problems with the command you have:
- The input needs to be a bedmethyl table, not a modBAM file.
- The
--regionsargument should be a BED file, not a "region string". So the content of that file would be (based on the BED format):
4 19031768 19033832
So the steps would be generally:
# 240222_barcode10_as_12MAR2025.bam needs to be sorted with an associated index at 240222_barcode10_as_12MAR2025.bam.bai
modkit pileup 240222_barcode10_as_12MAR2025.bam 240222_barcode10_as_12MAR2025_pileup.bed --threads 20 [options]
# where options could be --preset traditional --ref ${reference_fasta} or anything else
bgzip 240222_barcode10_as_12MAR2025_pileup.bed
tabix 240222_barcode10_as_12MAR2025_pileup.bed.gz
# regions.bed is the file above
modkit localise 240222_barcode10_as_12MAR2025_pileup.bed.gz \
--regions regions.bed \
--genome-sizes Danio_rerio.GRCz11.dna.primary_assembly.tsv \
-o 240222_barcode10_LOCALISE_21MAR2025.tsv
I'll fix the parsing so that you don't get a segfault (shouldn't happen), and that the command gives you a better error message.
Aside: If you want to speed this up, you could first make a BED file with the region you care about plus a little extra on the sides
$ bedtools slop -i regions.bed -b 2000 -g Danio_rerio.GRCz11.dna.primary_assembly.tsv > regions_slop2000.bed
Then pass regions_slop2000.bed to pileup with the --include-bed option:
modkit pileup 240222_barcode10_as_12MAR2025.bam 240222_barcode10_as_12MAR2025_pileup.bed \
--include-bed regions_slop2000.bed \
--threads 20 [options]
but this is not required.
(localize vs localise is just a little joke).
HI @ArtRand, thank you again! I was able to construct a modkit localise product file.
Hi @ArtRand,
I successfully created the file I wanted, and now I'm trying to add --chart to generate a visual. I receive error messages over and over when I input the --chart but indicating that I need to input a <CHART_FILEPATH> following the -o flag but it's been no good. Below is the line of code that I am stuck on.
modkit localize --chart -o localise_10_grin1b.png barcode10_dorado_pileup_4MAR2025_headless.bed.gz --regions grin1b_barcode10.bed --genome-sizes Danio_rerio.GRCz11.dna.primary_assembly.tsv
what do you recommend or what sticks out to you immediately?
Hello @jmlavigne1 The --chart option requires a path to write the HTML file to so you need:
modkit localize \
--chart localise_chart_file.html \ # <-- this part
-o localise_10_grin1b.png barcode10_dorado_pileup_4MAR2025_headless.bed.gz
--regions grin1b_barcode10.bed
--genome-sizes Danio_rerio.GRCz11.dna.primary_assembly.tsv
@ArtRand thank you again and I appreciate your explanations!
Hi @ArtRand another question came up recently related to this:
Does the region.bed in the modkit localise input align with the transcriptional start site with the origin of the x-axis of the graph shown in the online documentation? How does modkit determine this?
When I created a region.bed file for input I chose a gene, using the coordinate chromosome, starting point and end point in the animal model genome from UCSC genome browser. When I graphed this, I think that I incorrectly assumed that the "0" on the x-axis was the center of the gene, and that the gene was flanked by a range of bps (ex. 2000 toward 5' and 2000 toward 3').
Please let me know if this does not make sense and I will try to ask the question a different way.
Hello @jmlavigne1,
I think that I incorrectly assumed that the "0" on the x-axis was the center of the gene, and that the gene was flanked by a range of bps (ex. 2000 toward 5' and 2000 toward 3').
You're correct, this isn't exactly how it works, localize will take the midpoint of the BED regions then walk 2000bp towards the 5' and 2000bp towards the 3' to make the "window". You can change this number with the --window option, but it's not asymmetric - yet. If you're looking at a single gene, maybe extract just that region and convert it to bigWig (modkit bedmethyl tobigwig)? The use case behind localize is to look at patterns aggregated across multiple genomic features.
It's on the roadmap to make this command more like a "metagene" plot where the methylation pattern can be visualized across multiple genomic features of varying length.
Hi @ArtRand,
just a couple follow up questions:
1.) so if my region.bed is 4 19031768 19033832 are you saying that the "0" on the x-axis will be the midpoint between 19031768 19033832?
2.) What do you mean by "maybe extract just that region and convert it to bigWig (modkit bedmethyl tobigwig)" ? I assume that by creating a region.bed that was chromosome start point end point I had extracted just that region. So for the region.bed.
3.) So if the gene I am attempting to visualize is 4 19031768 19033832 should I use the --window option to fit these parameters?
thank you so much for all of your explanations so far - I find it all really helpful!
joe
Hello @jmlavigne1
1.) so if my region.bed is 4 19031768 19033832 are you saying that the "0" on the x-axis will be the midpoint between 19031768 19033832?
Correct.
Regarding (2) and (3):
If you are trying to visualize the modification information in a single region, what you want (I think) is a sequence of floats, each representing the %-modification at that a genomic position, from the start of the region to the end. Then you can plot this sequence on a genome browser or in any potting program. You can get this sequence a couple ways:
# make the bedMethyl table first (you probably did this already)
modkit pileup ${modbam} pileup.bed [options]
bgzip pileup.bed > pileup.bed.gz
tabix pileup.bed.gz
# extract the slice of data from your bedMethyl table
tabix pileup.bed.gz $chrom:$start-$end
# or
tabix pileup.bed.gz -R $regions.bed
# another way is to make a "bigWig" file of one of the tracks
tabix pileup.bed.gz $chrom:$start-$end | \
modkit bedmethyl tobigwig - ${output_bigwig} \
-g $genome.fasta.fai \
--mod-codes m
BigWig is a commonly used format for visualizing this kind of data.
The reason I would not recommend localize for this work is that the purpose of localize is to aggregate the methylation pattern over many regions. In your case, you only have one region, so just look at that. If you wanted to, say, look at all promoters, or all ChIP peaks, that's what localize is for.
Hope this helps.
A