
Error and No Motif Output in modkit motif search Using Direct RNA-seq Data

Open Seongmin-Jang-1165 opened this issue 6 months ago • 2 comments

Dear developer,

I am currently analyzing m6A-seq data generated from direct RNA sequencing and attempting to use modkit motif search for motif discovery. However, I have encountered some issues that I would like to ask for your guidance on.

When I run the modkit motif search command using the original bedmethyl file, the command executes without errors, but no motifs are extracted. I suspect this might be due to:

  1. The library was constructed targeting a specific subset of transcripts rather than the full transcriptome.

  2. As a result, the data yield and quality may not be optimal.

To address this, I extracted only the entries corresponding to the transcripts of interest from the original bedmethyl file and created a new, filtered BED file. However, when I run modkit motif search with this new BED file, I receive the following error:

Error! failed to parse any bedmethyl records

Here are some points I would appreciate clarification on:

  1. In the BED file, the first column contains transcript IDs rather than chromosome names. I suspect this is because I used GRCh38.primary_assembly.transcriptome.mmi as the reference during dorado_aligner alignment. Could this be the cause of the issue when performing motif search?

  2. I have tested both the transcriptome and genome FASTA references for motif search with the original BED file. The command runs without error, but again, no motifs are extracted—possibly because the transcripts I am targeting represent only a very small portion of the entire dataset.

  3. When modifying the BED file to extract specific transcript entries, I ensured that the format was preserved and consistent with the original.

My main questions are:

  1. Does the reference FASTA used for motif search need to exactly match the transcript/chromosome IDs used in the BEDmethyl file?

  2. Is it possible to perform motif search on a subset of transcripts using such a filtered BEDmethyl file, and if so, how should the references and format be configured to avoid parsing errors?

Thank you very much for your time and support. I look forward to your response.

Seongmin-Jang-1165, Jun 03 '25 04:06

Hello @Seongmin-Jang-1165,

When I run the modkit motif search command using the original bedmethyl file, the command executes without errors, but no motifs are extracted. I suspect this might be due to: The library was constructed targeting a specific subset of transcripts rather than the full transcriptome. As a result, the data yield and quality may not be optimal.

You may want to check the valid coverage (column 10) on the transcripts that you're targeting. The default minimum coverage is 5, so if most sites don't have at least that many valid calls, it's likely that many of the bedMethyl records are being discarded.
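A rough way to check this from the shell is sketched below. It assumes the standard bedMethyl layout (contig name in column 1, valid coverage in column 10) and whitespace-separated fields; the function name `valid_cov_summary` is only illustrative and is not part of modkit.

```shell
# Per contig/transcript, print: name, total records, records with valid coverage >= threshold.
# $1 = uncompressed bedMethyl file, $2 = minimum valid coverage (defaults to 5 here)
valid_cov_summary() {
  awk -v min="${2:-5}" '
    { total[$1]++; if (($10 + 0) >= (min + 0)) pass[$1]++ }
    END { for (c in total) printf "%s\t%d\t%d\n", c, total[c], pass[c] + 0 }
  ' "$1" | sort
}

# Example call (file name taken from this thread):
#   valid_cov_summary m6A_barcode1.bed 5
```

Transcripts where the third number is close to zero contribute few or no sites at that threshold.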

To address this, I extracted only the entries corresponding to the transcripts of interest from the original bedmethyl file and created a new, filtered BED file. However, when I run modkit motif search with this new BED file, I receive the following error:

How did you perform the filtering? Could you send me the full log with the entire error? You might consider trying the --contig option with one of the transcripts that you're targeting (using the original bedMethyl table). Probably a less error-prone way to subset which transcripts to search is to subset the input reference FASTA to only transcripts you are targeting.

In the BED file, the first column contains transcript IDs rather than chromosome names. I suspect this is because I used GRCh38.primary_assembly.transcriptome.mmi as the reference during dorado_aligner alignment. Could this be the cause of the issue when performing motif search?

You should use the same reference that you aligned to, so if you aligned to the transcriptome, you should use that for modkit motif search.

I have tested both the transcriptome and genome FASTA references for motif search with the original BED file. The command runs without error, but again, no motifs are extracted—possibly because the transcripts I am targeting represent only a very small portion of the entire dataset.

It's possible that, if the "background" bedMethyl records have non-specific or very low methylation, the sites on those transcripts are diluting the real motifs on the transcripts that you're targeting. I would try using a subset of the FASTA as suggested above. Also, if there is a motif you suspect to be present, you could try modkit motif evaluate and see what statistics it has. It's possible there aren't any strong motifs in the sample.

When modifying the BED file to extract specific transcript entries, I ensured that the format was preserved and consistent with the original.

If you show me the command you used for filtering and the exact error, maybe I can tell you why it's not working.

Does the reference FASTA used for motif search need to exactly match the transcript/chromosome IDs used in the BEDmethyl file?

Yes, the reference names must exactly match.
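One way to check the names before running the search is sketched below. It assumes that a FASTA record's name is the header text up to the first space or tab, and that the bedMethyl contig name is column 1; `missing_contigs` is a hypothetical helper name, not a modkit command.

```shell
# Print each bedMethyl contig name (column 1) that has no matching record name in the FASTA.
# $1 = uncompressed bedMethyl file, $2 = reference FASTA
missing_contigs() {
  awk -v fasta="$2" '
    BEGIN {
      # Collect FASTA record names: header text after ">" up to the first space or tab
      while ((getline line < fasta) > 0)
        if (line ~ /^>/) { split(substr(line, 2), f, /[ \t]/); ref[f[1]] = 1 }
    }
    !($1 in ref) && !seen[$1]++ { print $1 }
  ' "$1"
}

# Example call (file names taken from this thread):
#   missing_contigs m6A_barcode1.bed gencode.v43.transcripts.fa
```

Empty output means every name in the bedMethyl table was found in the FASTA; any name printed would not match exactly.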

Is it possible to perform motif search on a subset of transcripts using such a filtered BEDmethyl file, and if so, how should the references and format be configured to avoid parsing errors?

I would try using the --contig option first on a specific transcript that you suspect has a motif. Then filter the reference FASTA to include only transcript records that you want to search instead of filtering the bedMethyl table. If this is a common use case, I may consider adding an --include-bed option that will only search a subset of the input bedMethyl and references.
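Subsetting the reference FASTA could look roughly like the sketch below. It assumes a plain-text list with one transcript ID per line, written exactly as the name appears in the FASTA header (the text after ">" up to the first whitespace); `subset_fasta` is an illustrative name only.

```shell
# Keep only the FASTA records whose name appears in the ID list.
# $1 = ID list (one name per line), $2 = input FASTA; result goes to stdout
subset_fasta() {
  awk -v ids="$1" '
    BEGIN { while ((getline line < ids) > 0) if (line != "") keep[line] = 1 }
    /^>/ { printing = (substr($1, 2) in keep) }   # decide once per record header
    printing                                      # print header and sequence lines of kept records
  ' "$2"
}

# Example call (file names taken from this thread; the output name is a placeholder):
#   subset_fasta modkit_Trscrpt_RDlist.txt gencode.v43.transcripts.fa > targeted_transcripts.fa
```

The resulting FASTA would then be passed to --ref together with the original, unfiltered bedMethyl table.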

ArtRand, Jun 06 '25 21:06

@ArtRand Hello, thank you very much for your kind and detailed explanation.

First, after checking the valid coverage column, I found that the transcripts I was not targeting showed low values, whereas many of the transcripts I was targeting had relatively high values, although there were also some with low coverage.

I also tried the --contig option, but there were no results.

Below, I will describe how I processed the data.

----------------------------------------preprocessing ---------------------------------------

dorado basecaller sup,inosine_m6A barcode1.pod5 > DORADO_m6A_barcode1.bam
samtools sort DORADO_m6A_barcode1.bam -o DORADO_m6A_barcode1_sorted.bam
samtools index -b DORADO_m6A_barcode1_sorted.bam

I want to call m6A de novo, so I used the inosine_m6A option and ignored inosine in the later processing.

dorado aligner GRCh38.primary_assembly.transcriptome.mmi DORADO_m6A_barcode1.bam > m6A_barcode1_dorado_aligned.bam
samtools sort m6A_barcode1_dorado_aligned.bam -o m6A_barcode1_dorado_aligned_sorted.bam
samtools index -b m6A_barcode1_dorado_aligned_sorted.bam
modkit pileup m6A_barcode1_dorado_aligned_sorted.bam m6A_barcode1.bed --log-filepath m6A_barcode1.log --ignore 17596

----------------------------------------Motif search----------------------------------------

  1. Try with Transcript.fa

modkit motif search --in-bedmethyl m6A_barcode1.bed.gz --ref gencode.v43.transcripts.fa -o motifs.tsv --threads 10 --log modkit_find_motifs_log.txt

Image

  2. Try with Genome.fa

modkit motif search --in-bedmethyl m6A_barcode1.bed.gz --ref GRCh38.primary_assembly.genome.fa -o motifs.tsv --threads 10 --log modkit_find_motifs_log.txt

Image

  3. Try with Transcriptome.mmi

modkit motif search --in-bedmethyl m6A_barcode1.bed.gz --ref GRCh38.primary_assembly.transcriptome.mmi -o motifs.tsv --threads 10 --log modkit_find_motifs_log.txt

Image

----------------------------------------filter BED and retry----------------------------------------

  1. Make transcript ID list (modkit_Trscrpt_RDlist.txt)

  2. BED file filtering (I'll upload the file in Excel format: m6A_barcode1_RD.bed.xlsx)

grep -F -f modkit_Trscrpt_RDlist.txt m6A_barcode1.bed > m6A_barcode1_Rdmotif.bed
bgzip m6A_barcode1_Rdmotif.bed
tabix -p bed m6A_barcode1_Rdmotif.bed.gz
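For comparison, note that grep -F -f treats each ID as a substring and matches it anywhere on the line, so an ID that is a prefix of another ID also selects the longer one. Whether that is related to the parsing error is not known. A minimal sketch that matches column 1 exactly and prints the selected lines unchanged is shown below; it assumes the ID list holds one name per line exactly as written in column 1, and `filter_bedmethyl` is an illustrative name only.

```shell
# Keep only bedMethyl rows whose first column exactly equals an ID from the list.
# $1 = ID list (one name per line), $2 = uncompressed bedMethyl file; result goes to stdout
filter_bedmethyl() {
  awk -v ids="$1" '
    BEGIN { while ((getline line < ids) > 0) if (line != "") keep[line] = 1 }
    ($1 in keep)    # no fields are modified, so each kept line is printed as-is
  ' "$2"
}

# Example call (file names taken from this thread):
#   filter_bedmethyl modkit_Trscrpt_RDlist.txt m6A_barcode1.bed > m6A_barcode1_Rdmotif.bed
```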

Image

  3. Run motif search

modkit motif search --in-bedmethyl m6A_barcode1_Rdmotif.bed.gz --ref gencode.v43.transcripts.fa -o motifs.tsv --threads 10 --log modkit_find_motifs_log.txt

Image

m6A_barcode1_RD.bed.xlsx

Thank you..!!

Seongmin-Jang-1165, Jun 07 '25 04:06