modkit sample-probs with 2OmeACUG

Open kir1to455 opened this issue 7 months ago • 2 comments

Hi, @ArtRand

I encountered some issues while using modkit sample-probs with 2OmeACUG.

Here is my code:

${modkitDir}/modkit sample-probs ${bamfile}/input_merge_sup_m6A_pseU_m5C_inosine_2OmeACUG.mod.sorted.bam -t 40 --log-filepath ${bamfile}/input_merge_sup_2OmeA_sample_prob/input_merge_sup_2OmeA.sample_pob.log --percentiles 0.1,0.25,0.5,0.75,0.9 --out-dir ${bamfile}/input_merge_sup_2OmeA_sample_prob --hist --num-reads 20084 --include-bed ${index_dir}/A_0_transcripts.bed --only-mapped

${modkitDir}/modkit sample-probs ${bamfile}/input_merge_sup_m6A_pseU_m5C_inosine_2OmeACUG.mod.sorted.bam -t 40 --log-filepath ${bamfile}/input_merge_sup_2OmeG_sample_prob/input_merge_sup_2OmeG.sample_pob.log --percentiles 0.1,0.25,0.5,0.75,0.9 --out-dir ${bamfile}/input_merge_sup_2OmeG_sample_prob --hist --num-reads 20084 --include-bed ${index_dir}/G_0_transcripts.bed --only-mapped

I used probabilities.tsv to generate the plots. The first and second images are 2OmeA:

Image Image

The third and fourth images are 2OmeG:

Image Image
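For anyone reproducing these plots, here is a minimal Python sketch of how the histogram bins in probabilities.tsv could be grouped per base and modification code before plotting. The column names (code, primary_base, range_start, range_end, count) are an assumption about the file layout, so check them against the header of your own file. The codes CODE_A and CODE_G and all counts below are made-up placeholders, not values from this issue.

```python
import csv
import io
from collections import defaultdict


def load_histograms(handle):
    """Group histogram bins by (primary_base, code).

    Assumes a tab-separated file with a header row containing the columns
    code, primary_base, range_start, range_end and count. This layout is an
    assumption; check the header of your own probabilities.tsv.
    """
    bins = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["primary_base"], row["code"])
        bins[key].append(
            (float(row["range_start"]), float(row["range_end"]), int(row["count"]))
        )
    return bins


def normalize(bin_list):
    """Convert raw counts to fractions so classes with different totals compare fairly."""
    total = sum(count for _, _, count in bin_list)
    if total == 0:
        return [(start, end, 0.0) for start, end, _ in bin_list]
    return [(start, end, count / total) for start, end, count in bin_list]


# Made-up example rows with placeholder codes; not real data.
EXAMPLE = (
    "code\tprimary_base\trange_start\trange_end\tcount\n"
    "CODE_A\tA\t0.50\t0.75\t30\n"
    "CODE_A\tA\t0.75\t1.00\t10\n"
    "CODE_G\tG\t0.50\t0.75\t5\n"
    "CODE_G\tG\t0.75\t1.00\t15\n"
)

histograms = load_histograms(io.StringIO(EXAMPLE))
for key, bin_list in sorted(histograms.items()):
    print(key, normalize(bin_list))
```

Normalizing each class to fractions makes the A and G histograms comparable even when the number of sampled calls differs between them.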

Why is there such a huge difference between 2OmeA and 2OmeG?

Best wishes, Kirito

kir1to455 • May 26 '25 07:05

The primary difference here is the number of modified bases predicted by the models. The G model predicts only the 2'Ome modified base, while the A mods model outputs 2'Ome along with m6A and inosine. This results in a larger range of possible outputs and associated distributions. If you use modkit to ignore m6A and inosine, I imagine you will find that the distributions are much more similar, assuming there are no real 2'Ome bases in this sample and that this is an interrogation of the false-positive distribution of probabilities. I hope this helps, but please reach out if more clarification would help.
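One way to see why more output classes widen the range of probabilities: when class probabilities sum to 1, the winning class can never fall below 1 divided by the number of classes. The Python sketch below illustrates that bound with made-up per-read probabilities. The class names and numbers are illustrative assumptions only, and this is not a description of how modkit or the basecaller computes its calls.

```python
def top_call(probs):
    """Return the (class name, probability) pair with the highest probability."""
    name = max(probs, key=probs.get)
    return name, probs[name]


def min_top_probability(n_classes):
    """Lowest probability the winning class can have when n_classes values sum to 1."""
    return 1.0 / n_classes


# Made-up per-read probabilities for illustration; not real model output.
g_read = {"canonical": 0.55, "2OmeG": 0.45}  # two classes
a_read = {"canonical": 0.30, "m6A": 0.25, "inosine": 0.10, "2OmeA": 0.35}  # four classes

for label, read in (("G model", g_read), ("A model", a_read)):
    name, prob = top_call(read)
    print(label, "top call:", name, prob, "lower bound:", min_top_probability(len(read)))
```

With two classes the winning probability is always at least 0.5, whereas with four classes a 2'Ome call can win at 0.35, so the A histograms can extend into a region the G histograms cannot reach.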

marcus1487 • May 27 '25 18:05

Hi, @marcus1487 Thank you for your prompt reply.

I only used the 2'Ome modification to plot the probabilities. I have cropped a portion of the modification image.

The first image shows only A bases, from counts.html. Image

The second image shows only G bases, from counts.html. Image It can be seen that across a considerable part of the region there is no 2'Ome G modification.

After running modkit sample-probs, I used modkit pileup to call 2'Ome modifications, setting --mod-thresholds to 0.97. Here is my code for 2'OmeU; I used the same --mod-thresholds value for the other modifications:

${modkitDir}/modkit pileup input_merge_sup_m6A_pseU_m5C_inosine_2OmeACUG.mod.sorted.bam input_merge_sup.pass.2OmeU.bed --ref gencode.vM33.normal.transcripts.fa --include-bed pseU_0_transcripts.bed --motif T 0 --log-filepath input_merge_sup_2OmeU.log --num-reads 20084 --max-depth 20000 --filter-threshold T:0.9 --mod-thresholds 19227:0.97 -t 40

awk '{if($4==19227) print$0}' input_merge_sup.pass.2OmeU.bed > input_merge_sup.pass.2OmeU.filter.bed

I required coverage >= 20, modnum >= 20, and site ratio >= 0.1 for each 2'Ome site. However, when I mapped the 2'Ome sites to the genome, I found that most of them were enriched at the 3' UTR. This leaves me very confused. Image
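For reference, the site filter described above could be sketched in Python roughly as follows. The column positions (column 4 for the modification code, column 10 for valid coverage, column 12 for the number of modified calls) are an assumption about the bedMethyl layout, and the ratio is computed here as modified calls divided by valid coverage, which may not be exactly how the ratio was derived in the original analysis. The function name and the example rows are hypothetical.

```python
def passes_site_filter(line, mod_code="19227", min_cov=20, min_mod=20, min_ratio=0.1):
    """Return True if a bedMethyl row meets the coverage, count and ratio cutoffs.

    Column positions are an assumption about the bedMethyl layout:
    column 4 = modification code, column 10 = valid coverage,
    column 12 = number of modified calls.
    """
    fields = line.split()  # handles tab- or space-separated columns
    if len(fields) < 12 or fields[3] != mod_code:
        return False
    valid_cov = int(fields[9])
    n_mod = int(fields[11])
    if valid_cov == 0 or valid_cov < min_cov or n_mod < min_mod:
        return False
    return n_mod / valid_cov >= min_ratio


def make_row(code, valid_cov, n_mod):
    """Build a made-up bedMethyl-like row for demonstration (not real data)."""
    fields = ["tx1", "10", "11", code, str(valid_cov), "+", "10", "11", "255,0,0",
              str(valid_cov), "0.00", str(n_mod)]
    return "\t".join(fields)


rows = [make_row("19227", 100, 25), make_row("19227", 300, 25), make_row("other", 100, 25)]
kept = [row for row in rows if passes_site_filter(row)]
print(len(kept), "of", len(rows), "rows kept")
```

Note that with both cutoffs set to 20, the modified-call cutoff already implies the coverage cutoff, so the ratio threshold is what removes high-coverage sites with few modified calls.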

Best wishes, Kirito

kir1to455 • May 29 '25 14:05