
Out of memory in modkit find-motifs

Open hannan666666 opened this issue 1 year ago • 4 comments

My command:

```shell
modkit find-motifs \
  -i ${modkit_d}/${sample}.pileup.bed.gz \
  -r ${ref} \
  -o ${modkit_d}/${sample}.denovo-motifs.stats.1009533.tsv \
  --threads 20 \
  --min-coverage 10 \
  --min-frac-mod 0.95 \
  --exhaustive-seed-min-log-odds 5 \
  --exhaustive-seed-len 2 \
  --log ${modkit_d}/${sample}.denovo-motifs.stats.1009553.tsv.log
```

I'm encountering an issue where my process is frequently killed due to memory exhaustion. I've followed the recommendations in the manual and adjusted the parameters as suggested, but the issue persists. Here is the relevant kernel log snippet:

```
Dec 24 21:43:41 RS720A-E12-RS12 kernel: [2176946.661901] Out of memory: Killed process 1503773 (modkit) total-vm:521209448kB, anon-rss:471939852kB, file-rss:279552kB, shmem-rss:0kB, UID:1005 pgtables:930900kB oom_score_adj:0
```

Could you please advise on any additional steps I can take to prevent this from happening? I appreciate any suggestions or insights you can offer.

hannan666666 commented Dec 27 '24

Hello @hannan666666,

That seems like quite a lot. In your command, is ${modkit_d}/${sample}.pileup.bed.gz actually a bgzip-compressed bedMethyl file? This command expects an uncompressed bedMethyl file; in a quick test, when I pass compressed input the program halts, saying that it couldn't successfully parse any bedMethyl records. Could you tell me the size of the uncompressed bedMethyl file and of the reference sequence you're using? Also, what is the compute environment?
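If compressed input turns out to be the problem, one way forward (a hedged sketch; the variable names are the placeholders from the command above) is to decompress first. Since bgzip output is a valid gzip stream, plain gunzip can read it:

```shell
# bgzip-compressed files are gzip-compatible, so gunzip can decompress them.
# Paths are hypothetical placeholders matching the command above.
gunzip -c "${modkit_d}/${sample}.pileup.bed.gz" > "${modkit_d}/${sample}.pileup.bed"
```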

ArtRand commented Dec 30 '24

Thank you for your response. Yes, ${modkit_d}/${sample}.pileup.bed.gz is a bgzip-compressed bedMethyl file. The uncompressed file is approximately 100GB, and the reference sequence size is 2.6GB. My machine has 1.5TB of RAM and 384 CPU cores, but since it is shared by multiple users, the available memory is limited to 0.5TB.

When I split ${sample}.pileup.bed.gz by chromosome and run each chromosome's data with the --skip-research parameter, the process completes successfully in about 20 minutes. However, without --skip-research, it has been running for over 2 days without finishing. Processing chromosomes separately might lose some information, so I was wondering whether there is a more memory-efficient way to process the entire sample at once. I will also try using an uncompressed bedMethyl file as suggested. Happy new year! @ArtRand
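For reference, the per-chromosome split described above can be sketched with awk (hypothetical file names; assumes an uncompressed bedMethyl whose first column is the chromosome):

```shell
# Append each record to a per-chromosome file named after column 1,
# e.g. chr1.pileup.bed. Remove old outputs before rerunning, since this appends.
# gawk keeps many output files open at once; with mawk or references that have
# thousands of contigs, calling close() after each print may be needed.
awk '{ print >> ($1 ".pileup.bed") }' "${sample}.pileup.bed"
```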

hannan666666 commented Dec 31 '24

Hello @hannan666666,

I see; you certainly have enough compute resources. Do you have an idea of the rough "overall modification" level in your sample? You could estimate this quickly with modkit summary. If you have very low modification rates, the first round of "seed-based" searching will not eliminate many sequence contexts, leaving the exhaustive search many more contexts to evaluate. You may try adjusting the parameters as suggested here. I have some ideas for how to improve the scalability of this algorithm, but they're not ready yet.

ArtRand commented Dec 31 '24

Thank you for your reply! Here is the rough "overall modification" level in my sample:

```
(base) hannan@RS720A-E12-RS12:/data3/hannan/data/nanopore_process/nanopore_04_modkit/WT_test$ cat WT_test.6mA_5mC5hmC.summary
 bases             A,C
 total_reads_used  10042
 count_reads_C     10028
 count_reads_A     10041
 pass_threshold_C  0.76171875
 pass_threshold_A  0.8183594
 base  code  pass_count  pass_frac    all_count  all_frac
 C     -     34035918    0.95676976   37163434   0.94260883
 C     h     211432      0.005943478  443679     0.011253421
 C     m     1326433     0.03728681   1819032    0.04613771
 A     -     47796464    0.9845493    51567510   0.95741826
 A     a     750078      0.015450698  2293496    0.042581752
```
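As a quick arithmetic check on the all_count column above, the overall modification levels work out to only a few percent per canonical base:

```shell
# Fraction of modified calls per canonical base, from the all_count values above.
awk 'BEGIN {
  c_mod = 443679 + 1819032   # 5hmC (h) + 5mC (m) calls
  a_mod = 2293496            # 6mA (a) calls
  printf "modified C frac: %.3f\n", c_mod / (37163434 + c_mod)   # prints 0.057
  printf "modified A frac: %.3f\n", a_mod / (51567510 + a_mod)   # prints 0.043
}'
```

Rates this low would leave the seed-based round little to eliminate, which fits the long exhaustive search described earlier in the thread.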

I have adjusted the parameters as suggested here. Looking forward to your good news on the algorithm!

hannan666666 commented Jan 02 '25