Out of Memory but the job seemed already finished
Hello!
I ran EarlGrey (v4.4.4) on multiple genomes (500-600 Mb each) using Slurm. Some jobs completed, but others showed Out Of Memory (exit code 0).
For the OOM jobs, I checked the log file generated by earlGrey, and it seemed that the pipeline had completed, like the following:
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< TE library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in ../01.04_hy_v4h2_EarlGrey/01.04_hy_v4h2_summaryFiles/ >>>
And the number of files in the summary folder is the same as those genomes with completed run.
ls -l 01.04_hy_v4h2_EarlGrey/01.04_hy_v4h2_summaryFiles/
total 175840
-rw-rw-r--. 1 cflthc powerplant 7979 Sep 26 06:18 01.04_hy_v4h2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant 624387 Sep 26 06:18 01.04_hy_v4h2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant 5679703 Sep 26 06:18 01.04_hy_v4h2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant 306155 Sep 26 01:58 01.04_hy_v4h2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant 38064532 Sep 26 06:18 01.04_hy_v4h2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 111409661 Sep 26 06:18 01.04_hy_v4h2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant 542 Sep 26 01:58 01.04_hy_v4h2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant 9184 Sep 26 06:18 01.04_hy_v4h2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant 8157 Sep 26 01:58 01.04_hy_v4h2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant 12325 Sep 26 06:18 01.04_hy_v4h2_superfamily_div_plot.pdf
What could cause the OOM error? Which step consumes the most RAM? Should I rerun EarlGrey for the jobs with OOM errors, or can I ignore them? Or could this be a problem with our Slurm system instead?
p.s. I used 16 cores and 60G of RAM for each job.
Any guidance is much appreciated.
Cheers Ting-Hsuan
Update: I compared two runs of the same genome. The first run was given 50G of RAM and failed with OOM; the second was given 60G of RAM and completed. The output files (especially the TE library and bed/gff files) in the summary folder of the completed run were larger than those of the OOM run, so I guess I'll need to rerun earlGrey for the failed genomes.
The file content in the summary folder of OOM run:
total 176712
-rw-rw-r--. 1 cflthc powerplant 7630 Sep 22 13:19 01.01_red5_v2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant 619087 Sep 22 13:19 01.01_red5_v2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant 5795045 Sep 22 13:19 01.01_red5_v2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant 303431 Sep 22 09:18 01.01_red5_v2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant 37979930 Sep 22 13:19 01.01_red5_v2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 112088923 Sep 22 13:19 01.01_red5_v2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant 489 Sep 22 09:18 01.01_red5_v2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant 8524 Sep 22 13:19 01.01_red5_v2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant 7878 Sep 22 09:18 01.01_red5_v2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant 12786 Sep 22 13:19 01.01_red5_v2_superfamily_div_plot.pdf
The file content in the summary folder of completed run:
total 175776
-rw-rw-r--. 1 cflthc powerplant 7674 Sep 25 21:23 01.01_red5_v2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant 623389 Sep 25 21:23 01.01_red5_v2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant 6183721 Sep 25 21:23 01.01_red5_v2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant 305545 Sep 25 16:16 01.01_red5_v2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant 37714153 Sep 25 21:23 01.01_red5_v2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 111215427 Sep 25 21:23 01.01_red5_v2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant 489 Sep 25 16:16 01.01_red5_v2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant 8563 Sep 25 21:23 01.01_red5_v2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant 7880 Sep 25 16:16 01.01_red5_v2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant 12335 Sep 25 21:23 01.01_red5_v2_superfamily_div_plot.pdf
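The size comparison above can be repeated for other genomes by diffing the file sizes of the two summary folders. A self-contained sketch (the directories and file names below are throwaway stand-ins, not real Earl Grey output):

```shell
# Flag same-named summary files whose sizes differ between two runs,
# which is the hint above that an OOM run produced truncated output.
run_oom=$(mktemp -d)
run_ok=$(mktemp -d)
printf 'ACGT'     > "$run_oom/families.fa.strained"   # truncated library (mock)
printf 'ACGTACGT' > "$run_ok/families.fa.strained"    # complete library (mock)

for f in "$run_ok"/*; do
    name=$(basename "$f")
    if [ "$(wc -c < "$f")" -ne "$(wc -c < "$run_oom/$name")" ]; then
        echo "$name differs between runs"
    fi
done
```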
Is there a way to resume earlGrey from where it failed?
Hi @ting-hsuan-chen!
In this case it is likely that the OOM event prevented proper processing during the divergence calculations, where the annotations are read into memory to calculate Kimura divergence. It is probably worth rerunning these jobs just to make sure.
You can rerun the failed steps of EarlGrey by deleting ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed, then resubmitting the job with exactly the same command-line options as before. EarlGrey will skip stages that completed successfully, so in this case it should only rerun the defragmentation step and the divergence calculations.
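As a sketch of the resume procedure just described, using a throwaway directory and one of the species prefixes from this thread (substitute your real OUTDIR and species name, and resubmit with your original command):

```shell
# Illustration only: build a mock output tree, then delete the merged BED
# file, which is what signals Earl Grey to redo the defragmentation and
# divergence steps on the next run of the same command.
OUTDIR=$(mktemp -d)          # stand-in for your real output directory
species=01.01_red5_v2        # stand-in species prefix
mkdir -p "${OUTDIR}/${species}_mergedRepeats/looseMerge"
touch "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed"

rm "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed"
# ...then resubmit the original earlGrey command unchanged.
```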
Thank you @TobyBaril, I'll try it.
Hi @TobyBaril, it turned out I restarted a fresh run using earlGreyLibConstruct because I only need the TE library files. The OOM error was raised for some genomes, and the end of their log files looked the same as I mentioned previously. So I wanted to follow your instructions to rerun earlGreyLibConstruct for them. However, there is no folder called ${OUTDIR}/${species}_mergedRepeats. Is this expected? Could you advise on how to resume these jobs using earlGreyLibConstruct? Thanks! :)
Hi @ting-hsuan-chen, the library construction terminates after TEstrainer, where the de novo libraries are generated. It will not run the final annotation and subsequent defragmentation. The idea behind this subscript of Earl Grey is to generate the libraries, which can then be combined into a single non-redundant library used to mask all the genomes at the end.
On this, I've made a note to add another subscript for the final annotation and defragmentation for the next release!
Thank you @TobyBaril! earlGreyLibConstruct is exactly what I need - we are building a pan-TE library for multiple genomes. I have some follow-up questions.
I allocated 10 CPUs and a total of 100G of memory for each genome (each about 500-600 Mb in size). For some genomes, I still got the "Out Of Memory" error from Slurm when using earlGreyLibConstruct, but I didn't find any error message in the log file.
For your reference, the tail of the log file for a Slurm job with the OOM error is attached below. It seems that the TEstrainer step completed? If not, how do I resume earlGreyLibConstruct to complete the job? Would the approach mentioned in https://github.com/TobyBaril/EarlGrey/issues/58#issuecomment-1757725110 suit my case?
Trimming and sorting based on mreps, TRF, SA-SSR
Removing temporary files
Reclassifying repeats
RepeatClassifier Version 2.0.5
======================================
- Looking for Simple and Low Complexity sequences..
- Looking for similarity to known repeat proteins..
- Looking for similarity to known repeat consensi..
../01_EGLibConstruct/04.01_poly_v1_EarlGrey/04.01_poly_v1_strainer
Compiling library
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Tidying Directories and Organising Important Files >>>
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Done in 86:10:35.00 >>>
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< TE library in Standard Format Can Be Found in ../01_EGLibConstruct/04.01_poly_v1_EarlGrey/04.01_poly_v1_summaryFiles/ >>>
Hi!
You can resume by following that comment. From the log you posted, it looks as though the job completed successfully. OOM handling in Slurm can be a bit odd, though: it terminates the job without giving it a chance to finish cleanly, so there's no guarantee the step it interrupted finished properly.
Hi,
I've also had the same issue when running EarlGrey on a Slurm HPC, and what surprised me is that for the exact same input file, one run consumed more than 300G of RAM (and generated an OOM error) while a second run finished without error and a maximum consumption of ca. 20Gb of RAM. On large genomes (>600Mb to 3Gb), I systematically get OOM errors (with varying numbers of tasks killed for OOM), with EarlGrey requiring huge amounts of RAM (more than 500Gb). However, the masking still finished.
Hi @jeankeller,
This is strange...are you happy to provide the log files from both runs for us to take a look at?
Hi @TobyBaril, thanks for your answer. As it was a few months ago, the log file of the failed run has been removed. I can still share the one from the run that worked, but I'm not sure that will be useful... I was a bit surprised that, regardless of genome size, there are OOM errors returned by Slurm (slurmstepd: error: Detected 1 oom_kill event in StepId=12315883.batch. Some of the step tasks have been OOM Killed); the number of killed tasks varies between runs. I can share any logs with you if you want. [EDIT] I just realized that we are running through conda; could it be related to conda?
Best, Jean
Hi Jean,
It shouldn't be related to conda. It might be related to the divergence calculations, several of which run in parallel, although this shouldn't really cause issues until we get to files with millions of TE hits... I'll keep trying to narrow this down; it is a strange one, as nothing in the core pipeline has changed for several iterations now!
Hi Toby,
Yes, it is weird. The HPC team installed EarlGrey as a Slurm module instead of a user conda environment, and in the tests I have run, it looks like the error has gone. I am running more tests on species with different genome sizes to confirm the pattern. I can share the log of the failed run (under the conda environment) that used more than 300Gb of RAM; we redid it the exact same way and it consumed only 10-15Gb of RAM. Best Jean
Hi Toby,
It seems to me that the divergence calculations are not the cause. I've been using a conda environment and submitting jobs to Slurm, and I only use earlGreyLibConstruct, which doesn't include the divergence calculations. The huge RAM consumption persists.
I've run earlGreyLibConstruct on several plant genomes separately, each around 500-600Mb. I kept getting OOM errors for some of them, and therefore needed to empty the "strainer" folder and resume the analysis with more RAM. Some runs used <150G of RAM, while others needed 300G or more.
@jeankeller It's great to know that a SLURM module installation might solve the problem. I'll contact our HPC team and see if that can be done on our side.
Cheers Ting-Hsuan
Okay, so this looks like it could be linked to something in TEstrainer, or potentially a conda module. @jamesdgalbraith might be able to provide more information on the specific sections of TEstrainer that could be the culprit, but we will look into it.
The memory-hungry stage of TEstrainer is the multiple sequence alignment using MAFFT. The amount of memory used can vary between runs on the same genome depending on several factors, including which repeats that particular run of RepeatModeler found (the seed it uses varies), and especially whether it detects satellite repeats, as constructing MSAs of long arrays of tandem repeats is very memory-hungry. This may be what you're encountering, @jeankeller. Unfortunately I don't currently have a fix for this, but I have been exploring potential ways of overcoming the issue.
In the first jobs you mentioned, @ting-hsuan-chen, I don't think the OOM occurred in TEstrainer, given the presence of the 01.01_red5_v2-families.fa.strained file in the summary folder. In testing, I've found that if TEstrainer causes an OOM error, EarlGrey will stop at TEstrainer and not continue with the RepeatMasker annotation and tidy-up.
Hi all, so I've run a test with a large genome (4Gb) and, after 12 days, it ended with an OOM error. This was EarlGrey v5.0.0 installed and set up as a Slurm module. It consumed more than 120Gb of RAM but produced all the expected outputs. The end of the Slurm log file reads: slurmstepd: error: Detected 23 oom_kill events in StepId=13250633.batch. Some of the step tasks have been OOM Killed. The full log is available if needed.
Hi @jeankeller,
thanks for the update! It would be great to have the log file if possible. I've had a chat with @jamesdgalbraith and we think this might be related to running several instances of MAFFT in parallel, particularly on big genomes when generating alignments for families with high copy number. We are working on refining the memory management, but it would still be useful to check the logs to make sure this is indeed the issue you faced with this genome.
Thanks!
Hi @TobyBaril, thanks for the answer! Good to hear that you have a clue about what could be causing the issue. How would you like me to transfer the log file (it's about 240Mb)? Jean
Hey all!
Just a quick update on this - we have tracked this to the memory-hungry EMBOSS water implementation, which also seems to be the root of #163. We are working on a suitable replacement for this in the divergence calculator and will update when we have developed and tested a more memory-friendly implementation.
Thanks @TobyBaril! I'm looking forward to it!!!
This has been pushed in release 5.1.0. I will add more patch notes shortly, but the main change is a shift away from using water for alignments to using matcher, as recommended by EMBOSS for longer alignments. This change will result in a slight shift in Kimura distance calculations towards slightly lower divergence, due to the local similarity calculations employed by matcher. Therefore, use caution if you are going to compare against previous runs with older versions of Earl Grey.
Hi Toby, that's awesome! Thanks for this fast update. Are the coming patches major? Or should we already use this 5.1.0 version?
Version 5.1.0 contains the changes that should solve these issues. It is currently awaiting approval on Bioconda, so it should be live later this evening or tomorrow, depending on when it gets a review (it has already passed all the appropriate tests).
excellent, thank you! I'll try it asap :)
This is awesome!
Hi,
Happy new year 2025! I've tested v5.1.0 of EarlGrey on a 4Gb genome and the OOM error is still present: "slurmstepd: error: Detected 15 oom_kill events in StepId=14158717.batch. Some of the step tasks have been OOM Killed.", although the softmasked genome is produced. EarlGrey was installed through Miniconda.
Best, Jean
Hi all,
I have the OOM issue with Slurm as well.
I'm running earlGreyLibConstruct with 250G of memory and 32 CPUs using Apptainer with Slurm, on a large and highly repetitive genome, so I was expecting it to take time. Using version 5.1.0, the job ran out of memory after 14 days.
The 5 rounds of RepeatModeler have completed.
It seems the OOM kill event occurred during the strainer step (at 5 am this morning). These are the files present in the strainer dir:
├── AXX_polished_unitigs-families.fa.strained
└── TS_AXX_polished_unitigs-families.fa_6326
    ├── AXX_polished_unitigs-families.fa
    ├── AXX_polished_unitigs-families.fa.bak
    ├── AXX_polished_unitigs-families.fa.strained (5 am)
    ├── classify
    │   ├── AXX_polished_unitigs-families.fa.nonsatellite
    │   ├── AXX_polished_unitigs-families.fa.nonsatellite.classified (5 am)
    │   ├── tmpConsensi.fa
    │   └── tmpConsensi.fa.cat.gz
    ├── missing_consensi.txt
    └── trf
        ├── AXX_polished_unitigs-families.fa
        ├── AXX_polished_unitigs-families.fa.mreps
        ├── AXX_polished_unitigs-families.fa.nonsatellite
        ├── AXX_polished_unitigs-families.fa.sassr
        ├── AXX_polished_unitigs-families.fa.satellites
        └── AXX_polished_unitigs-families.fa.trf
And in the summary dir, also with the 5 am timestamp:
AXX_polished_unitigs-families.fa.strained
It seems that strainer completed and then the next step failed immediately. Tail of the log file:
Trimming and sorting based on mreps, TRF, SA-SSR
Warning message:
Failed to locate timezone database
Removing temporary files
Reclassifying repeats
RepeatClassifier Version 2.0.6
======================================
- Looking for Simple and Low Complexity sequences..
- Looking for similarity to known repeat proteins..
- Looking for similarity to known repeat consensi..
/nesi/nobackup/ga02470/acanthoxyla/AXX/asm4_raft_q20_nhap3_cov43_polishing_earlgrey/earlGreyOutputs/AXX_polished_unitigs_EarlGrey/AXX_polished_unitigs_strainer
Compiling library
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Tidying Directories and Organising Important Files >>>
/usr/local/bin/earlGreyLibConstruct: line 193: bc: command not found
/usr/local/bin/earlGreyLibConstruct: line 194: bc: command not found
/usr/local/bin/earlGreyLibConstruct: line 195: bc: command not found
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Done in 00:00:00.00 >>>
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< TE library in Standard Format Can Be Found in /nesi/nobackup/ga02470/acanthoxyla/AXX/asm4_raft_q20_nhap3_cov43_polishing_earlgrey/earlGreyOutputs/AXX_polished_unitigs_EarlGrey/AXX_polished_unitigs_summaryFiles/ >>>
Because the families.fa.strained file is present in the summaryFiles dir, I wasn't sure if I should remove the contents of the strainer dir to resume, because perhaps it actually completed. BUT the timestamp on that summary file is from when the job was killed.
Any idea what the problem was with lines 193-195 of the script in my case?
Many thanks, Gemma
Thanks for sharing @gemmacol. This looks like TEstrainer finished as expected (@jamesdgalbraith could confirm for sure), and the OOM is potentially happening during the file tidying steps of TEstrainer. In your tree above, can you check the sizes of AXX_polished_unitigs-families.fa.strained and AXX_polished_unitigs-families.fa.nonsatellite.classified? They should be the same size, or AXX_polished_unitigs-families.fa.strained should be a little larger if the consolidation has finished. If it is larger, check that the trailing sequences are the satellites (grep ">" AXX_polished_unitigs-families.fa.strained | tail), and that these match the final sequence headers of AXX_polished_unitigs-families.fa.satellites (grep ">" AXX_polished_unitigs-families.fa.satellites | tail). If they do, then TEstrainer finished and the issue is further down the pipeline.
This should help us narrow down exactly where the issue is. Unfortunately, this seems mainly to be an issue on (very) large and/or highly repetitive genomes, so it is taking us some time to work out exactly what is happening.
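The header check described above can be scripted; here is a self-contained sketch using mock FASTA files (point the two variables at the real .strained and .satellites files instead):

```shell
# If the trailing headers of the strained library match those of the
# satellite file, the satellite consolidation (the last TEstrainer step)
# likely finished. The files below are throwaway mocks for illustration.
workdir=$(mktemp -d)
strained="$workdir/families.fa.strained"
satellites="$workdir/families.fa.satellites"
printf '>rnd-1_family-1#LTR\nACGT\n>rnd-1_family-387#Satellite\nACGT\n' > "$strained"
printf '>rnd-1_family-387#Satellite\nACGT\n' > "$satellites"

if [ "$(grep '>' "$strained" | tail -n 1)" = "$(grep '>' "$satellites" | tail -n 1)" ]; then
    echo "tail headers match: TEstrainer appears to have finished"
fi
```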
Hi Toby,
I have checked the two things you suggest and indeed it seems that TEstrainer finished and the issue is further down the pipeline.
AXX_polished_unitigs-families.fa.strained = 45581630
AXX_polished_unitigs-families.fa.nonsatellite.classified = 45150251
grep ">" AXX_polished_unitigs-families.fa.strained | tail
>rnd-1_family-523#Satellite
>rnd-3_family-1008#Satellite
>rnd-1_family-245#Satellite
>rnd-2_family-802#Satellite
>rnd-1_family-359#Satellite
>rnd-4_family-1114#Satellite
>rnd-5_family-1376#Satellite
>rnd-5_family-6809#Satellite
>rnd-5_family-2170#Satellite
>rnd-1_family-387#Satellite
grep ">" AXX_polished_unitigs-families.fa.satellites | tail
>rnd-1_family-523#Satellite
>rnd-3_family-1008#Satellite
>rnd-1_family-245#Satellite
>rnd-2_family-802#Satellite
>rnd-1_family-359#Satellite
>rnd-4_family-1114#Satellite
>rnd-5_family-1376#Satellite
>rnd-5_family-6809#Satellite
>rnd-5_family-2170#Satellite
>rnd-1_family-387#Satellite
Is there a way I can just test the next step? I ask because of this error in the log file:
<<< Tidying Directories and Organising Important Files >>>
/usr/local/bin/earlGreyLibConstruct: line 193: bc: command not found
/usr/local/bin/earlGreyLibConstruct: line 194: bc: command not found
/usr/local/bin/earlGreyLibConstruct: line 195: bc: command not found
And is the file AXX_polished_unitigs_summaryFiles/AXX_polished_unitigs-families.fa.strained the same as AXX_polished_unitigs_strainer/AXX_polished_unitigs-families.fa.strained? I deleted the contents of summaryFiles to prepare for resubmitting the job, but now I want to get it back by copying the file from the strainer dir.
Also do you know Dr Sarah Semeraro from your building? If you see her say hi from me!
Many thanks, Gemma
Hi @gemmacol,
Thanks for the update! In this case, I reckon the culprit might be the divergence calculator, as it performs lots of alignments and gets quite memory hungry on very large annotation files (i.e. lots of alignments to do). The bc error can be safely ignored: it is just a timer function that reports how long the pipeline took, and bc is not found in containerised environments. Earl Grey automatically skips completed steps if the same command is run again and the expected files are present, so if you rerun the same command it should detect all the files (it doesn't matter about those in summaryFiles, as these are regenerated anyway).
All steps up to the defragmentation step will be skipped. My guess is the OOM occurs in this or the following step. If we can narrow this down, it might help us come up with a solution for large genomes.
I haven't run into Sarah, but if I do I'll say hello!
Best Wishes,
Toby