SLURM-specific behavior: repeat content dramatically reduced but no errors
Our cluster moved from an LSF to a SLURM workload manager this year. I really enjoy using EDTA for our genome projects and ran it on a couple of assemblies while still working under LSF. Our total repeat content, as expected, was always in the same range as previous estimates from short-read data.
After the switch, I re-ran it on those same assemblies after making some small changes (fixing a few minor SVs here and there), with the same install, same version, same script, and same resource request as before. All of a sudden, the total repeat content dropped substantially, and I can't find any indication as to why in the log files or the .err/.out files. This pattern has since held for multiple genomes across multiple accounts, regardless of version or install type. Our IT department looked into it and said it was likely something to do with how EDTA uses resources under SLURM, but could not find the root of the problem either. It was especially confusing that EDTA wasn't using all of the resources allocated to it.
I don't know how to address this or begin troubleshooting it. Do you have any idea what might be causing this behavior?
I've placed some examples from one of the genomes I'm working on below in case it helps. Again, there's nothing in the .log, .err, or .out files; according to those, the job completed successfully.
For this genome, the expected total repeat content is ~30-33% based on previous runs of EDTA and estimates with GenomeScope.
This was the first attempt, using the same script and install as before the LSF-to-SLURM switch (totals in the same 6-7% range also occurred when I kept the same resources, switched --ntasks to -c, and ran the Singularity image instead):
#!/bin/sh
#SBATCH -e edta_test_%j.err
#SBATCH -o edta_test_%j.out
#SBATCH --job-name=edta_test
#SBATCH --time-min=120:00:00
#SBATCH --ntasks=25
#SBATCH --mem=80G
#SBATCH --partition=plant
#SBATCH --nodes=1
perl ~/mambaforge/envs/edta/bin/EDTA.pl --genome hap2_curated.FINAL.fasta --species others --anno 1 -t 25
and this is the SLURM output:
Job 1312854 (COMPLETED)
Name edta_test
Submit sbatch edta.sh
Nodes plant - plant02
Input /dev/null
Output [path to]/edta_test_1312854.out
Error [path to]/edta_test_1312854.err
Resources CPU = 25 Memory = 81920
Start 2023-08-01 13:37:40
End 2023-08-01 18:03:34
Elapsed 265.9 minutes
Limit 28800 minutes
Exit Code SUCCESS (0)
Usage:
min CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
min Mem = 13133.449 MB (16.03 %)
max CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
max Mem = 13133.449 MB (16.03 %)
average CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
average Mem = 13133.449 MB (16.03 %)
total CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
total Mem = 13133.449 MB (16.03 %)
and here's the EDTA output:
Repeat Classes
==============
Total Sequences: 9
Total Length: 298741932 bp
Class Count bpMasked %masked
===== ===== ======== =======
LTR -- -- --
Copia 5290 7521628 2.52%
Gypsy 2279 2886915 0.97%
unknown 2123 1271613 0.43%
TIR -- -- --
CACTA 4085 2304875 0.77%
Mutator 6566 2950914 0.99%
PIF_Harbinger 1010 424154 0.14%
Tc1_Mariner 144 83628 0.03%
hAT 3496 2342891 0.78%
nonTIR -- -- --
helitron 1595 830065 0.28%
---------------------------------
total interspersed 26588 20616683 6.90%
---------------------------------------------------------
Total 26588 20616683 6.90%
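As a consistency check on that summary, the per-class bpMasked values can be summed and divided by the total length to see whether the low figure is just a reporting artifact. This is a minimal sketch with the numbers copied from the table above; the helper name `check_sum` is made up for illustration and is not part of EDTA.

```shell
#!/bin/sh
# check_sum TOTAL_LENGTH_BP BPMASKED...
# Hypothetical helper: sums the per-class bpMasked values and prints
# the total and its share of the assembly length.
check_sum() {
    total_len=$1
    shift
    printf '%s\n' "$@" | awk -v len="$total_len" '
        { sum += $1 }
        END { printf "%d bp masked, %.2f%% of assembly\n", sum, 100 * sum / len }'
}

# Copia Gypsy unknown CACTA Mutator PIF_Harbinger Tc1_Mariner hAT helitron
check_sum 298741932 7521628 2886915 1271613 2304875 2950914 424154 83628 2342891 830065
```

The class rows add up to the reported 20616683 bp (6.90%), so the summary itself is internally consistent and the drop appears to come from what was annotated rather than from how it was tallied.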
Even though the job wasn't using all of the memory provided to it, I wondered whether this was a resource allocation problem. After gradually increasing the request (especially the number of CPUs per task, set with -c), I was able to reproduce the total repeat content and class ratios I expected with the run below. Scaling the resources similarly for other, larger genomes did not work, however:
#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1
module load cluster/singularity/3.11.0
export PYTHONNOUSERSITE=1
singularity exec [path to]/EDTA.sif EDTA.pl --genome hap2_curated.FINAL.fasta --anno 1
Here's the SLURM job output (again, not actually using much of the resources allocated):
Job 1322805 (COMPLETED)
Name edta_singularity
Submit sbatch edta.sh
Nodes plant - plant01
Input /dev/null
Output [path to]/edta_singularity_1322805.out
Error [path to]/edta_singularity_1322805.err
Resources CPU = 100 Memory = 307200
Start 2023-08-04 11:47:48
End 2023-08-04 19:17:14
Elapsed 449.43 minutes
Limit 28800 minutes
Exit Code SUCCESS (0)
Usage:
min CPU = 60512.09 sec (16:48:32.09, 2.24 %)
min Mem = 12943.504 MB (4.21 %)
max CPU = 60512.09 sec (16:48:32.09, 2.24 %)
max Mem = 12943.504 MB (4.21 %)
average CPU = 60512.09 sec (16:48:32.09, 2.24 %)
average Mem = 12943.504 MB (4.21 %)
total CPU = 60512.09 sec (16:48:32.09, 2.24 %)
total Mem = 12943.504 MB (4.21 %)
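To put the two usage reports side by side, the CPU percentage can be turned into an average number of busy cores. This is a rough sketch that assumes the reported percentage is total CPU time divided by (elapsed time x allocated CPUs); that assumption reproduces both figures above, but it is an inference from the numbers, not documented behavior of the reporting tool. The function name `cpu_usage` is hypothetical.

```shell
#!/bin/sh
# cpu_usage CPU_SECONDS ELAPSED_MINUTES ALLOCATED_CPUS
# Hypothetical helper: prints the average number of busy cores and the
# share of the allocation that this represents.
cpu_usage() {
    awk -v cpu="$1" -v mins="$2" -v n="$3" 'BEGIN {
        busy = cpu / (mins * 60)    # average cores busy over the whole run
        printf "%.2f cores busy, %.2f%% of %d allocated\n", busy, 100 * busy / n, n
    }'
}

cpu_usage 89437.26 265.9 25     # job 1312854 (--ntasks=25, -t 25)
cpu_usage 60512.09 449.43 100   # job 1322805 (-c 100, no -t passed)
```

By this estimate the first job kept about 5.6 cores busy on average and the second about 2.2, so the run that recovered the expected repeat content actually used fewer cores on average than the one that did not.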
And finally, the EDTA .sum file output:
Repeat Classes
==============
Total Sequences: 9
Total Length: 298741932 bp
Class Count bpMasked %masked
===== ===== ======== =======
LTR -- -- --
Copia 35182 32118649 10.75%
Gypsy 19775 17604895 5.89%
unknown 13685 5906158 1.98%
TIR -- -- --
CACTA 27395 11192622 3.75%
Mutator 42888 14389429 4.82%
PIF_Harbinger 6605 2014056 0.67%
Tc1_Mariner 763 218663 0.07%
hAT 22646 10385771 3.48%
nonTIR -- -- --
helitron 11927 3650322 1.22%
---------------------------------
total interspersed 180866 97480565 32.63%
---------------------------------------------------------
Total 180866 97480565 32.63%
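In case it helps narrow this down, one thing that could be logged at the top of either job script is what the job actually sees at run time, since `--ntasks` and `-c` set different SLURM variables. This is a minimal sketch, assuming the standard SLURM output environment variables and GNU `nproc` are available on the compute nodes; the function name `report_job_resources` is made up for illustration.

```shell
#!/bin/sh
# report_job_resources: print the CPU-related settings visible inside the job.
# Variables that SLURM did not set are shown as "unset".
report_job_resources() {
    echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}"
    echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-unset}"
    echo "SLURM_CPUS_ON_NODE=${SLURM_CPUS_ON_NODE:-unset}"
    # nproc (GNU coreutils) reports the CPUs this process is allowed to run on
    if command -v nproc >/dev/null 2>&1; then
        echo "nproc=$(nproc)"
    fi
}

report_job_resources
```

Comparing these values with the thread count passed to EDTA via `-t` would show whether the two agree. Note that the Singularity command above does not pass `-t` at all, and that `SLURM_CPUS_PER_TASK` is only set when `-c` is given.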