EDTA
SLURM-specific behavior: repeat content dramatically reduced but no errors
Our cluster moved from an LSF to a SLURM workload manager this year. I really enjoy using EDTA for our genome projects and ran it on a couple of assemblies while still working under LSF. Our total repeat content, as expected, was always in the same range as previous estimates from short-read data.
After the switch, I decided to re-run it on those same assemblies after making some small changes (fixing some minor SVs here and there) -- same install, same version, same script, same resource request. But all of a sudden the total repeat content dropped substantially, and I can't find any indication as to why in the log files or .err/.out files. This pattern has held for multiple genomes across multiple accounts, regardless of version or install type. Our IT department looked into it and said it was likely something to do with how EDTA uses resources under SLURM, but could not find the root of the problem either. It was especially confusing that EDTA wasn't using all of the resources allocated to it.
I don't know how to address this or begin troubleshooting it. Do you have any idea what might be causing this behavior?
I've placed some examples from one of the genomes I'm working on below if it helps. Again, there's nothing in the .log, .err, or .out files -- according to those, the job completed successfully.
For this genome, the expected total repeat content is ~30-33% based on previous runs of EDTA and estimates with GenomeScope.
This was the first attempt, using the same script and install as before the LSF/SLURM switch (values in this 6-7% range also occurred if I kept the same resources, switched --ntasks to -c, and used Singularity instead):
#!/bin/sh
#SBATCH -e edta_test_%j.err
#SBATCH -o edta_test_%j.out
#SBATCH --job-name=edta_test
#SBATCH --time-min=120:00:00
#SBATCH --ntasks=25
#SBATCH --mem=80G
#SBATCH --partition=plant
#SBATCH --nodes=1
perl ~/mambaforge/envs/edta/bin/EDTA.pl --genome hap2_curated.FINAL.fasta --species others --anno 1 -t 25
and this is the SLURM output:
Job 1312854 (COMPLETED)
Name edta_test
Submit sbatch edta.sh
Nodes plant - plant02
Input /dev/null
Output [path to]/edta_test_1312854.out
Error [path to]/edta_test_1312854.err
Resources CPU = 25 Memory = 81920
Start 2023-08-01 13:37:40
End 2023-08-01 18:03:34
Elapsed 265.9 minutes
Limit 28800 minutes
Exit Code SUCCESS (0)
Usage:
min CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
min Mem = 13133.449 MB (16.03 %)
max CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
max Mem = 13133.449 MB (16.03 %)
average CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
average Mem = 13133.449 MB (16.03 %)
total CPU = 89437.26 sec (1 day, 0:50:37.26, 22.42 %)
total Mem = 13133.449 MB (16.03 %)
and here's the EDTA output:
Repeat Classes
==============
Total Sequences: 9
Total Length: 298741932 bp
Class Count bpMasked %masked
===== ===== ======== =======
LTR -- -- --
Copia 5290 7521628 2.52%
Gypsy 2279 2886915 0.97%
unknown 2123 1271613 0.43%
TIR -- -- --
CACTA 4085 2304875 0.77%
Mutator 6566 2950914 0.99%
PIF_Harbinger 1010 424154 0.14%
Tc1_Mariner 144 83628 0.03%
hAT 3496 2342891 0.78%
nonTIR -- -- --
helitron 1595 830065 0.28%
---------------------------------
total interspersed 26588 20616683 6.90%
---------------------------------------------------------
Total 26588 20616683 6.90%
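For what it's worth, a minimal sanity check (my own sketch, nothing EDTA-specific) is to have the batch script print how many CPUs the job step can actually use before launching EDTA.pl. Note that --ntasks=25 requests 25 single-CPU tasks rather than 25 CPUs for one multithreaded process, so what the process sees may not match the request:

```shell
#!/bin/sh
# Hedged sketch: lines to add before the EDTA.pl call to record what the
# job step can actually use. nproc honors the CPU affinity/cgroup limits
# imposed by SLURM, so it can report fewer CPUs than the #SBATCH request.
echo "SLURM_CPUS_ON_NODE  = ${SLURM_CPUS_ON_NODE:-unset}"
echo "SLURM_CPUS_PER_TASK = ${SLURM_CPUS_PER_TASK:-unset}"
echo "nproc sees $(nproc) usable CPUs"
```

If nproc reports far fewer CPUs than requested, switching --ntasks to --cpus-per-task (as in the second script below) is the usual fix for a single multithreaded program.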
Even though the job wasn't using all of the memory provided to it, I wondered if this was a resource-allocation issue, so I slowly increased the request (especially --cpus-per-task). I was able to reproduce the expected total repeat content and ratios with this run, but scaling the resources similarly for other, larger genomes did not work:
#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1
module load cluster/singularity/3.11.0
export PYTHONNOUSERSITE=1
singularity exec [path to]/EDTA.sif EDTA.pl --genome hap2_curated.FINAL.fasta --anno 1
Here's the SLURM job output (again, not actually using much of the resources allocated):
Job 1322805 (COMPLETED)
Name edta_singularity
Submit sbatch edta.sh
Nodes plant - plant01
Input /dev/null
Output [path to]/edta_singularity_1322805.out
Error [path to]/edta_singularity_1322805.err
Resources CPU = 100 Memory = 307200
Start 2023-08-04 11:47:48
End 2023-08-04 19:17:14
Elapsed 449.43 minutes
Limit 28800 minutes
Exit Code SUCCESS (0)
Usage:
min CPU = 60512.09 sec (16:48:32.09, 2.24 %)
min Mem = 12943.504 MB (4.21 %)
max CPU = 60512.09 sec (16:48:32.09, 2.24 %)
max Mem = 12943.504 MB (4.21 %)
average CPU = 60512.09 sec (16:48:32.09, 2.24 %)
average Mem = 12943.504 MB (4.21 %)
total CPU = 60512.09 sec (16:48:32.09, 2.24 %)
total Mem = 12943.504 MB (4.21 %)
And finally, the EDTA .sum file output:
Repeat Classes
==============
Total Sequences: 9
Total Length: 298741932 bp
Class Count bpMasked %masked
===== ===== ======== =======
LTR -- -- --
Copia 35182 32118649 10.75%
Gypsy 19775 17604895 5.89%
unknown 13685 5906158 1.98%
TIR -- -- --
CACTA 27395 11192622 3.75%
Mutator 42888 14389429 4.82%
PIF_Harbinger 6605 2014056 0.67%
Tc1_Mariner 763 218663 0.07%
hAT 22646 10385771 3.48%
nonTIR -- -- --
helitron 11927 3650322 1.22%
---------------------------------
total interspersed 180866 97480565 32.63%
---------------------------------------------------------
Total 180866 97480565 32.63%
I notice you switched from conda to Singularity while increasing the memory allocation. The two may use different versions of EDTA, RepeatMasker, and rmblast. You may want to check the versions of these packages in the two installations.
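A quick way to compare those versions in both installs might be something like this (hedged sketch; the env name edta and the image filename EDTA.sif are assumptions, and each step is skipped if the tool is not on PATH):

```shell
#!/bin/sh
# Hedged sketch: list EDTA/RepeatMasker/rmblast versions in the conda env
# and in the Singularity image, guarding each step so the script is safe
# to run on a login node that lacks one of the tools.
if command -v conda >/dev/null 2>&1; then
    conda list -n edta 2>/dev/null | grep -Ei '^(edta|repeatmasker|rmblast|tesorter)' || true
fi
if command -v singularity >/dev/null 2>&1; then
    singularity exec EDTA.sif sh -c 'RepeatMasker -v; rmblastn -version' || true
fi
echo "version check done"
```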
Shujun
@oushujun -- I'm sorry I didn't clarify this in the original post, but I've tried both versions available via singularity and the current version available via conda. The only thing that has worked (tentatively) is increasing cpus-per-task and sometimes memory allocation, but EDTA is not actually using all of the resources allocated (4.21% in the run above) and I can't replicate this success on larger genomes.
Some processes in EDTA are single-threaded and can be slow on some genomes, if that is your question. As long as it finishes without errors it should be fine. You need to use the latest version of EDTA, though, which is not in the Singularity image.
Shujun
@oushujun -- I have also used the most recent version. EDTA finishes without error in every case, even when it's estimating a total repeat content of 6% for a known 30-33% genome or 15% for a known 80% genome.
@oushujun -- I just wanted to follow-up on this. Do you have any guesses as to what might be causing the behavior I described above?
@laramiemckenna I am not sure. I have not seen this behavior before. Even for the small Arabidopsis genome, the EDTA annotation is reasonable and captures the major numbers. If you don't see any errors, I don't know what may go wrong. Did you test on the rice or Arabidopsis genome?
@oushujun I was not sure what the expected output was for the test data, but I did run it on Arabidopsis TAIR10.1 using the same parameters as the run that was successful above (the second example in the original issue). This is what I got, compared to the expected amount of ~21%. I'm extra confused because these exact same parameters, version, and image were used for the run that was somewhat successful, but this one wasn't.
Script
#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1
module load cluster/singularity/3.11.0
export PYTHONNOUSERSITE=1
singularity exec [path to]/EDTA.sif EDTA.pl --genome GCA_000001735.2_TAIR10.1_genomic.fna --anno 1
SLURM job output:
Job 1514574 (COMPLETED)
Name edta_singularity
Submit sbatch edta.sh
Nodes plant - plant02
PWD [path to]/arabi_edta_test
Input /dev/null
Output [path to]/arabi_edta_test/edta_singularity_1514574.out
Error [path to]/arabi_edta_test/edta_singularity_1514574.err
Resources CPU = 100 Memory = 307200
Start 2023-10-03 09:31:32
End 2023-10-03 11:58:44
Elapsed 147.2 minutes
Limit 28800 minutes
Exit Code SUCCESS (0)
Usage:
min CPU = 19205.11 sec (5:20:05.11, 2.17 %)
min Mem = 5237.469 MB (1.7 %)
max CPU = 19205.11 sec (5:20:05.11, 2.17 %)
max Mem = 5237.469 MB (1.7 %)
average CPU = 19205.11 sec (5:20:05.11, 2.17 %)
average Mem = 5237.469 MB (1.7 %)
total CPU = 19205.11 sec (5:20:05.11, 2.17 %)
total Mem = 5237.469 MB (1.7 %)
Summary Output:
Repeat Classes
==============
Total Sequences: 7
Total Length: 119482896 bp
Class Count bpMasked %masked
===== ===== ======== =======
LTR -- -- --
Copia 786 925858 0.77%
Gypsy 1634 2410885 2.02%
unknown 405 352244 0.29%
TIR -- -- --
CACTA 589 405463 0.34%
Mutator 1364 792585 0.66%
PIF_Harbinger 284 150803 0.13%
Tc1_Mariner 23 27241 0.02%
hAT 237 105587 0.09%
nonTIR -- -- --
helitron 3066 1818477 1.52%
---------------------------------
total interspersed 8388 6989143 5.85%
---------------------------------------------------------
Total 8388 6989143 5.85%
Below is the output of the test run if that helps!
Script (using same parameters)
#!/bin/sh
#SBATCH -e edta_singularity_%j.err
#SBATCH -o edta_singularity_%j.out
#SBATCH --job-name=edta_singularity
#SBATCH --time-min=120:00:00
#SBATCH -c 100
#SBATCH --mem=300G
#SBATCH --partition=plant
#SBATCH --nodes=1
module load cluster/singularity/3.11.0
export PYTHONNOUSERSITE=1
singularity exec [path to]/EDTA.sif EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice6.9.5.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 10
SLURM Job Output
Job 1515039 (COMPLETED)
Name edta_singularity
Nodes plant - plant02
Command [path to]/test/test_edta.sh
PWD [path to]/test
Input /dev/null
Output [path to]/edta_singularity_1515039.out
Error [path to]/edta_singularity_1515039.err
CPU nodes = 1 cpus = 100 tasks = 1
TRES cpu=100,mem=300G,node=1,billing=100
Start 2023-10-03 13:33:09
End 2023-10-03 13:36:25
Elapsed 7.77 minutes
Limit 28800 minutes
Summary Output:
Repeat Classes
==============
Total Sequences: 1
Total Length: 1000000 bp
Class Count bpMasked %masked
===== ===== ======== =======
LTR -- -- --
Copia 13 18315 1.83%
Gypsy 46 107087 10.71%
TRIM 1 129 0.01%
unknown 1 248 0.02%
TIR -- -- --
CACTA 24 20363 2.04%
Mutator 110 47775 4.78%
PIF_Harbinger 110 27512 2.75%
Tc1_Mariner 124 48718 4.87%
hAT 34 13891 1.39%
unknown 15 2972 0.30%
nonLTR -- -- --
LINE_element 28 10614 1.06%
SINE_element 11 2329 0.23%
nonTIR -- -- --
helitron 81 57826 5.78%
---------------------------------
total interspersed 598 357779 35.78%
---------------------------------------------------------
Total 598 357779 35.78%
Error File:
/opt/conda/lib/python3.6/site-packages/Bio/Seq.py:2983: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning,
2023-10-03 13:35:58,608 -INFO- HMM scanning against `/opt/conda/lib/python3.6/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm`
2023-10-03 13:35:58,642 -INFO- Creating server instance (pp-1.6.4.4)
2023-10-03 13:35:58,642 -INFO- Running on Python 3.6.13 linux
2023-10-03 13:35:59,080 -INFO- pp local server started with 10 workers
2023-10-03 13:35:59,097 -INFO- Task 0 started
2023-10-03 13:35:59,098 -INFO- Task 1 started
2023-10-03 13:35:59,098 -INFO- Task 2 started
2023-10-03 13:35:59,098 -INFO- Task 3 started
2023-10-03 13:35:59,098 -INFO- Task 4 started
2023-10-03 13:35:59,099 -INFO- Task 5 started
2023-10-03 13:35:59,099 -INFO- Task 6 started
2023-10-03 13:35:59,099 -INFO- Task 7 started
2023-10-03 13:35:59,099 -INFO- Task 8 started
2023-10-03 13:35:59,100 -INFO- Task 9 started
2023-10-03 13:35:59,730 -INFO- generating gene anntations
2023-10-03 13:35:59,748 -INFO- 2 sequences classified by HMM
2023-10-03 13:35:59,748 -INFO- see protein domain sequences in `genome.cds.fa.code.rexdb.dom.faa` and annotation gff3 file in `genome.cds.fa.code.rexdb.dom.gff3`
2023-10-03 13:35:59,748 -INFO- classifying the unclassified sequences by searching against the classified ones
2023-10-03 13:35:59,761 -INFO- using the 80-80-80 rule
2023-10-03 13:35:59,761 -INFO- run CMD: `makeblastdb -in ./tmp/pass1_classified.fa -dbtype nucl`
2023-10-03 13:35:59,827 -INFO- run CMD: `blastn -query ./tmp/pass1_unclassified.fa -db ./tmp/pass1_classified.fa -out ./tmp/pass1_unclassified.fa.blastout -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp sstrand' -num_threads 10`
2023-10-03 13:35:59,940 -INFO- 1 sequences classified in pass 2
2023-10-03 13:35:59,940 -INFO- total 3 sequences classified.
2023-10-03 13:35:59,940 -INFO- see classified sequences in `genome.cds.fa.code.rexdb.cls.tsv`
2023-10-03 13:35:59,940 -INFO- writing library for RepeatMasker in `genome.cds.fa.code.rexdb.cls.lib`
2023-10-03 13:35:59,949 -INFO- writing classified protein domains in `genome.cds.fa.code.rexdb.cls.pep`
2023-10-03 13:35:59,951 -INFO- Summary of classifications:
Order    Superfamily    # of Sequences    # of Clade Sequences    # of Clades    # of full Domains
LTR Gypsy 1 1 1 0
Maverick unknown 2 0 0 0
2023-10-03 13:35:59,952 -INFO- Pipeline done.
2023-10-03 13:35:59,952 -INFO- cleaning the temporary directory ./tmp
Tue Oct 3 13:36:11 CDT 2023 Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.
Out File:
Tue Oct 3 13:34:43 CDT 2023 EDTA advance filtering finished.
Tue Oct 3 13:34:43 CDT 2023 Perform EDTA final steps to generate a non-redundant comprehensive TE library:
Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.
Tue Oct 3 13:35:58 CDT 2023 Clean up TE-related sequences in the CDS file with TEsorter:
Remove CDS-related sequences in the EDTA library:
Tue Oct 3 13:36:05 CDT 2023 Combine the high-quality TE library rice6.9.5.liban with the EDTA library:
Tue Oct 3 13:36:11 CDT 2023 EDTA final stage finished! You may check out:
The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
Family names of intact TEs have been updated by rice6.9.5.liban: genome.fa.mod.EDTA.intact.gff3
Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa
Tue Oct 3 13:36:11 CDT 2023 Perform post-EDTA analysis for whole-genome annotation:
Tue Oct 3 13:36:17 CDT 2023 TE annotation using the EDTA library has finished! Check out:
Whole-genome TE annotation (total TE: 35.78%): genome.fa.mod.EDTA.TEanno.gff3
Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
Low-threshold TE masking for MAKER gene annotation (masked: 16.47%): genome.fa.mod.MAKER.masked
Tue Oct 3 13:36:17 CDT 2023 Evaluate the level of inconsistency for whole-genome TE annotation (slow step):
Tue Oct 3 13:36:25 CDT 2023 Evaluation of TE annotation finished! Check out these files:
Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum
@laramiemckenna sorry, I also don't understand why you have this low % of TE in Arabidopsis. The only abnormal thing I see is the use of the Singularity version, which is old and outdated. You may want to try the conda version instead and use the latest GitHub code.
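Concretely, that could look something like the script below (a sketch only: the env name edta built from EDTA.yml, the mambaforge path, and the resource numbers are assumptions carried over from the earlier scripts):

```shell
#!/bin/sh
#SBATCH --job-name=edta_latest
#SBATCH -c 25
#SBATCH --mem=80G
# Hedged sketch: dependencies come from the conda env built from EDTA.yml
# (env name "edta" assumed), but EDTA.pl itself is taken from a fresh
# clone of the current GitHub repo rather than the packaged release.
source ~/mambaforge/etc/profile.d/conda.sh
conda activate edta
git clone --depth 1 https://github.com/oushujun/EDTA.git
perl ./EDTA/EDTA.pl --genome hap2_curated.FINAL.fasta --species others --anno 1 -t 25
```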
@oushujun The EDTA.yml file for the conda installation still specifies EDTA 2.0.1, but the rest of the repo appears to be much newer (2.1.3). Is there a newer version of this yaml file available, or details on how to mix your conda installation instructions with the newer code in the repo?
@oushujun -- do you mean that I should use the 2.1.0 version and use EDTA.pl through the current repository, which is 2.1.3?
Yes!