EDTA
EDTA crashed after no SINE found
Hello,
I am using EDTA v2.2.0 to process my insect genomes. The command looks like this:
EDTA.pl --genome ${genome.fa} --species others --step all --overwrite 0 --sensitive 1 --anno 1 --threads 30 --cds ${rep.fna}
The program crashed after failing to find SINEs:
Thu 1 Feb 23:29:39 JST 2024 EDTA_raw: Check dependencies, prepare working directories.
Thu 1 Feb 23:29:41 JST 2024 Start to find LTR candidates.
Thu 1 Feb 23:29:41 JST 2024 Identify LTR retrotransposon candidates from scratch.
Fri 2 Feb 00:09:12 JST 2024 Finish finding LTR candidates.
Fri 2 Feb 00:09:12 JST 2024 Start to find SINE candidates.
cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!
ERROR: Raw SINE results not found in genome.fa.mod.EDTA.raw/genome.fa.mod.SINE.raw.fa
If you believe the program is working properly, this may be caused by the lack of SINEs in your genome.
This might make some sense, as RepeatModeler+RepeatMasker estimated a low SINE load in my genomes (<5% in most cases, generally 1.5%-3%). So I am wondering whether there is any way to finish the EDTA pipeline even if no SINE is found in the genome.
Sincerely,
Cong
That's abnormal. In 2.2.0, it's allowed to have 0 SINE or LINE found. Maybe you were using a slightly older version. Do you see anything in the raw/SINE folder?
Shujun
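A quick way to answer the question about the raw/SINE folder is to check whether the raw SINE FASTA exists and how many records it holds. The sketch below is only an illustration: count_fasta_records is a hypothetical helper, and the file layout is assumed from the error message above (<genome>.mod.EDTA.raw/<genome>.mod.SINE.raw.fa).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records (header lines starting with '>').
# Prints 0 for an empty file and "missing" when the file does not exist.
count_fasta_records() {
    local fasta="$1"
    if [ ! -e "$fasta" ]; then
        echo "missing"
        return 0
    fi
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c '^>' "$fasta" || true
}

# Assumed layout, taken from the error message in this thread.
genome="genome.fa"
raw_dir="${genome}.mod.EDTA.raw"
echo "raw SINE records: $(count_fasta_records "${raw_dir}/${genome}.mod.SINE.raw.fa")"
ls -l "${raw_dir}/SINE" 2>/dev/null || echo "no ${raw_dir}/SINE directory here"
```

A result of "missing" means the SINE step never wrote its output, while 0 means it ran and found nothing; the two cases likely call for different follow-ups.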
I am using EDTA v2.2.0 installed by mamba:
$ EDTA.pl
#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0 #####
##### Shujun Ou ([email protected]) #####
#########################################################
Parameters:
At least 1 parameter is required:
1) Input fasta file: --genome
This is the Extensive de-novo TE Annotator that generates a high-quality
structure-based TE library. Usage:
There is basically nothing in raw/SINE:
$ ls genome.fa.mod.EDTA.raw/SINE/
genome.fa.mod
Sincerely,
Cong
Please pull the GitHub version instead, thanks!
Shujun
Will it work if you add --force 1 to add the rice (I think) repeats to your command?
Hello,
I tried pulling EDTA from GitHub while keeping all dependencies in mamba, but it still failed the test:
$ EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10
#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0 #####
##### Shujun Ou ([email protected]) #####
#########################################################
Parameters: --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10
Fri 16 Feb 00:24:41 JST 2024 Dependency checking:
All passed!
A custom library ../database/rice7.0.0.liban is provided via --curatedlib. Please make sure this is a manually curated library but not machine generated.
A CDS file genome.cds.fa is provided via --cds. Please make sure this is the DNA sequence of coding regions only.
A BED file is provided via --exclude. Regions specified by this file will be excluded from TE annotation and masking.
Fri 16 Feb 00:24:42 JST 2024 Obtain raw TE libraries using various structure-based programs:
Fri 16 Feb 00:24:42 JST 2024 EDTA_raw: Check dependencies, prepare working directories.
Fri 16 Feb 00:24:43 JST 2024 Start to find LTR candidates.
Fri 16 Feb 00:24:43 JST 2024 Identify LTR retrotransposon candidates from scratch.
Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Fri 16 Feb 00:25:16 JST 2024 Finish finding LTR candidates.
Fri 16 Feb 00:25:16 JST 2024 Start to find SINE candidates.
cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!
ERROR: Raw SINE results not found in genome.fa.mod.EDTA.raw/genome.fa.mod.SINE.raw.fa
If you believe the program is working properly, this may be caused by the lack of SINEs in your genome.
I also tried --force 1. The test finished with warnings:
$ EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10 --force 1
#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0 #####
##### Shujun Ou ([email protected]) #####
#########################################################
Parameters: --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10 --force 1
Fri 16 Feb 00:29:29 JST 2024 Dependency checking:
All passed!
A custom library ../database/rice7.0.0.liban is provided via --curatedlib. Please make sure this is a manually curated library but not machine generated.
A CDS file genome.cds.fa is provided via --cds. Please make sure this is the DNA sequence of coding regions only.
A BED file is provided via --exclude. Regions specified by this file will be excluded from TE annotation and masking.
Fri 16 Feb 00:29:30 JST 2024 Obtain raw TE libraries using various structure-based programs:
Fri 16 Feb 00:29:30 JST 2024 EDTA_raw: Check dependencies, prepare working directories.
Fri 16 Feb 00:29:31 JST 2024 Start to find LTR candidates.
Fri 16 Feb 00:29:31 JST 2024 Identify LTR retrotransposon candidates from scratch.
Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Fri 16 Feb 00:30:04 JST 2024 Finish finding LTR candidates.
Fri 16 Feb 00:30:04 JST 2024 Start to find SINE candidates.
cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!
cat: genome.fa.mod.TIR.intact.raw.bed: No such file or directory
cat: genome.fa.mod.Helitron.intact.raw.bed: No such file or directory
Fri 16 Feb 00:30:04 JST 2024 Obtain raw TE libraries finished.
All intact TEs found by EDTA:
genome.fa.mod.EDTA.intact.raw.fa
genome.fa.mod.EDTA.intact.raw.gff3
Fri 16 Feb 00:30:04 JST 2024 Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library:
Warning: No repetitive sequences were detected in genome.fa.mod.LTR.raw.fa
Warning: No repetitive sequences were detected in genome.fa.mod.SINE.raw.fa
Fri 16 Feb 00:35:07 JST 2024 EDTA advance filtering finished.
Fri 16 Feb 00:35:07 JST 2024 Perform EDTA final steps to generate a non-redundant comprehensive TE library.
cp: cannot stat '../genome.fa.mod.EDTA.raw/genome.fa.mod.RM2.fa': No such file or directory
Skipping the RepeatModeler results (--sensitive 0).
Run EDTA.pl --step final --sensitive 1 if you want to add RepeatModeler results.
Fri 16 Feb 00:35:08 JST 2024 Clean up TE-related sequences in the CDS file with TEsorter.
Remove CDS-related sequences in the EDTA library.
Remove CDS-related sequences in intact TEs.
Fri 16 Feb 00:39:23 JST 2024 Combine the high-quality TE library rice7.0.0.liban with the EDTA library:
Fri 16 Feb 00:41:42 JST 2024 EDTA final stage finished! You may check out:
The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
Family names of intact TEs have been updated by rice7.0.0.liban: genome.fa.mod.EDTA.intact.gff3
Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa
Fri 16 Feb 00:41:42 JST 2024 Perform post-EDTA analysis for whole-genome annotation:
Fri 16 Feb 00:41:42 JST 2024 Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.
Fri 16 Feb 00:42:04 JST 2024 TE annotation using the EDTA library has finished! Check out:
Whole-genome TE annotation (total TE: 29.83%): genome.fa.mod.EDTA.TEanno.gff3
Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
Low-threshold TE masking for MAKER gene annotation (masked: 15.63%): genome.fa.mod.MAKER.masked
Fri 16 Feb 00:42:04 JST 2024 Evaluate the level of inconsistency for whole-genome TE annotation:
Fri 16 Feb 00:42:18 JST 2024 Evaluation of TE annotation finished! Check out these files:
Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum
If you want to learn more about the formatting and information of these files, please visit:
https://github.com/oushujun/EDTA/wiki/Making-sense-of-EDTA-usage-and-outputs---Q&A
Do the results look OK?
$ ls -l
total 15238
-rw-r--r-- 1 c-liu bourguignonuni 1000014 Feb 15 18:29 Alyrata.test.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000009 Feb 15 18:29 Col.test.fa
-rw-r--r-- 1 c-liu bourguignonuni 199787 Feb 15 18:29 genome.cds.fa
-rw-r--r-- 1 c-liu bourguignonuni 38 Feb 15 18:29 genome.cds.list
-rw-r--r-- 1 c-liu bourguignonuni 61399 Feb 15 18:29 genome.exclude.bed
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 15 18:29 genome.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 16 00:29 genome.fa.mod
drwxr-sr-x 2 c-liu bourguignonuni 4096 Feb 16 00:42 genome.fa.mod.EDTA.anno
drwxr-sr-x 3 c-liu bourguignonuni 131072 Feb 16 00:35 genome.fa.mod.EDTA.combine
drwxr-sr-x 3 c-liu bourguignonuni 4096 Feb 16 00:41 genome.fa.mod.EDTA.final
-rw-r--r-- 1 c-liu bourguignonuni 2787953 Feb 16 00:41 genome.fa.mod.EDTA.intact.fa
-rw-r--r-- 1 c-liu bourguignonuni 5040 Feb 16 00:41 genome.fa.mod.EDTA.intact.gff3
drwxr-sr-x 7 c-liu bourguignonuni 4096 Feb 16 00:30 genome.fa.mod.EDTA.raw
-rw-r--r-- 1 c-liu bourguignonuni 109850 Feb 16 00:42 genome.fa.mod.EDTA.TEanno.gff3
-rw-r--r-- 1 c-liu bourguignonuni 18759 Feb 16 00:42 genome.fa.mod.EDTA.TEanno.sum
-rw-r--r-- 1 c-liu bourguignonuni 5306510 Feb 16 00:41 genome.fa.mod.EDTA.TElib.fa
-rw-r--r-- 1 c-liu bourguignonuni 0 Feb 16 00:40 genome.fa.mod.EDTA.TElib.novel.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 16 00:42 genome.fa.mod.MAKER.masked
-rw-r--r-- 1 c-liu bourguignonuni 1000010 Feb 15 18:29 Ler.test.fa
-rw-r--r-- 1 c-liu bourguignonuni 543 Feb 15 18:29 memo
-rw-r--r-- 1 c-liu bourguignonuni 996 Feb 15 18:29 README.txt
lrwxrwxrwx 1 c-liu bourguignonuni 73 Feb 16 00:12 rice7.0.0.liban -> /bucket/.mabuya/BourguignonU/Cong/Softwares/EDTA/database/rice7.0.0.liban
However, I do not understand how it makes sense to add rice TEs to distant genomes. In my case I am working with insects that do not have much ecological interaction with rice, and it seems people working with prokaryotes are also using --force 1 (say #405?). Could you please explain this option in a bit more detail? @oushujun
Sincerely,
Cong
Hello,
Thanks for the nice EDTA. I am using EDTA v2.2.0 to analyze an insect genome. However, some insects have no SINEs, which was also reported in this paper (https://doi.org/10.1186/s12915-021-01158-2). How can I finish the EDTA run? Should I try --force 1?
Sincerely,
ShuangXiong Wu
Hello @WuSir312
I am running EDTA with --force 1 and --sensitive 1 for my insect genomes. I manually checked the *.TEanno.sum for a few genomes for which EDTA has already finished, and the results look normal: LINE/SINE are found, the total TE load looks acceptable, and the proportion of LINE looks reasonable.
Sincerely,
Cong
Hello Cong and Shuangxiong,
If you are pretty sure that your genome does not have SINE/LINE or any of the TE types EDTA recognizes, using --force 1 will make sense because EDTA will use rice TE libraries to skip the step and allow EDTA to finish. Using rice sequences likely won't impact your existing TEs because they are probably very dissimilar, which means the rice sequences will do nothing except help you finish the EDTA execution. But if you know that your species has the TE type but EDTA didn't have it annotated due to programmatic errors, using --force 1 will not make sense.
Thanks, Shujun
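As a rough illustration of the advice above, the wrapper below adds --force 1 only when you explicitly ask for it, so the rice fallback is a deliberate choice rather than a default. build_edta_cmd is a hypothetical helper, genome.fa and rep.fna are placeholder file names, and the flags are only those already shown in this thread. It prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print an EDTA.pl command line, adding --force 1 only on request.
# Arguments: genome FASTA, CDS FASTA, optional force switch (1 to enable, default 0).
build_edta_cmd() {
    local genome="$1" cds="$2" force="${3:-0}"
    local cmd="EDTA.pl --genome ${genome} --cds ${cds} --species others --step all --sensitive 1 --anno 1 --threads 30"
    if [ "$force" = "1" ]; then
        # Only appropriate when the genome really lacks the TE type in question.
        cmd="${cmd} --force 1"
    fi
    printf '%s\n' "$cmd"
}

# Placeholder inputs; review the printed command before running it yourself.
build_edta_cmd genome.fa rep.fna 1
```

Printing rather than executing keeps the decision visible: if the SINE step failed because of a programmatic error, the command without --force 1 is the one to debug.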
Hi Dr. Shujun,
Thanks for developing such a great program!
Lately I've also encountered the "SINE results not found!" problem while annotating TE sequences in the pineapple genome with both v2.2.0 and v2.2.1. Here are the errors:
cp: cannot stat 'P1_hap1_FINAL.fasta.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!
ERROR: Raw SINE results not found in P1_hap1_FINAL.fasta.mod.EDTA.raw/P1_hap1_FINAL.fasta.mod.SINE.raw.fa
If you believe the program is working properly, this may be caused by the lack of SINEs in your genome.
But it is strange that I succeeded in annotating the same genome, restricted to the chromosome-level sequences, just days before using the EDTA.pl v2.2.0 pipeline installed by mamba.
I've also tried to annotate only SINEs with EDTA_raw.pl --type sine, and this program surprisingly finished without errors. Here's the output:
.
├── EDTA_SINE.log
├── P1_hap1_FINAL.fasta -> /home/yanyang_liang/ProgramFiles/2024/03_Aco_Annotation/00_Data/01_Genome/P1_hap1_FINAL.fasta
├── P1_hap1_FINAL.fasta.mod
└── P1_hap1_FINAL.fasta.mod.EDTA.raw
    ├── Helitron
    ├── LINE
    ├── LTR
    ├── P1_hap1_FINAL.fasta.mod.SINE.raw.fa
    ├── SINE
    │   ├── HMM_out
    │   ├── P1_hap1_FINAL.fasta_bbb805cef30611ee9c7590e2ba919692-matches.fasta
    │   ├── P1_hap1_FINAL.fasta.mod -> ../../P1_hap1_FINAL.fasta.mod
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln.cleanup
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln.list
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.dirt.list
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.lib
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.pep
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.tsv
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.faa
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.gff3
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.domtbl
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.tsv
    │   ├── P1_hap1_FINAL.fasta.mod.SINE.raw.fa
    │   ├── Seed_SINE.fa
    │   ├── Step1_extend_tsd_input_1.fa
    │   ├── Step1_extend_tsd_input_2.fa
    │   ├── Step1_extend_tsd_input.fa
    │   ├── Step2_extend_blast_input.fa
    │   ├── Step2_extend_blast_input_rename.fa
    │   ├── Step2_tsd_output.fa
    │   ├── Step2_tsd.txt
    │   ├── Step3_blast_output.out
    │   ├── Step3_blast_output.out.fa
    │   ├── Step3_blast_output.paf
    │   ├── Step3_blast_process_output.fa
    │   ├── Step4_rna_input.fasta
    │   ├── Step4_rna_output.fasta
    │   ├── Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
    │   ├── Step4_rna_output.out
    │   ├── Step5_trf_output.fasta
    │   ├── Step6_irf_input.fasta
    │   ├── Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
    │   ├── Step6_irf_output.fasta
    │   ├── Step7_cluster_output.fasta
    │   └── Step7_cluster_output.fasta.clstr
    └── TIR
7 directories, 41 files
I've checked that there are more than a hundred sequences in the file Seed_SINE.fa:
file format type num_seqs sum_len min_len avg_len max_len
Seed_SINE.fa FASTA DNA 122 29,941 98 245.4 755
Any suggestions on how to solve this problem?
Best, Yanyang.
Hi Dr. Shujun,
I think I may have found the answer to this problem. After I added export PATH="$HOME/miniconda3/envs/EDTA/bin:$PATH" to my script, EDTA ran through SINE annotation properly. So it is possible that some environment variable affected the EDTA pipeline.
Thanks, Yanyang.
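For anyone hitting the same thing in a batch script, the fix can be sketched as below. prepend_path is a hypothetical helper, the environment location is assumed from the path quoted above, and the tool names in the loop are illustrative only (EDTA depends on more programs than these).

```shell
#!/usr/bin/env bash
# Hypothetical helper: put a directory at the front of PATH unless it is already first.
prepend_path() {
    local dir="$1"
    case "$PATH" in
        "$dir":*|"$dir") ;;              # already first; leave PATH unchanged
        *) PATH="${dir}:${PATH}" ;;
    esac
    export PATH
}

# Assumed location of the EDTA conda/mamba environment; adjust to your installation.
prepend_path "${HOME}/miniconda3/envs/EDTA/bin"

# Report which executable the shell resolves for a few tools.
for tool in EDTA.pl perl python3; do
    printf '%s -> %s\n' "$tool" "$(command -v "$tool" || echo 'not found')"
done
```

If any tool resolves to a path outside the environment, a system-wide copy may be shadowing the one EDTA was installed with, which would be consistent with the behaviour described above.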
Hi Yanyang,
I am glad you found the solution.
The line of code that reports the error is
die "Error: SINE results not found!\n\n" unless -e "$genome.EDTA.raw/$genome.SINE.raw.fa";
It should work even if your genome file contains a path because this code block handles the path:
my $genome_file = basename($genome);
`ln -s $genome $genome_file` unless -e $genome_file;
$genome = $genome_file;
Thanks, Shujun
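The same idea can be tried outside of EDTA with a few lines of shell. This is only a sketch of the basename-plus-symlink logic described above, not EDTA's own code: link_genome_here is a hypothetical function, and the demo builds a throwaway directory with a tiny made-up FASTA.

```shell
#!/usr/bin/env bash
# Hypothetical function: link a genome into the current directory under its
# base name (unless something with that name already exists) and print that name.
link_genome_here() {
    local genome="$1"
    local genome_file
    genome_file=$(basename "$genome")
    [ -e "$genome_file" ] || ln -s "$genome" "$genome_file"
    printf '%s\n' "$genome_file"
}

# Demo in a temporary directory: the genome lives in data/, the run happens in run/.
workdir=$(mktemp -d)
mkdir -p "${workdir}/data" "${workdir}/run"
printf '>chr1\nACGT\n' > "${workdir}/data/genome.fa"
cd "${workdir}/run" || exit 1

genome=$(link_genome_here "${workdir}/data/genome.fa")
echo "working name: ${genome}"
ls -l "${genome}"
```

After the link is made, every later file name is derived from the base name, which is why a check such as the one for $genome.EDTA.raw/$genome.SINE.raw.fa does not depend on the directory the genome was given in.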