EDTA crashed after no SINE found

Open CongLiu37 opened this issue 1 year ago • 5 comments

Hello,

I am using EDTA v2.2.0 to process my insect genomes. The command looks like this:

EDTA.pl --genome ${genome.fa} --species others --step all --overwrite 0 --sensitive 1 --anno 1 --threads 30 --cds ${rep.fna}

The program crashed after failing to find SINEs:

Thu  1 Feb 23:29:39 JST 2024	EDTA_raw: Check dependencies, prepare working directories.

Thu  1 Feb 23:29:41 JST 2024	Start to find LTR candidates.

Thu  1 Feb 23:29:41 JST 2024	Identify LTR retrotransposon candidates from scratch.

Fri  2 Feb 00:09:12 JST 2024	Finish finding LTR candidates.

Fri  2 Feb 00:09:12 JST 2024	Start to find SINE candidates.

cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

ERROR: Raw SINE results not found in genome.fa.mod.EDTA.raw/genome.fa.mod.SINE.raw.fa
	If you believe the program is working properly, this may be caused by the lack of SINEs in your genome.

This might make some sense, as RepeatModeler+RepeatMasker estimated a low SINE load in my genomes (<5% in most cases, generally 1.5%-3%). So I am wondering whether there is any way to finish the EDTA pipeline even if no SINE is found in the genome?

Sincerely,

Cong

CongLiu37 avatar Feb 02 '24 04:02 CongLiu37

That's abnormal. In 2.2.0, it's allowed to have 0 SINE or LINE found. Maybe you were using a slightly older version. Do you see anything in the raw/SINE folder?

Shujun

oushujun avatar Feb 02 '24 15:02 oushujun

I am using EDTA v2.2.0 installed by mamba:

$ EDTA.pl

#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0  #####
##### Shujun Ou ([email protected])             #####
#########################################################


Parameters: 


At least 1 parameter is required:
1) Input fasta file: --genome

This is the Extensive de-novo TE Annotator that generates a high-quality
structure-based TE library. Usage:

There is basically nothing in raw/SINE:

$ ls genome.fa.mod.EDTA.raw/SINE/
genome.fa.mod
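
To go one step beyond listing the directory, the raw SINE file named in the error message can be checked directly. The following is a minimal sketch, not part of EDTA: `count_fasta_records` is a hypothetical helper name, and the example path is simply the one from the error above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EDTA): print "missing" if the FASTA
# file does not exist, otherwise the number of sequences it contains.
count_fasta_records() {
    local fasta="$1"
    if [ ! -e "$fasta" ]; then
        echo "missing"
        return 0
    fi
    # Count header lines. grep -c exits non-zero when the count is 0,
    # so "|| true" keeps the function from failing on an empty file.
    grep -c '^>' "$fasta" || true
}

# Example usage, with the path taken from the error message:
# count_fasta_records genome.fa.mod.EDTA.raw/genome.fa.mod.SINE.raw.fa
```

A result of "missing" means the file was never written, which is a different situation from a file that exists but holds 0 sequences.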

Sincerely,

Cong

CongLiu37 avatar Feb 03 '24 04:02 CongLiu37

Please pull the GitHub version instead, thanks!

Shujun

oushujun avatar Feb 03 '24 14:02 oushujun

Will it work if you add --force 1 to add the rice (I think) repeats to your command ?

colindaven avatar Feb 12 '24 13:02 colindaven

Hello,

I tried pulling EDTA from GitHub while keeping all dependencies in mamba, but it still failed on the test data:

$ EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10

#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0  #####
##### Shujun Ou ([email protected])             #####
#########################################################


Parameters: --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10


Fri 16 Feb 00:24:41 JST 2024	Dependency checking:
				All passed!

	A custom library ../database/rice7.0.0.liban is provided via --curatedlib. Please make sure this is a manually curated library but not machine generated.

	A CDS file genome.cds.fa is provided via --cds. Please make sure this is the DNA sequence of coding regions only.

	A BED file is provided via --exclude. Regions specified by this file will be excluded from TE annotation and masking.

Fri 16 Feb 00:24:42 JST 2024	Obtain raw TE libraries using various structure-based programs: 
Fri 16 Feb 00:24:42 JST 2024	EDTA_raw: Check dependencies, prepare working directories.

Fri 16 Feb 00:24:43 JST 2024	Start to find LTR candidates.

Fri 16 Feb 00:24:43 JST 2024	Identify LTR retrotransposon candidates from scratch.

Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Fri 16 Feb 00:25:16 JST 2024	Finish finding LTR candidates.

Fri 16 Feb 00:25:16 JST 2024	Start to find SINE candidates.

cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

ERROR: Raw SINE results not found in genome.fa.mod.EDTA.raw/genome.fa.mod.SINE.raw.fa
	If you believe the program is working properly, this may be caused by the lack of SINEs in your genome. 

I also tried --force 1. The test finished with warnings:

$ EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10 --force 1

#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0  #####
##### Shujun Ou ([email protected])             #####
#########################################################


Parameters: --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice7.0.0.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --threads 10 --force 1


Fri 16 Feb 00:29:29 JST 2024	Dependency checking:
				All passed!

	A custom library ../database/rice7.0.0.liban is provided via --curatedlib. Please make sure this is a manually curated library but not machine generated.

	A CDS file genome.cds.fa is provided via --cds. Please make sure this is the DNA sequence of coding regions only.

	A BED file is provided via --exclude. Regions specified by this file will be excluded from TE annotation and masking.

Fri 16 Feb 00:29:30 JST 2024	Obtain raw TE libraries using various structure-based programs: 
Fri 16 Feb 00:29:30 JST 2024	EDTA_raw: Check dependencies, prepare working directories.

Fri 16 Feb 00:29:31 JST 2024	Start to find LTR candidates.

Fri 16 Feb 00:29:31 JST 2024	Identify LTR retrotransposon candidates from scratch.

Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
Fri 16 Feb 00:30:04 JST 2024	Finish finding LTR candidates.

Fri 16 Feb 00:30:04 JST 2024	Start to find SINE candidates.

cp: cannot stat 'genome.fa.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

cat: genome.fa.mod.TIR.intact.raw.bed: No such file or directory
cat: genome.fa.mod.Helitron.intact.raw.bed: No such file or directory
Fri 16 Feb 00:30:04 JST 2024	Obtain raw TE libraries finished.
				All intact TEs found by EDTA: 
					genome.fa.mod.EDTA.intact.raw.fa 
					genome.fa.mod.EDTA.intact.raw.gff3

Fri 16 Feb 00:30:04 JST 2024	Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library: 


Warning: No repetitive sequences were detected in genome.fa.mod.LTR.raw.fa

Warning: No repetitive sequences were detected in genome.fa.mod.SINE.raw.fa
Fri 16 Feb 00:35:07 JST 2024	EDTA advance filtering finished.

Fri 16 Feb 00:35:07 JST 2024	Perform EDTA final steps to generate a non-redundant comprehensive TE library.

cp: cannot stat '../genome.fa.mod.EDTA.raw/genome.fa.mod.RM2.fa': No such file or directory
				Skipping the RepeatModeler results (--sensitive 0).
				Run EDTA.pl --step final --sensitive 1 if you want to add RepeatModeler results.

Fri 16 Feb 00:35:08 JST 2024	Clean up TE-related sequences in the CDS file with TEsorter.

				Remove CDS-related sequences in the EDTA library.

				Remove CDS-related sequences in intact TEs.

Fri 16 Feb 00:39:23 JST 2024	Combine the high-quality TE library rice7.0.0.liban with the EDTA library:

Fri 16 Feb 00:41:42 JST 2024	EDTA final stage finished! You may check out:
				The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
				Family names of intact TEs have been updated by rice7.0.0.liban: genome.fa.mod.EDTA.intact.gff3
				Comparing to the provided library, EDTA found these novel TEs: genome.fa.mod.EDTA.TElib.novel.fa
				The provided library has been incorporated into the final library: genome.fa.mod.EDTA.TElib.fa

Fri 16 Feb 00:41:42 JST 2024	Perform post-EDTA analysis for whole-genome annotation:

Fri 16 Feb 00:41:42 JST 2024	Homology-based annotation of TEs using genome.fa.mod.EDTA.TElib.fa from scratch.

Fri 16 Feb 00:42:04 JST 2024	TE annotation using the EDTA library has finished! Check out:
				Whole-genome TE annotation (total TE: 29.83%): genome.fa.mod.EDTA.TEanno.gff3
				Whole-genome TE annotation summary: genome.fa.mod.EDTA.TEanno.sum
				Low-threshold TE masking for MAKER gene annotation (masked: 15.63%): genome.fa.mod.MAKER.masked

Fri 16 Feb 00:42:04 JST 2024	Evaluate the level of inconsistency for whole-genome TE annotation:

Fri 16 Feb 00:42:18 JST 2024	Evaluation of TE annotation finished! Check out these files:

				Overall: genome.fa.mod.EDTA.TE.fa.stat.all.sum
				Nested: genome.fa.mod.EDTA.TE.fa.stat.nested.sum
				Non-nested: genome.fa.mod.EDTA.TE.fa.stat.redun.sum

				If you want to learn more about the formatting and information of these files, please visit:
					https://github.com/oushujun/EDTA/wiki/Making-sense-of-EDTA-usage-and-outputs---Q&A

The results look OK?

$ ls -l 
total 15238
-rw-r--r-- 1 c-liu bourguignonuni 1000014 Feb 15 18:29 Alyrata.test.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000009 Feb 15 18:29 Col.test.fa
-rw-r--r-- 1 c-liu bourguignonuni  199787 Feb 15 18:29 genome.cds.fa
-rw-r--r-- 1 c-liu bourguignonuni      38 Feb 15 18:29 genome.cds.list
-rw-r--r-- 1 c-liu bourguignonuni   61399 Feb 15 18:29 genome.exclude.bed
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 15 18:29 genome.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 16 00:29 genome.fa.mod
drwxr-sr-x 2 c-liu bourguignonuni    4096 Feb 16 00:42 genome.fa.mod.EDTA.anno
drwxr-sr-x 3 c-liu bourguignonuni  131072 Feb 16 00:35 genome.fa.mod.EDTA.combine
drwxr-sr-x 3 c-liu bourguignonuni    4096 Feb 16 00:41 genome.fa.mod.EDTA.final
-rw-r--r-- 1 c-liu bourguignonuni 2787953 Feb 16 00:41 genome.fa.mod.EDTA.intact.fa
-rw-r--r-- 1 c-liu bourguignonuni    5040 Feb 16 00:41 genome.fa.mod.EDTA.intact.gff3
drwxr-sr-x 7 c-liu bourguignonuni    4096 Feb 16 00:30 genome.fa.mod.EDTA.raw
-rw-r--r-- 1 c-liu bourguignonuni  109850 Feb 16 00:42 genome.fa.mod.EDTA.TEanno.gff3
-rw-r--r-- 1 c-liu bourguignonuni   18759 Feb 16 00:42 genome.fa.mod.EDTA.TEanno.sum
-rw-r--r-- 1 c-liu bourguignonuni 5306510 Feb 16 00:41 genome.fa.mod.EDTA.TElib.fa
-rw-r--r-- 1 c-liu bourguignonuni       0 Feb 16 00:40 genome.fa.mod.EDTA.TElib.novel.fa
-rw-r--r-- 1 c-liu bourguignonuni 1000007 Feb 16 00:42 genome.fa.mod.MAKER.masked
-rw-r--r-- 1 c-liu bourguignonuni 1000010 Feb 15 18:29 Ler.test.fa
-rw-r--r-- 1 c-liu bourguignonuni     543 Feb 15 18:29 memo
-rw-r--r-- 1 c-liu bourguignonuni     996 Feb 15 18:29 README.txt
lrwxrwxrwx 1 c-liu bourguignonuni      73 Feb 16 00:12 rice7.0.0.liban -> /bucket/.mabuya/BourguignonU/Cong/Softwares/EDTA/database/rice7.0.0.liban

However, I do not understand why it makes sense to add rice TEs to distant genomes. In my case I am working with insects that do not have much ecological interaction with rice, and it seems people working with prokaryotes are also using --force 1 (say #405?). Could you please explain this option in a bit more detail? @oushujun

Sincerely,

Cong

CongLiu37 avatar Feb 15 '24 15:02 CongLiu37

Hello, thanks for the nice EDTA tool. I am using EDTA v2.2.0 to analyze an insect genome. However, some insects have no SINEs, as was also reported in this paper (https://doi.org/10.1186/s12915-021-01158-2). How can I get EDTA to finish? Should I try --force 1? Sincerely,

ShuangXiong Wu

WuSir312 avatar Mar 13 '24 11:03 WuSir312

Hello @WuSir312

I am running EDTA with --force 1 and --sensitive 1 for my insect genomes. I manually checked the *.TEanno.sum for a few genomes for which EDTA has already finished, and the results look normal: LINE/SINE are found, the total TE load looks acceptable, and the proportion of LINE looks reasonable.

Sincerely,

Cong

CongLiu37 avatar Mar 16 '24 13:03 CongLiu37

Hello Cong and Shuangxiong,

If you are pretty sure that your genome does not have SINE/LINE or any of the TE types EDTA recognizes, using --force 1 will make sense because EDTA will use rice TE libraries to skip the step and allow EDTA to finish. Using rice sequences likely won't impact your existing TEs because they are probably very dissimilar, which means the rice sequences will do nothing except help you finish the EDTA execution. But if you know that your species has the TE type but EDTA didn't have it annotated due to programmatic errors, using --force 1 will not make sense.
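
To help tell "the TE type is genuinely absent" apart from "the step did not run", it can be useful to look at every raw library at once before reaching for --force 1. This is a minimal sketch under stated assumptions: `summarise_raw_libs` is a hypothetical helper, and the `<genome>.<TYPE>.raw.fa` naming is taken from the LTR and SINE messages in the logs above and only assumed to hold for the other types.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EDTA): for each TE type, report whether
# its raw library is missing, empty, or present (non-empty).
# Assumption: files are named <genome>.<TYPE>.raw.fa inside <genome>.EDTA.raw/,
# as seen for LTR and SINE in the logs; other types may differ.
summarise_raw_libs() {
    local genome="$1"
    local raw_dir="$genome.EDTA.raw"
    local te_type file
    for te_type in LTR SINE LINE TIR Helitron; do
        file="$raw_dir/$genome.$te_type.raw.fa"
        if [ ! -e "$file" ]; then
            echo "$te_type missing"
        elif [ ! -s "$file" ]; then
            echo "$te_type empty"
        else
            echo "$te_type present"
        fi
    done
}

# Example usage, run from the directory that contains genome.fa.mod.EDTA.raw:
# summarise_raw_libs genome.fa.mod
```

A library that is "missing" together with an almost empty working directory points more towards the step not running than towards a genome without that TE type.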

Thanks, Shujun

oushujun avatar Mar 18 '24 20:03 oushujun

Hi Dr. Shujun,

Thanks for developing such a great program!

Lately I've also encountered the SINE results not found! problem while annotating TE sequences in the pineapple genome with either v2.2.0 or v2.2.1. Here is the error:

cp: cannot stat 'P1_hap1_FINAL.fasta.mod.SINE.raw.fa': No such file or directory
Error: SINE results not found!

ERROR: Raw SINE results not found in P1_hap1_FINAL.fasta.mod.EDTA.raw/P1_hap1_FINAL.fasta.mod.SINE.raw.fa
	If you believe the program is working properly, this may be caused by the lack of SINEs in your genome.

But it is strange, because just days before I succeeded in annotating the same genome, using only the chromosome-level sequences, with the EDTA.pl v2.2.0 pipeline installed by mamba.

I've also tried to annotate only SINE repeats with EDTA_raw.pl --type sine, and this program surprisingly finished without errors. Here's the output:

.
├── EDTA_SINE.log
├── P1_hap1_FINAL.fasta -> /home/yanyang_liang/ProgramFiles/2024/03_Aco_Annotation/00_Data/01_Genome/P1_hap1_FINAL.fasta
├── P1_hap1_FINAL.fasta.mod
└── P1_hap1_FINAL.fasta.mod.EDTA.raw
    ├── Helitron
    ├── LINE
    ├── LTR
    ├── P1_hap1_FINAL.fasta.mod.SINE.raw.fa
    ├── SINE
    │   ├── HMM_out
    │   ├── P1_hap1_FINAL.fasta_bbb805cef30611ee9c7590e2ba919692-matches.fasta
    │   ├── P1_hap1_FINAL.fasta.mod -> ../../P1_hap1_FINAL.fasta.mod
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln.cleanup
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.cln.list
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.dirt.list
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.lib
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.pep
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.cls.tsv
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.faa
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.gff3
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.domtbl
    │   ├── P1_hap1_FINAL.fasta.mod.AnnoSINE.raw.fa.rexdb.dom.tsv
    │   ├── P1_hap1_FINAL.fasta.mod.SINE.raw.fa
    │   ├── Seed_SINE.fa
    │   ├── Step1_extend_tsd_input_1.fa
    │   ├── Step1_extend_tsd_input_2.fa
    │   ├── Step1_extend_tsd_input.fa
    │   ├── Step2_extend_blast_input.fa
    │   ├── Step2_extend_blast_input_rename.fa
    │   ├── Step2_tsd_output.fa
    │   ├── Step2_tsd.txt
    │   ├── Step3_blast_output.out
    │   ├── Step3_blast_output.out.fa
    │   ├── Step3_blast_output.paf
    │   ├── Step3_blast_process_output.fa
    │   ├── Step4_rna_input.fasta
    │   ├── Step4_rna_output.fasta
    │   ├── Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
    │   ├── Step4_rna_output.out
    │   ├── Step5_trf_output.fasta
    │   ├── Step6_irf_input.fasta
    │   ├── Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
    │   ├── Step6_irf_output.fasta
    │   ├── Step7_cluster_output.fasta
    │   └── Step7_cluster_output.fasta.clstr
    └── TIR

7 directories, 41 files

I've checked that there are over a hundred sequences in the file Seed_SINE.fa:

file          format  type  num_seqs  sum_len  min_len  avg_len  max_len
Seed_SINE.fa  FASTA   DNA        122   29,941       98    245.4      755

Any suggestions that I can take to solve this problem?

Best, Yanyang.

yyliang12 avatar Apr 04 '24 15:04 yyliang12

Hi Dr. Shujun,

I think I might have found the answer to this problem. After I added export PATH="$HOME/miniconda3/envs/EDTA/bin:$PATH" to my script, EDTA ran through SINE annotation properly. So it is possible that some environment variable affected the EDTA pipeline.
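
For anyone wanting to confirm the same thing in their own job script, here is a minimal sketch that reports where a command resolves on PATH and whether that location is inside a given directory. `resolves_under` is a hypothetical helper name, and the environment path in the usage comment is only an example that should be adapted to the local installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a command found on PATH lives under a
# given directory (for example the bin directory of a conda environment).
# Returns 0 if it does, 1 if the command is not found, 2 if it resolves elsewhere.
resolves_under() {
    local cmd="$1" prefix="$2" path
    path=$(command -v "$cmd") || { echo "$cmd: not found on PATH"; return 1; }
    case "$path" in
        "$prefix"/*) echo "$cmd: $path"; return 0 ;;
        *)           echo "$cmd: $path (outside $prefix)"; return 2 ;;
    esac
}

# Example usage (the environment location is an assumption, adjust as needed):
# export PATH="$HOME/miniconda3/envs/EDTA/bin:$PATH"
# resolves_under EDTA.pl "$HOME/miniconda3/envs/EDTA/bin"
# resolves_under perl "$HOME/miniconda3/envs/EDTA/bin"
```

Running the check inside the batch script itself matters, because a scheduler job may start with a different PATH than an interactive shell.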

Thanks, Yanyang.

yyliang12 avatar Apr 05 '24 12:04 yyliang12

Hi Yanyang,

I am glad you found the solution.

The line of code that reports the error is

die "Error: SINE results not found!\n\n" unless -e "$genome.EDTA.raw/$genome.SINE.raw.fa";

It should work even if your genome file contains a path because this code block handles the path:

my $genome_file = basename($genome);
`ln -s $genome $genome_file` unless -e $genome_file;
$genome = $genome_file;
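
The same idea can be sketched in shell for readers less familiar with Perl: strip the directory part of the genome path, symlink the file into the working directory if nothing with that name exists yet, and carry on with the bare file name. `link_genome_locally` is a hypothetical name used only for illustration, not something EDTA provides.

```shell
#!/usr/bin/env bash
# Hypothetical shell equivalent of the path handling described above.
# Prints the bare file name that later steps would use.
link_genome_locally() {
    local genome="$1" genome_file
    genome_file=$(basename "$genome")
    # Create the symlink only when no file or link of that name is present.
    [ -e "$genome_file" ] || ln -s "$genome" "$genome_file"
    echo "$genome_file"
}

# Example usage (the path is a placeholder):
# genome=$(link_genome_locally /path/to/genome.fa)
```

Because later output names are built from the bare file name, the raw results land in the current working directory regardless of where the genome file itself is stored.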

Thanks, Shujun

oushujun avatar Apr 12 '24 15:04 oushujun