
Use docker image to drive GitHub version EDTA

Open Wanjie-Feng opened this issue 6 months ago • 6 comments

Hi Shujun, thank you for developing this tool! I ran into some problems while running EDTA on the test file. Here is the command I ran:

perl ../EDTA.pl --genome genome.fa --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 80

The problematic output is as follows:

2023年 12月 11日 星期一 19:51:52 CST    Start to find LTR candidates.

2023年 12月 11日 星期一 19:51:52 CST    Identify LTR retrotransposon candidates from scratch.

awk: cannot open genome.fa.mod.retriever.scn.extend.fa.rexdb.cls.tsv (No such file or directory)
Warning: LOC list genome.fa.mod.ltrTE.veryfalse is empty.
2023年 12月 11日 星期一 19:52:11 CST    Finish finding LTR candidates.
2023年 12月 11日 星期一 19:52:54 CST    Perform EDTA advance filtering for raw TE candidates and generate the stage 1 library: 

No such file or directory at /Data6/wanjie/MP/01hifiasm_hifi_ont_hic/02contig/07mapping_2_NCBI/Gp03/Repeatannotation/testEDTA/EDTA/util/TE_purifier.pl line 108.

        Input file "genome.fa.mod.LTR.raw.fa-genome.fa.mod.TIR.raw.fa.fa" not found!


        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program
        
The TE1 file genome.fa.mod.LTR.raw.fa.HQ is not found or it's empty!

        A script to purify a TE library based on another TE file containing the target contaminant.
        This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
                Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
                options:        -TE1    [fasta] The file to be purified.
                                -TE2    [fasta] The file that mainly consists of TE1 contaminants.
                                -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
                                -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
                                -miniden        [int]   The minimum identity (%) to be considered a real match. Default: 60
                                -mindiff        [float] The minimum fold difference in richness between TE1 and TE2 for a 
                                                        sequence to be considered as real to TE1.
                                -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
                                -blastplus      [path]  The directory containing Blastn (default: read from ENV)
                                -threads        [int]   Number of theads to run this script
                                -help|-h        Display this help info


        Input file "genome.fa.mod.LTR.raw.fa.HQ-genome.fa.mod.Helitron.raw.fa.fa" not found!


        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program
        
No such file or directory at /Data6/wanjie/MP/01hifiasm_hifi_ont_hic/02contig/07mapping_2_NCBI/Gp03/Repeatannotation/testEDTA/EDTA/util/TE_purifier.pl line 108.

        Input file "genome.fa.mod.Helitron.raw.fa-genome.fa.mod.TIR.raw.fa.fa" not found!


        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program
        
The TE1 file genome.fa.mod.Helitron.raw.fa.HQ is not found or it's empty!

        A script to purify a TE library based on another TE file containing the target contaminant.
        This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
                Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
                options:        -TE1    [fasta] The file to be purified.
                                -TE2    [fasta] The file that mainly consists of TE1 contaminants.
                                -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
                                -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
                                -miniden        [int]   The minimum identity (%) to be considered a real match. Default: 60
                                -mindiff        [float] The minimum fold difference in richness between TE1 and TE2 for a 
                                                        sequence to be considered as real to TE1.
                                -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
                                -blastplus      [path]  The directory containing Blastn (default: read from ENV)
                                -threads        [int]   Number of theads to run this script
                                -help|-h        Display this help info


        Input file "genome.fa.mod.Helitron.raw.fa.HQ-genome.fa.mod.LTR.raw.fa.fa" not found!


        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program
        
No such file or directory at /Data6/wanjie/MP/01hifiasm_hifi_ont_hic/02contig/07mapping_2_NCBI/Gp03/Repeatannotation/testEDTA/EDTA/util/TE_purifier.pl line 108.

        Input file "genome.fa.mod.TIR.raw.fa-genome.fa.mod.LTR.raw.fa.fa" not found!


        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program
        
The TE1 file genome.fa.mod.TIR.raw.fa.HQ is not found or it's empty!

        A script to purify a TE library based on another TE file containing the target contaminant.
        This is to use the richness difference between TE1 and TE2. Real contaminants in TE1 is rare but rich in TE2.
                Usage: perl TE_purifier.pl -TE1 [fasta] -TE2 [fasta]
                options:        -TE1    [fasta] The file to be purified.
                                -TE2    [fasta] The file that mainly consists of TE1 contaminants.
                                -lower  [0|1]   Mask contaminants in TE1 with lowercase letters (1, default) or Ns (0).
                                -minlen [int]   The shortest length (bp) of sequence matches to be considered. Default: 50
                                -miniden        [int]   The minimum identity (%) to be considered a real match. Default: 60
                                -mindiff        [float] The minimum fold difference in richness between TE1 and TE2 for a 
                                                        sequence to be considered as real to TE1.
                                -repeatmasker   [path]  The directory containing RepeatMasker (default: read from ENV)
                                -blastplus      [path]  The directory containing Blastn (default: read from ENV)
                                -threads        [int]   Number of theads to run this script
                                -help|-h        Display this help info


        Input file "genome.fa.mod.TIR.raw.fa.HQ-genome.fa.mod.Helitron.raw.fa.fa" not found!


        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program
        

RepeatMasker version 4.1.5

WARNING: The nolow option should be used with caution.  This option
         doesn't simply filter out simple repeats and low-complexity
         annotations from the output, rather it doesn't run these
         searches at all.  The simple repeats, and low-complexity
         sequences may then be falsely annotated as fragments of
         TE families that contain short stretches of them.

Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
RepeatMasker::setspecies: Could not find user specified library genome.fa.mod.LTR.raw.fa.HQ, or the file is empty.


        Input file "genome.fa.mod.TIR.Helitron.fa.stg1.raw.masked" not found!


        Usage: perl cleanup_tandem.pl -f sample.fa [options] > sample.cln.fa 
        Options:
                -misschar       [n|l]   Define the letter representing unknown sequences; default: n. l: recognize lower case letters
                -Nscreen        [0|1]   Enable (1) or disable (0) the -nc parameter; default: 1
                -nc             [int]   Ambuguous sequence len cutoff; discard the entire sequence if > this number; default: 0
                -nr             [0-1]   Ambuguous sequence percentage cutoff; discard the entire sequence if > this number; default: 1
                -minlen         [int]   Minimum sequence length filter after clean up; default: 100 (bp)
                -maxlen         [int]   Maximum sequence length filter after clean up; default: 25000 (bp)
                -cleanN         [0|1]   Retain (0) or remove (1) the -misschar taget in output sequence; default: 0
                -cleanT         [0|1]   Remove entire seq. if any terminal seq (20bp) has 15bp of N (1); disabled by default (0).
                -minrm          [int]   The minimum length of -misschar to be removed if -cleanN 1; default: 1.
                -trf            [0|1]   Enable (1) or disable (0) tandem repeat finder (trf); default: 1
                -trf_path       path    Path to the trf program
        

ERROR: Input sequence file is not exist!
Iteratively clean up nested TE insertions and remove redundancy.

Further info:
Each sequence will be used as query to search the entire file.
For a subject sequence containing >95% of the query sequence, the matching part in the subject will be removed.
After removal, subject sequences shorter than the threadshold will be diacarded.
The number of rounds of iterations is automatically decided (usually less than 8). User can also define this.

Usage:
perl cleanup_nested.pl -in file.fasta [options]
-in     [file]  Input sequence file in FASTA format
-cov    [float] Minimum coverage of the query sequence to be considered as nesting. Default: 0.95
-minlen [int]   Minimum length of the clean sequence to retain. Default: 80 (bp)
-miniden        [int]   Minimum identity of the clean sequence to retain. Default: 80 (%)
-clean  [int]   Clean nested sequences (1) or not (0). Default: 1
-iter   [int]   Numbers of iteration to remove redundency. Default: automatic
-blastplus [path]       Path to the blastn and makeblastdb program.
-threads|-t     [int]   Threads to run this script. Default: 4

cat: genome.fa.mod.TIR.Helitron.fa.stg1.raw.cln.cln: No such file or directory
2023年 12月 11日 星期一 19:53:02 CST    EDTA advance filtering finished.

2023年 12月 11日 星期一 19:53:02 CST    Perform EDTA final steps to generate a non-redundant comprehensive TE library:

                                Use RepeatModeler to identify any remaining TEs that are missed by structure-based methods.

cat: 'RM_*/consensi.fa': No such file or directory
                                RepeatModeler is finished, but no consensi.fa files found.

                                Skipping the CDS cleaning step (--cds [File]) since no CDS file is provided or it's empty.

2023年 12月 11日 星期一 19:53:40 CST    EDTA final stage finished! You may check out:
                                The final EDTA TE library: genome.fa.mod.EDTA.TElib.fa
2023年 12月 11日 星期一 19:53:40 CST    Perform post-EDTA analysis for whole-genome annotation:

I'm not sure whether these errors affect the results. Can you give me some advice?
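To narrow down which steps failed, the files named in the messages above can be checked for being missing or empty. The following is only a minimal sketch: the file names are copied from the log, the helper name `check_outputs` is hypothetical, and it assumes the files would sit directly in the directory passed in, which the log does not confirm.

```python
from pathlib import Path

# File names copied from the error messages in the log above.
EXPECTED = [
    "genome.fa.mod.retriever.scn.extend.fa.rexdb.cls.tsv",
    "genome.fa.mod.LTR.raw.fa.HQ",
    "genome.fa.mod.Helitron.raw.fa.HQ",
    "genome.fa.mod.TIR.raw.fa.HQ",
    "genome.fa.mod.EDTA.TElib.fa",
]


def check_outputs(directory, names=EXPECTED):
    """Return {file name: 'missing' | 'empty' | 'ok'} for each expected file.

    Assumption: the files are looked up directly under `directory`.
    """
    status = {}
    for name in names:
        path = Path(directory) / name
        if not path.is_file():
            status[name] = "missing"
        elif path.stat().st_size == 0:
            status[name] = "empty"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    # Check the current working directory and print one line per file.
    for name, state in check_outputs(".").items():
        print(f"{state:<8} {name}")
```

Run from the directory where EDTA was started, this prints one status per file. A `missing` or `empty` entry for the raw LTR, TIR, or Helitron libraries would point to the step that produces them as the place to look first.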

Wanjie-Feng, Dec 11 '23 12:12