EDTA
Clarification on EDTA output: differences between `intact.fa`, `intact.gff3`, and `raw.fa`
I’m working on detecting intact LTR retrotransposons in a specific chromosome using EDTA. The goal is to understand which output file represents the final set of intact LTRs for downstream analysis.
Here’s the command I used:
EDTA_raw.pl \
--genome chromosome.fasta \
--species others \
--curatedlib curated_library.fa \
--type ltr \
--threads 40 \
--overwrite 1
In the LTR output directory, I see three files:
*.LTR.raw.fa
*.LTR.intact.raw.fa
*.LTR.intact.raw.gff3
To check the results, I compared the number of candidates reported in the pass-list annotation file with the number of sequences in the intact FASTA:
grep -c "repeat_region" *.pass.list.gff3 # returns thousands of candidates
grep -c "long_terminal_repeat" *.pass.list.gff3 # returns thousands of candidates
grep -c "^>" *.LTR.intact.raw.fa # returns only 2 sequences, both near the beginning of the chromosome (some kbp away)
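One caveat about the counts above: `grep -c` counts every line containing the string anywhere, so a feature name that also appears in the attributes column (for example inside an `ID=` or `Parent=` value) would inflate the total. A stricter check, sketched below, counts only lines whose GFF3 type column (column 3) matches exactly. The helper names `count_gff3_type` and `count_fasta_records` are made up for this illustration and are not part of EDTA.

```shell
# Count GFF3 features whose type (column 3) equals the given name exactly.
# Comment and directive lines (starting with "#") are skipped.
# $1 = feature type, $2 = GFF3 file
count_gff3_type() {
    awk -F'\t' -v t="$1" '!/^#/ && $3 == t { n++ } END { print n + 0 }' "$2"
}

# Count FASTA records, i.e. header lines starting with ">".
# $1 = FASTA file
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}
```

These would be called as, for example, `count_gff3_type repeat_region <prefix>.pass.list.gff3` and `count_fasta_records <prefix>.LTR.intact.raw.fa`, where `<prefix>` stands for whatever prefix EDTA produced in the run. If the column-3 counts are still in the thousands while the FASTA holds 2 records, the discrepancy is not an artifact of how `grep` matched.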
My questions:
- Which of these files should be considered the final output for intact LTR detection?
- How do `raw.fa`, `intact.raw.fa`, and `intact.raw.gff3` differ in terms of filtering and intended use?