EDTA
TE_XXX in gff3 from panEDTA
Hello,
I am using EDTA and panEDTA to annotate the genomes of 40 related species. I annotated each genome individually with EDTA v2.2.0 and generated a panEDTA library. Then, for each genome, I ran:
RepeatMasker -e ncbi -pa 40 -q -div 40 -lib ${panEDTA.TElib} -cutoff 225 -gff ${genome}.mod.panEDTA > /dev/null
perl -i -nle 's/\s+DNA\s+/\tDNA\/unknown\t/; print $_' ${genome}.mod.panEDTA.out
EDTA.pl --genome ${genome} -t 40 --step final --anno 1 --curatedlib ${panEDTA.TElib} --cds ${cds} --rmout ${genome}.mod.panEDTA.out
These commands are copied from panEDTA.sh so that I can run the genomes in parallel.
As I understand it, each sequence in the panEDTA TE library should represent a TE family. I am trying to extract the genomic sequences for each TE family, and I found some unexpected values of Name in the attributes field of TEanno.gff3:
(1) Some panTE_XXX names appear in the gff3 but not in panEDTA.TElib. Instead, panEDTA.TElib contains panTE_XXX_INT and panTE_XXX_LTR.
(2) Some TE_XXX names appear in the gff3 but not in panEDTA.TElib.
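A check of this kind can be sketched as follows. This is only an illustration, not part of EDTA: the function names are hypothetical, and it assumes that library headers look like `>name#class` and that the family name is stored in a `Name=` attribute in column 9 of the gff3. The example records are made up.

```python
import re

def library_names(fasta_text):
    """Collect sequence names from a TE library, dropping the '#class' suffix."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed header form: '>panTE_1_INT#LTR/Gypsy'
            names.add(line[1:].split()[0].split("#")[0])
    return names

def gff_names(gff_text):
    """Collect the Name= values from column 9 of a gff3."""
    names = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        match = re.search(r"(?:^|;)Name=([^;]+)", cols[8])
        if match:
            names.add(match.group(1))
    return names

def classify_missing(gff, lib):
    """Split gff3 names absent from the library into two groups:
    names that exist only as <name>_INT / <name>_LTR, and names absent entirely."""
    split_ltr, absent = set(), set()
    for name in gff - lib:
        if name + "_INT" in lib or name + "_LTR" in lib:
            split_ltr.add(name)
        else:
            absent.add(name)
    return split_ltr, absent

# Made-up example data, only to show the two cases described above.
example_lib = ">panTE_1_INT#LTR/Gypsy\nACGT\n>panTE_1_LTR#LTR/Gypsy\nACGT\n>panTE_2#DNA/Helitron\nACGT\n"
example_gff = (
    "##gff-version 3\n"
    "chr1\tEDTA\tGypsy_LTR_retrotransposon\t100\t900\t.\t+\t.\tID=a;Name=panTE_1;Classification=LTR/Gypsy\n"
    "chr1\tEDTA\thelitron\t1000\t1200\t.\t+\t.\tID=b;Name=panTE_2;Classification=DNA/Helitron\n"
    "chr1\tEDTA\thelitron\t2000\t2100\t.\t-\t.\tID=c;Name=TE_7;Classification=DNA/Helitron\n"
)
split_ltr, absent = classify_missing(gff_names(example_gff), library_names(example_lib))
print("only as _INT/_LTR:", sorted(split_ltr))
print("not in library:", sorted(absent))
```

With real files, the text of panEDTA.TElib and TEanno.gff3 would be read from disk and passed to the same functions.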
Lastly, how would you count the copy number of each TE family? I computed the ratio between the length of each region in the gff3 and the length of the corresponding sequence in panEDTA.TElib, and it varies widely. Here are the quantiles of the ratio:
> quantile(df$lengthABOVETE.fam.len,na.rm =TRUE,probs=seq(0,1,0.1))
0%           10%          20%          30%          40%          50%
0.005845817  0.080485612  0.116917626  0.162465915  0.221638655  0.288018433
60%          70%          80%          90%          100%
0.376657825  0.494324624  0.678725237  0.937500000  73.812785388
I doubt that these extremely short or long regions are really transposons, and I am not sure whether it is a good idea to include them in an analysis of the evolution of individual TE families (e.g., copy number dynamics). Do you have any suggestions?
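To make the question concrete, here is a minimal sketch of one possible way to count copies per family, both with and without a length filter. It is not EDTA's method. The function names are hypothetical, the `min_ratio` and `max_ratio` values are arbitrary assumptions, and it again assumes `>name#class` library headers and a `Name=` attribute in the gff3. The example records are made up.

```python
from collections import defaultdict

def fasta_lengths(fasta_text):
    """Return {family name: sequence length}, dropping the '#class' suffix."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0].split("#")[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def copy_counts(gff_text, lengths, min_ratio=0.8, max_ratio=1.25):
    """Count gff3 features per family: all of them, and those whose length
    relative to the library sequence lies in [min_ratio, max_ratio].
    Both thresholds are arbitrary assumptions."""
    counts = defaultdict(lambda: {"all": 0, "near_full": 0})
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name")
        if not lengths.get(name):
            continue  # family not in the library (or empty sequence)
        # gff3 coordinates are 1-based and inclusive
        ratio = (int(cols[4]) - int(cols[3]) + 1) / lengths[name]
        counts[name]["all"] += 1
        if min_ratio <= ratio <= max_ratio:
            counts[name]["near_full"] += 1
    return dict(counts)

# Made-up example: one 10 bp family with three annotated regions.
example_lib = ">panTE_2#DNA/Helitron\nACGTACGTAC\n"
example_gff = (
    "chr1\tEDTA\thelitron\t1\t10\t.\t+\t.\tID=a;Name=panTE_2\n"   # ratio 1.0
    "chr1\tEDTA\thelitron\t21\t23\t.\t+\t.\tID=b;Name=panTE_2\n"  # ratio 0.3
    "chr1\tEDTA\thelitron\t31\t130\t.\t+\t.\tID=c;Name=panTE_2\n" # ratio 10.0
)
result = copy_counts(example_gff, fasta_lengths(example_lib))
print(result)
```

Reporting both counts side by side would show how much a given threshold changes the copy number of each family.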
Sincerely,
Cong