
TE_XXX in gff3 from panEDTA

Open CongLiu37 opened this issue 1 month ago • 0 comments

Hello,

I am using EDTA+panEDTA to annotate the genomes of 40 related species. I annotated each genome individually with EDTA v2.2.0 and generated a panEDTA library. Then, for each genome, I ran

RepeatMasker -e ncbi -pa 40 -q -div 40 -lib ${panEDTA.TElib} -cutoff 225 -gff ${genome}.mod.panEDTA > /dev/null
perl -i -nle 's/\s+DNA\s+/\tDNA\/unknown\t/; print $_' ${genome}.mod.panEDTA.out
EDTA.pl --genome ${genome} -t 40 --step final --anno 1 --curatedlib ${panEDTA.TElib} --cds ${cds} --rmout ${genome}.mod.panEDTA.out

These commands are copied from panEDTA.sh so that I can run the genomes in parallel.

As I understand it, each sequence in the panEDTA TE library represents one TE family, and I am trying to extract the genomic sequences belonging to each family. While doing so, I found some unexpected Name values in the attributes field of TEanno.gff3: (1) Some panTE_XXX names appear in the gff3 but not in panEDTA.TElib; the library instead contains panTE_XXX_INT and panTE_XXX_LTR. (2) Some TE_XXX names appear in the gff3 but not in panEDTA.TElib.
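
The comparison between gff3 names and library IDs can be sketched roughly as follows. This is only a minimal illustration, not the exact script used here. It assumes that library FASTA headers look like ">ID#Classification" and that the gff3 attributes column contains a "Name=ID" entry; the function name names_missing_from_lib and the file arguments are hypothetical.

```shell
# Hypothetical helper: print each gff3 Name value (once) that has no
# identically named sequence in the TE library.
# Assumptions: FASTA headers look like ">ID#Class/Superfamily";
# gff3 column 9 contains "Name=ID".
names_missing_from_lib() {
    lib=$1; gff=$2
    awk -F'\t' '
        NR == FNR {                        # first file: the library FASTA
            if (/^>/) {
                id = substr($0, 2)         # drop the leading ">"
                sub(/[# \t].*/, "", id)    # keep only the ID before "#" or whitespace
                seen[id] = 1
            }
            next
        }
        /^#/ { next }                      # skip gff3 comment and directive lines
        match($9, /(^|;)Name=[^;]+/) {
            name = substr($9, RSTART, RLENGTH)
            sub(/^;?Name=/, "", name)
            if (!(name in seen) && !(name in reported)) {
                reported[name] = 1
                print name
            }
        }
    ' "$lib" "$gff"
}
```

Called as names_missing_from_lib panEDTA.TElib.fa genome.TEanno.gff3 (both file names are placeholders), it would list names such as panTE_XXX when the library only holds panTE_XXX_INT and panTE_XXX_LTR.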

Lastly, how would you count the copy number of each TE family? I computed the ratio between the length of each region in the gff3 and the length of the corresponding sequence in panEDTA.TElib, and it varies widely. Here are the quantiles of the ratio:

> quantile(df$lengthABOVETE.fam.len,na.rm =TRUE,probs=seq(0,1,0.1))
          0%          10%          20%          30%          40%          50% 
 0.005845817  0.080485612  0.116917626  0.162465915  0.221638655  0.288018433 
         60%          70%          80%          90%         100% 
 0.376657825  0.494324624  0.678725237  0.937500000 73.812785388 
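
For reference, a length ratio of this kind can be computed along the following lines. This is a sketch under assumptions rather than the exact procedure behind the numbers above: it assumes library headers of the form ">ID#Classification", a "Name=ID" entry in gff3 column 9, and 1-based inclusive gff3 coordinates. The function name feature_to_family_ratio is hypothetical. Features whose Name has no identically named library sequence are skipped.

```shell
# Hypothetical helper: for every gff3 feature whose Name matches a library ID,
# print Name, feature length, library sequence length, and feature/library ratio.
feature_to_family_ratio() {
    lib=$1; gff=$2
    awk -F'\t' '
        NR == FNR {                        # first file: the library FASTA
            if (/^>/) {
                id = substr($0, 2)
                sub(/[# \t].*/, "", id)    # keep only the ID before "#" or whitespace
            } else if (id != "") {
                len[id] += length($0)      # sum sequence lines per ID
            }
            next
        }
        /^#/ { next }                      # skip gff3 comment and directive lines
        match($9, /(^|;)Name=[^;]+/) {
            name = substr($9, RSTART, RLENGTH)
            sub(/^;?Name=/, "", name)
            if ((name in len) && len[name] > 0) {
                flen = $5 - $4 + 1         # gff3 coordinates are 1-based, inclusive
                printf "%s\t%d\t%d\t%.4f\n", name, flen, len[name], flen / len[name]
            }
        }
    ' "$lib" "$gff"
}
```

The fourth output column is the per-feature ratio, which can then be summarized per family or passed to quantile() in R.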

I doubt that these extremely short or long regions are really transposons, and I am not sure whether it is a good idea to include them in an analysis of the evolution of individual TE families (e.g. copy number dynamics). Do you have any suggestions?

Sincerely,

Cong

CongLiu37, May 09 '24 09:05