EDTA
Structural TEs appear not to be renamed by panEDTA
Hi Shujun,
Another question. I'm running panEDTA on a bunch of species. It says it's successfully reannotating the structurally annotated TEs, e.g.:
Sat Oct 14 15:10:05 EDT 2023 EDTA final stage finished! You may check out:
    The final EDTA TE library: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa
    Family names of intact TEs have been updated by genome_list_local.txt.panEDTA.TElib.fa: Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3
    Comparing to the provided library, EDTA found these novel TEs: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.novel.fa
    The provided library has been incorporated into the final library: Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa
In the output, both Cviolacea_585_v2.0.fa.mod.EDTA.TElib.fa and genome_list_local.txt.panEDTA.TElib.fa include numerous sequences headed "panTE," but in Cviolacea_585_v2.0.fa.mod.EDTA.intact.gff3 no TEs are annotated with the heading "panTE." Similarly, if I filter Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3 for method=structural, no TEs are annotated as "panTE."
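For anyone who wants to reproduce the check described above, here is a minimal Python sketch that counts how many structurally annotated features carry a "panTE" name. It is not part of EDTA. It assumes the ninth GFF3 column holds `key=value` pairs separated by `;`, that the feature name is stored under `Name`, and that the method is stored under `method` (keys are compared case-insensitively). The function name `count_structural_names` and the sample records are made up for illustration.

```python
from collections import Counter


def count_structural_names(gff_lines, prefix="panTE"):
    """Count structural features whose Name does or does not start with `prefix`.

    Assumption: attributes look like "ID=...;Name=...;method=structural".
    """
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines and GFF3 comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        # Parse column 9 into a dict with lowercased keys.
        attrs = {
            key.strip().lower(): value
            for key, value in (
                pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
            )
        }
        if attrs.get("method", "").lower() != "structural":
            continue
        name = attrs.get("name", "")
        counts["renamed" if name.startswith(prefix) else "not_renamed"] += 1
    return counts


# Hypothetical records, not taken from a real EDTA run.
sample = [
    "##gff-version 3",
    "Chr1\tEDTA\tCopia_LTR_retrotransposon\t100\t5000\t.\t+\t.\t"
    "ID=te1;Name=TE_00000012_LTR;method=structural",
    "Chr1\tEDTA\tGypsy_LTR_retrotransposon\t9000\t15000\t.\t-\t.\t"
    "ID=te2;Name=panTE_00001962_INT;method=structural",
    "Chr2\tEDTA\tCACTA_TIR_transposon\t200\t900\t.\t+\t.\t"
    "ID=te3;Name=panTE_00001959;method=homology",
]
result = count_structural_names(sample)
print(dict(result))
```

To run it on a real file, pass an open file handle instead of the list, e.g. `count_structural_names(open("Cviolacea_585_v2.0.fa.mod.EDTA.TEanno.gff3"))`. A `renamed` count of zero would match the behaviour reported here.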
The error log repeats this message many times for each genome:
Unspecified/NA not found in the TE_SO database, it will not be used to rename sequences in the final annotation.
I assume this is where the problem is coming from?
This seems to have happened to all the genomes I included, and appears to be just a problem with updating the names. What information would help you solve this?
Thanks!
Joanna
Hi again,
Any sense of what's going on here?
Thanks!
Joanna
Hello again,
Just wondering whether you had any thoughts about what might be wrong here. I tried running just the reannotation command separately, but it didn't solve the problem. The command was:
perl $path/EDTA.pl --genome $genome -t $threads --step final --anno 1 --curatedlib $genome_list.panEDTA.TElib.fa --cds $cds_ind --rmout $genome.mod.panEDTA.out done < $genome_list
Thanks,
Joanna
Hi Joanna,
Sorry about the long delay. I haven't had the chance to investigate this issue yet, but it's on my to-do list.
Thank you! Shujun
Hi Shujun,
Thanks! I know you're swamped and there's a long backlog. Sorry to pester and I look forward to hearing what the solution is when you have a chance to get to the bottom of it.
All the best,
Joanna
One thing I noticed is that there is no "panTE" naming in the pan-TE library; the entries were all named in the regular format, i.e., TE_000xxxxx. TEs in $genome_list.panEDTA.TElib.fa were used to rename structural TEs, but not all of them: single-copy TEs are named by their coordinates (Chrx:xxx..xxx), and multi-copy TEs without enough full-length copies in the genome keep the TE_000xxxxx name. The latter is a problem for the pan-TE library because those names share its format. I checked several dozen genomes, and the fraction of the genome in this second category is around 0.5% to 2.8%. I think this is an acceptable level, but I will need to rename these TEs so that they are distinguishable from the pan-TE library entries.
Shujun
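The three naming categories mentioned above can be told apart with a few regular expressions. The Python sketch below is only an illustration, not EDTA code. The patterns are inferred from the names quoted in this thread (`panTE_…`, `TE_000xxxxx`, `Chrx:xxx..xxx`), the labels and the function name `classify_te_name` are hypothetical, and the non-panTE example names are invented.

```python
import re

# Patterns inferred from the names quoted in this thread (assumptions).
PATTERNS = [
    ("pan_library", re.compile(r"^panTE_\d+")),         # e.g. panTE_00001958_LTR
    ("generic_family", re.compile(r"^TE_\d+")),         # e.g. TE_000xxxxx
    ("coordinate", re.compile(r"^[^:\s]+:\d+\.\.\d+")),  # e.g. Chrx:xxx..xxx
]


def classify_te_name(name):
    """Return a label for a TE name, ignoring any '#Class/Superfamily' suffix."""
    base = name.split("#", 1)[0]
    for label, pattern in PATTERNS:
        if pattern.match(base):
            return label
    return "other"


# The first name is from this thread; the others are made up for illustration.
examples = [
    "panTE_00001958_LTR#LTR/Copia",
    "TE_00000123_INT#LTR/Gypsy",
    "Chr1:1000..2000#DNA/DTM",
    "Copia-12_ZM",
]
labels = [classify_te_name(n) for n in examples]
for n, label in zip(examples, labels):
    print(f"{n}\t{label}")
```

Tallying these labels over the names in a final annotation would show how many entries fall into the TE_000xxxxx category that shares its format with the pan-TE library.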
Thanks for keeping me updated!
I'm not entirely sure I follow. In my most recent run they do seem to be renamed panTE in genome_list_local.txt.panEDTA.TElib.fa (sample from "grep '>'" below) and are in the species-specific gffs from RepeatMasker, but the new names don't make it all the way into the final [genome].fa.mod.EDTA.[intact|TEanno].gff3. But it sounds like the problem is that in each individual genome, they're being renamed in a way where panEDTA is getting them confused between genomes?
Let me know if you need anything from me!
All the best,
Joanna
panTE_00001958_LTR#LTR/Copia
panTE_00001959#DNA/DTM
panTE_00001960_LTR#LTR/Copia
panTE_00001961_LTR#LTR/Copia
panTE_00001962_INT#LTR/Gypsy
panTE_00001963_INT#LTR/Copia
panTE_00001964_LTR#LTR/Gypsy
panTE_00001965_LTR#LTR/Copia
panTE_00001966_LTR#LTR/unknown