EDTA
panEDTA path specification
Hi Shujun,
Thanks as always for EDTA!
I think this is a silly question - I'm having trouble figuring out how to specify the paths for the genomes I'm putting into panEDTA so the existing annotations will be found rather than rerun. I've tried both putting the original genome paths in the genomes list file and copying the TE annotations to the directory where I'm running panEDTA but I don't think I did it right.
Thanks!
Joanna
Hi Joanna,
I recently updated panEDTA; please try out this new version and let me know. Simply unzip and replace the original panEDTA.sh script in the EDTA folder. The help info of the script is also updated. panEDTA.sh.zip
Thanks! Shujun
Hi Shujun,
Thanks! I have the new version running, and I'm getting the following error:
ERROR: Raw LTR results not found in Crubella_474_v1_names_shortened.1.cds.fa.mod.EDTA.raw/Crubella_474_v1_names_shortened.1.cds.fa.mod.LTR.raw.fa
If you believe the program is working properly, this may be caused by the lack of intact LTRs in your genome. Consider to use the --force 1 parameter to overwrite this check
ERROR: Initial EDTA failed for Crubella_474_v1_names_shortened.1.cds.fa
It seems to fail on annotating one of the CDS files, which it wasn't doing before. Is this expected? I've attached my genome paths file - this cds also created a bunch of line length warnings in the updated but not new version so I'm wondering if there's something up with spacers or EOL encoding in my input.
Thanks!
Joanna
Hi Joanna,
EDTA is not designed to annotate TEs in CDS files. You need to provide the whole genome to the program, and you may use the CDS file to facilitate the removal of genes in the TE annotation.
Thanks, Shujun
Hi Shujun,
Thanks!
I know it's not supposed to be trying to annotate the TEs in the CDS - I don't understand why it is trying to do so. When I ran it before, it correctly interpreted the CDS paths in the second column as facilitating gene removal, but now the error messages suggest the CDS files are being read in as genomes to annotate instead.
Cheers,
Joanna
— Reply to this email directly, view it on GitHub https://github.com/oushujun/EDTA/issues/375#issuecomment-1670212018, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFPL6CR5767M5GO7ZVO3SI3XUKJQDANCNFSM6AAAAAA26PSLZQ . You are receiving this because you authored the thread.Message ID: @.***>
-- Joanna Rifkin PhD they/them Postdoctoral fellow at the University of Michigan
Hi again!
I tried rerunning it with the genomes + CDS list in a different order and got this error:
Wed Aug 9 15:24:54 EDT 2023 ERROR: Fail to convert seq IDs to <= 13 characters! Please provide a genome with shorter seq IDs.
ERROR: Initial EDTA failed for Cviolacea_585_v2.1.cds_primaryTranscriptOnly.fa
So it definitely appears to be trying to run EDTA on the CDS files rather than recognizing them as CDS files with sequence to exclude.
Cheers,
Joanna
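For anyone hitting the same seq ID error, here is a minimal Python sketch that lists FASTA headers longer than the 13-character limit quoted in the message above. The function name `long_seq_ids` is hypothetical, and measuring the full header line (rather than only its first word) is an assumption; the thread does not say exactly how EDTA measures the ID.

```python
from typing import Iterable, List

MAX_ID_LEN = 13  # limit quoted in the EDTA error message


def long_seq_ids(fasta_lines: Iterable[str], max_len: int = MAX_ID_LEN) -> List[str]:
    """Return FASTA headers (text after '>') longer than max_len characters.

    Assumption: the whole header line counts as the sequence ID.
    """
    too_long = []
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id = line[1:].strip()
            if len(seq_id) > max_len:
                too_long.append(seq_id)
    return too_long


if __name__ == "__main__":
    example = [">Chr1", "ACGT", ">Chr2 long free-text description", "ACGT"]
    for header in long_seq_ids(example):
        print(f"{len(header)} chars: {header}")
```

To check a real file, pass an open file handle, for example `long_seq_ids(open("genome.fa"))`, where `genome.fa` is a placeholder name.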
Hi Joanna,
It appears to be a bug, and sorry for the issue. I have a colleague currently testing this. For the moment, if you can use CDS files of closely related species to replace the genomes without CDS files, and make two complete columns of the genome list file, it should be able to bypass.
Thanks, Shujun
Hi Shujun,
Gotcha. So instead of:
genome
genome cds
genome cds
I should have
genome [related cds]
genome cds
genome cds
Right?
Would it be better to just fill in the Arabidopsis CDS for all the species without their own CDS, or try to find something closer? (They're all in the Brassicaceae.)
Thanks!
Joanna
Yes, that's correct. Try something closer but with good gene annotation quality, otherwise, Arabidopsis works perfectly fine.
Shujun
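The workaround discussed above depends on every row of the genome list having both a genome and a CDS column. A minimal Python sketch for spotting rows that are missing the second column follows; the function name `incomplete_rows` is hypothetical, and treating the columns as whitespace-separated is an assumption based on the lists shown in this thread.

```python
from typing import Iterable, List, Tuple


def incomplete_rows(list_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) for rows that lack a second (CDS) column.

    Assumption: columns are separated by whitespace.
    """
    problems = []
    for number, raw in enumerate(list_lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if len(line.split()) < 2:
            problems.append((number, line))
    return problems


if __name__ == "__main__":
    rows = ["genomeA.fa", "genomeB.fa genomeB.cds.fa", "genomeC.fa genomeC.cds.fa"]
    for number, line in incomplete_rows(rows):
        print(f"line {number} has no CDS column: {line}")
```

Rows flagged by this check are the ones that would need a CDS file from a related species filled in.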
Excellent, thanks!
Hi again!
I tried just using the Arabidopsis CDS and seem to get the same issue:
Error: Error while loading sequence perl make_bed_with_intact.pl EDTA.intact.fa > EDTA.intact.bed
Wed Aug 9 17:47:28 EDT 2023 Warning: The Helitron result file has 0 bp!
Wed Aug 9 17:47:28 EDT 2023 Execution of EDTA_raw.pl is finished!
ERROR: Raw LTR results not found in Araport11_cds_20220914.mod.EDTA.raw/Araport11_cds_20220914.mod.LTR.raw.fa
If you believe the program is working properly, this may be caused by the lack of intact LTRs in your genome. Consider to use the --force 1 parameter to overwrite this check
ERROR: Initial EDTA failed for Araport11_cds_20220914
I can focus on some other projects until your colleague has tracked down the bug. Let me know if any of my full logs or commands will be helpful for the troubleshooting process!
Cheers,
Joanna
Hi Joanna,
The easiest "fix" is to run the code with bash, not sh, zsh or other shell variants: bash panEDTA.sh ...
Let me know if you still have trouble running it. I will also try to update the code and make it more adaptive to shell variants.
Thanks! Shujun
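If the job is launched from a wrapper script rather than typed at a prompt, the same advice can be applied by naming bash explicitly in the command. A minimal Python sketch follows; the helper names `panedta_command` and `launch` are hypothetical, the file names are placeholders, and only the `-g`, `-c` and `-t` options that appear in this thread are used.

```python
import shlex
import subprocess
from typing import List


def panedta_command(script: str, genome_list: str, cds: str, threads: int) -> List[str]:
    """Build an argv that runs the script under bash, whatever the calling shell is."""
    return ["bash", script, "-g", genome_list, "-c", cds, "-t", str(threads)]


def launch(cmd: List[str]) -> None:
    """Run the command; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = panedta_command("panEDTA.sh", "genome_list.txt", "cds.fa", 16)
    # Print the command for inspection; call launch(cmd) to actually start the run.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids going through an intermediate shell, so the interpreter is always the `bash` named first in the list.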
Hi Shujun,
Thanks, I'll give that a try.
Cheers,
Joanna
Hi Shujun,
I've tried submitting the job with "bash" rather than "sh" but it doesn't seem to change the main problem, where it's trying to annotate the CDS file. Here's some output from the log file:
Fri Aug 18 11:47:52 EDT 2023
Pan-genome Extensive de-novo TE Annotator (panEDTA)
Output directory: /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/panEDTA/comparative_brassicaceae_run_8-7-2023
Genome files: genome_list_paths_updated.txt
Coding sequences: ../Araport11_cds_20220914
Curated library:
Copy number cutoff: 3
CPUs: 16
Fri Aug 18 11:47:52 EDT 2023
De novo annotate genome Araport11_cds_20220914 with EDTA
############################################################# Extensive de-novo TE Annotator (EDTA) v2.1.3 ######### Shujun Ou @.** @.**>) ############################################################
Fri Aug 18 11:47:55 EDT 2023 Dependency checking: All passed!
Fri Aug 18 11:48:01 EDT 2023 The longest sequence ID in the genome contains 378 characters, which is longer than the limit (13) Trying to reformat seq IDs... Attempt 1...
Fri Aug 18 11:48:02 EDT 2023 Seq ID conversion successful! A CDS file Araport11_cds_20220914 is provided via --cds. Please make sure this is the DNA sequence of coding regions only.
This is the genome list:
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Euclidium_genome/GCA_900116095.1_Euclidium_syriacum.MPIPZ.v1_genomic.fna ../Araport11_cds_20220914
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/C_violacea/Cviolacea_585_v2.0.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/C_violacea/Cviolacea_585_v2.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Crubella/Crubella_474_v1.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Crubella/Crubella_474_v1_names_shortened.1.cds.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Brapa/BrapaFPsc_277_v1.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Brapa/BrapaFPsc_277_v1.3.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Alyrata/Alyrata_384_v1.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Alyrata/Alyrata_384_v2.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/M_pygmaea/M_pygmaea_names_fixed.genome.fasta ../Araport11_cds_20220914
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Aalpina/Arabis_alpina.MPIPZ.version_5.1.chr.all.fasta ../Araport11_cds_20220914
/nfs/turbo/rsbaucom/lab/Hesperis_Dovetail/Hi-Rise_Assembly_September_2022/EDTA_TE_annotation/Hesperis_assembly.fasta /nfs/turbo/rsbaucom/lab/Hesperis_Dovetail/Hi-Rise_Assembly_September_2022/BRAKER3_gene_annotation/RNA_protein/braker/braker.codingseq
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dsophioides/Dsophioides_482_v1_short_names.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dsophioides/Dsophioides_482_v1.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dstrictus/Dstrictus_582_v2.0.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Dstrictus/Dstrictus_582_v2.1.cds_primaryTranscriptOnly.fa
/nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Mperfoliatum/Mperfoliatum_583_v2.0.fa /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Mperfoliatum/Mperfoliatum_583_v2.1.cds_primaryTranscriptOnly.fa
As suggested, each genome has the genome first and either its own CDS or the Arabidopsis CDS on the same line.
Here's the command I ran:
source /home/jlrifkin/setup_conda.sh
conda activate EDTA
bash /nfs/turbo/rsbaucom/lab/SOFTWARE/EDTA/panEDTA.sh -g genome_list_paths_updated.txt -c ../Araport11_cds_20220914 -t 16 -f 3
I tried removing the -c option, but that just throws an error (Failed to parse command line / line 105: [: !=: unary operator expected Option cds requires an argument ERROR: Initial EDTA failed for Araport11_cds_20220914). It really seems like it's trying to annotate the CDS rather than using it as CDS.
Thanks!
Joanna
Hi Joanna,
Sorry for the delay. This version passes tests on my end with bash, please try it out and let me know. Let me know if you have any suggestions. Thank you!
Shujun
Hi Shujun,
Good news! It's not trying to annotate TEs in CDS any more.
It doesn't seem to be able to locate a CDS file whose path is outside the directory it's being run in, so I put local symlinks to every CDS. Not a problem really, but perhaps something to know about. If I used the actual, functioning path, it said the CDS didn't exist, but if I made a symlink in the directory I'm running it in, it was fine.
I'll keep you updated!
Thanks for helping troubleshoot!
All the best,
Joanna
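The symlink workaround described above can be scripted when there are many CDS files. A minimal Python sketch follows; the function name `link_into` and the file names in the demo are hypothetical, and the sketch only creates links named after each target file inside a chosen run directory.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def link_into(workdir: Path, targets: Iterable[Path]) -> List[Path]:
    """Create a symlink in workdir for each target, named after the target file."""
    workdir.mkdir(parents=True, exist_ok=True)
    links = []
    for target in targets:
        link = workdir / target.name
        if link.is_symlink() or link.exists():
            continue  # leave existing entries untouched
        link.symlink_to(target.resolve())  # absolute target, valid from any directory
        links.append(link)
    return links


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "data" / "species.cds.fa"
        source.parent.mkdir()
        source.write_text(">gene1\nACGT\n")
        for link in link_into(Path(tmp) / "run", [source]):
            print(f"{link.name} -> {link.resolve()}")
```

The genome list would then refer to the link names in the run directory instead of the original paths.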
Attachment from Shujun's message above: panEDTA.sh.txt https://github.com/oushujun/EDTA/files/12392139/panEDTA.sh.txt
Hi again,
I think this is still possibly an issue: when I try to run panEDTA with symlinks to completed EDTA runs in other directories, it makes this non-functioning symlink for every genome:
Crubella_474_v1.fa.mod.EDTA.TElib.novel.fa -> /nfs/turbo/rsbaucom/lab/Comparative_Brassicaceae_TEs/Crubella/Crubella_474_v1.fa.mod.EDTA.TElib.novel.fa
It apparently completes, but it doesn't actually reannotate the genomes with the panEDTA library; it reuses the same TEs as before and prints the following error many times in the error log:
grep: BrapaFPsc_277_v1.fa.mod.EDTA.TElib.novel.fa: No such file or directory
If I run the same set of genomes locally from scratch, I don't have this problem. I haven't tried copying the previous runs into the same directory where I'm running panEDTA.
I'm just rerunning it with all the genomes in the same place, but was hoping to avoid doing that to save time (and because one of the genomes is large and highly repetitive).
Let me know if I can help debug this with any additional data.
Thanks!
Joanna
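The grep errors above are consistent with symlinks whose targets do not exist. A minimal Python sketch for listing such links in a run directory follows; the function name `dangling_symlinks` and the file names in the demo are hypothetical, and this is only a diagnostic aid, not part of panEDTA.

```python
import tempfile
from pathlib import Path
from typing import List


def dangling_symlinks(directory: Path) -> List[Path]:
    """Return symlinks in directory whose targets do not exist."""
    return sorted(
        entry
        for entry in directory.iterdir()
        # exists() follows the link, so it is False when the target is missing
        if entry.is_symlink() and not entry.exists()
    )


if __name__ == "__main__":
    # Demo in a throwaway directory: one valid link and one broken link.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        real = run_dir / "real.fa"
        real.write_text(">seq\nACGT\n")
        (run_dir / "good_link.fa").symlink_to(real)
        (run_dir / "broken_link.fa").symlink_to(run_dir / "missing.fa")
        for link in dangling_symlinks(run_dir):
            print(f"broken: {link.name}")
```

Running it on the panEDTA working directory before and after a run would show whether the TElib.novel.fa links point at files that are actually present.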
Hi Joanna,
I finally got this updated. Can you please update panEDTA (or the entire EDTA repo) and try the symlinks again? I tested locally and now it works with either the sh or bash way of running it.
Shujun
Hi Shujun,
Thanks! I'm just running panEDTA and doing everything de novo in sequence, and that seems to be working fine. I figured since everything needed to be redone with EDTA2 anyway I could just do it the slow way. But I'll try the updated version next time and I'm excited about the efficiency!
Cheers,
Joanna
-- Joanna Rifkin PhD they/them Computational biologist
If the issue is resolved, I will close this thread. Please reopen or open a new thread if you have different issues.
Thank you for your patience! Shujun
Great, thank you!