Annotating draft genome with thousands of sequences

Open bvs opened this issue 1 year ago • 7 comments

Dear Shujun Ou,

Thank you for developing this excellent tool; I have been successfully using it for a couple of years.

Currently, I am annotating draft genome assemblies (scaffold-level) of various plant species. Several of my assemblies contain >300K scaffolds. So far, LTR/Helitron annotation has completed successfully, but I am stuck on TIR annotation using the divide-and-conquer approach as explained. Even with plenty of memory (50 GB/thread) and more than 21 days of runtime, TIR-Learner is still running. To speed up TIR annotation, I found the solution you mentioned at https://github.com/oushujun/EDTA/issues/175 (splitting the genome into multiple portions, running TIR annotation on each portion, and then combining the results). It looks like a promising way to fast-track the TIR annotation. It would be a great help if you could provide more details about this approach, along with any known disadvantages, so that I can implement it.

Best regards, Suresh

bvs avatar Jul 05 '22 21:07 bvs

Hi Suresh,

You may need to experiment with it a little. The idea is to split the genome by number of sequences (e.g., 1000 sequences per file) and run EDTA on these files independently. Since your LTR and Helitron steps are complete, you can run just the TIR step on the split files. When they are done, find the TIR result files in each /*EDTA.raw directory, concatenate them into a single TIR result, name it following the EDTA format, and place it in the whole-genome /*EDTA.raw directory. EDTA should recognize the existing TIR result if the --overwrite 0 parameter is used.
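
For readers following along, the split and concatenate steps might be sketched in Python as below. This is only a sketch under stated assumptions, not part of EDTA: the function names `split_fasta` and `concatenate` and the `chunk_0001.fa` naming are hypothetical, and the actual TIR result file names inside each /*EDTA.raw directory must be taken from your own runs.

```python
from pathlib import Path


def split_fasta(genome_path, out_dir, seqs_per_file=1000):
    """Write consecutive FASTA records into numbered chunk files.

    Hypothetical helper (not an EDTA script); chunk names are illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    n_records = 0
    with open(genome_path) as genome:
        for line in genome:
            if line.startswith(">"):
                # Open a new chunk file every `seqs_per_file` records.
                if n_records % seqs_per_file == 0:
                    if out is not None:
                        out.close()
                    chunk = out_dir / f"chunk_{len(chunk_paths) + 1:04d}.fa"
                    chunk_paths.append(chunk)
                    out = open(chunk, "w")
                n_records += 1
            if out is not None:  # ignores any text before the first header
                out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


def concatenate(parts, merged_path):
    """Concatenate per-chunk result files into one file, in the given order."""
    with open(merged_path, "w") as merged:
        for part in parts:
            text = Path(part).read_text()
            merged.write(text)
            # Guard against a missing final newline gluing two records together.
            if text and not text.endswith("\n"):
                merged.write("\n")
    return Path(merged_path)
```

`split_fasta` returns the chunk paths in order; after running the TIR step on each chunk, `concatenate` can be pointed at the per-chunk TIR result files, with the merged file named and placed as described above.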

Please let me know how it goes!

Best, Shujun

oushujun avatar Jul 05 '22 22:07 oushujun

Dear Shujun,

Thank you for your quick response. I will follow this strategy. I have EDTA results for a genome that has about 20K scaffolds, and I want to test whether I get a similar or identical TIR annotation if I run EDTA_raw with TIR only on portions of the genome. Currently, I am running EDTA_raw with TIR only on each portion of the genome (1000 sequences/file). I will update you once I have the results, along with a comparison against the EDTA results from the single genome file.

Best regards, Suresh

bvs avatar Jul 06 '22 05:07 bvs

Hi Suresh,

Thank you for helping to test this out. You may not need much memory, since this is not a memory issue: TIR stalls when the number of scaffolds reaches the thousands. Good luck!

Best, Shujun

oushujun avatar Jul 06 '22 13:07 oushujun

Hi Shujun,

EDTA_raw with TIR has completed successfully. The two strategies gave different results, summarized as follows:

  1. EDTA_raw on all portions of genome sequences (1000 scaffolds/file): 4454 TIRs
  2. EDTA_raw on single whole-genome file: 4232 TIRs
  3. Overlapping TIRs between 1 and 2 (based on FASTA ID): 4224

All non-overlapping TIRs are unique to their respective strategies. Are these differences expected? Will they cause any problems in subsequent stages of EDTA?
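
A comparison by FASTA ID like the one summarized above could be sketched as follows. This is an illustrative sketch, not the script actually used: `fasta_ids` and `compare_ids` are hypothetical names, the ID is assumed to be the first word after `>`, and IDs are assumed to be unique within each file. With the counts reported above, 4454 - 4224 = 230 TIRs would be unique to the split run and 4232 - 4224 = 8 to the whole-genome run.

```python
def fasta_ids(path):
    """Return the set of record IDs (first word after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip headers that are just a bare '>'
                    ids.add(fields[0])
    return ids


def compare_ids(split_ids, whole_ids):
    """Count IDs shared by both runs and IDs unique to each run."""
    return {
        "shared": len(split_ids & whole_ids),
        "split_only": len(split_ids - whole_ids),
        "whole_only": len(whole_ids - split_ids),
    }
```

Calling `compare_ids(fasta_ids(split_result), fasta_ids(whole_result))` on the two TIR FASTA files gives the three counts in one dictionary; the two path arguments are placeholders for your own result files.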

I also observed that TIR-Learner spent far less time in the 1st strategy than in the 2nd.

Best regards, Suresh

bvs avatar Jul 06 '22 15:07 bvs

Hi Suresh,

Thanks for testing it out! The difference in numbers is likely due to the effectiveness of filtering. Each candidate has its flanking sequences checked for repetitiveness, so smaller files, which contain fewer sequences, are less likely to reveal repetitive flanking sequences. To alleviate this, you may want to split the genome into as few pieces as you can. It's a trade-off between getting noisier results and getting no results at all.

I would say give it a try. TIR is just the first step; EDTA will check the flanking sequences again and attempt to filter out as many false positives as possible, so you still have some assurance of good results.

Best, Shujun

oushujun avatar Jul 06 '22 16:07 oushujun

Dear Shujun, Thank you for your prompt response. I will follow this strategy for my remaining draft assemblies with >100K scaffolds. Meanwhile, I will run the remaining EDTA stages with this strategy to compare the final results against a standard EDTA run.

Best regards, Suresh

bvs avatar Jul 06 '22 16:07 bvs

Hello, any updates?

Isoris avatar Dec 01 '23 07:12 Isoris