LINES vs LTR content

Open mason-linscott opened this issue 2 years ago • 2 comments

Hi @oushujun,

I have been running the EDTA pipeline on several mollusc genomes to compare repeat content between species. My primary focus has been on examining how LTR content differs between previously published assemblies and a new de novo genome. I chose EDTA because its LTR detection is much faster than RepeatModeler's: for many of the large-genome species, RM with the -LTRStruct parameter fails to finish within a month on a 1 TB memory, 48-thread node.

Several of the species I have analyzed have published repeat content estimates (based on RepeatModeler2) that differ from what EDTA detects. In particular, LINE content appears nearly non-existent compared to previous estimates, and LTR content is much greater. However, none of those genomes were run with the -LTRStruct parameter of RepeatModeler2. Perhaps the LINE elements detected by RepeatModeler2 without the LTRStruct pipeline are nested within LTRs in these genomes (as you suggest here). Have you ever compared RepeatModeler2 results with and without the LTRStruct pipeline to EDTA results? Or is there another possible explanation for this discrepancy?

I think this issue is similar to issues #58 and #196, but my analysis was performed on the most recent version (v.1.9.9), which was supposed to address the non-detection of LINE elements. I have attached the RepeatMasker output of the TElib produced by EDTA for two of the genomes, along with the RepeatModeler2 results from the publications. The other assemblies I am comparing against either did not publish the raw RepeatMasker output or used an older version of RepeatModeler.

I would appreciate your thoughts on this and I wish you all the best, M.

candidula_edta_out.txt cepaea_edta_out.txt

candiula_rm_out.txt cepaea_rm_out.txt

mason-linscott, Nov 15 '21 15:11

Hi @mason-linscott,

You can find RepeatModeler2 benchmarks here: https://github.com/oushujun/EDTA/issues/39#issuecomment-611757382. You have found many of the good discussions in previous issues, and they still stand. I don't have a good "automated" solution for nonLTRs at the moment, and please don't rely solely on RepeatModeler2 if nonLTRs make up a large part of your genome, especially for non-plant species.

EDTA is a good start, but you may still need to do some manual curation for it to perform better. I would suggest you find the most abundant LINE elements in your genomes, curate them, and provide their sequences to EDTA via the --curatedlib option.
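
As a rough starting point for the "find the most abundant LINE elements" step, the sketch below ranks LINE families by total masked bases in a RepeatMasker `.out` file. It assumes the `.out` file comes from a run whose library already contains LINE models (for example, a RepeatModeler2 library) and that it uses the standard whitespace-separated column layout. The function name `rank_line_families` and the sample rows are made up for illustration. Overlapping hits are counted twice, so treat the totals as approximate.

```python
from collections import defaultdict

# Hypothetical sample rows in RepeatMasker .out layout (illustrative data only).
SAMPLE = """\
   SW   perc perc perc  query     position in query    matching  repeat       position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin end (left)  ID

 1200   10.0  1.0  0.5  scaf1     101   600   (9400) + rnd-1_family-5  LINE/RTE-BovB  1 500 (0)  1
  900   12.0  0.0  0.0  scaf1     1001  1300  (8700) C rnd-1_family-9  LTR/Gypsy  (0) 300 1  2
 1500    8.0  0.0  0.0  scaf2     1     1000  (0)    + rnd-1_family-5  LINE/RTE-BovB  1 1000 (0)  3
  700   15.0  2.0  1.0  scaf2     2001  2200  (300)  + rnd-2_family-1  LINE/L2  1 200 (0)  4
"""


def rank_line_families(out_lines, top_n=10):
    """Return the top_n LINE families by summed masked length (bp)."""
    totals = defaultdict(int)
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer score; this skips headers and blank lines.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        family, repeat_class = fields[9], fields[10]
        if repeat_class.startswith("LINE"):
            totals[(family, repeat_class)] += end - begin + 1
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for (family, repeat_class), bp in rank_line_families(SAMPLE.splitlines()):
        print(f"{family}\t{repeat_class}\t{bp}")
```

The top-ranked families are only candidates: their consensus sequences would still need manual curation before being collected into the FASTA file passed to --curatedlib.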

Best, Shujun

oushujun, Nov 15 '21 17:11

Why not MGEScan-non-LTR?

amvarani, Dec 28 '21 23:12