Discrepancies between EDTA and RepeatMasker results, how to combine?

Open lx-1011 opened this issue 2 years ago • 12 comments

Dear @oushujun, thank you for developing this useful tool. I have run it successfully and found that SINEs and LINEs couldn't be identified based on structural features. However, we don't have a TE library with that much manual curation for pigs. We tried to combine the results of EDTA and RepeatMasker for a complete TE annotation, but the two gave different results:

  1. TEs make up nearly 40% of mammalian genomes[1]. EDTA identified 31.09% and RepeatMasker identified 37.31%. Is this difference of about 6 percentage points caused by the identification of SINEs and LINEs?
  2. Given the difference between the EDTA and RepeatMasker results, do you have a suggestion for how best to combine the two?
    ####Not about EDTA#######
  3. Most mammalian genomes are dominated by LINE and SINE retrotransposons, with more limited LTR retrotransposons and minimal DNA transposon accumulation[2]. However, using EDTA we didn't identify any SINEs in the pig genome, and only 3 LINEs. Do you have any idea why?

Thanks and wish you all the best Li Xin

[1] Isaac A Babarinde, Gang Ma, Yuhao Li, Boping Deng, Zhiwei Luo, Hao Liu, Mazid Md Abdul, Carl Ward, Minchun Chen, Xiuling Fu, Liyang Shi, Martha Duttlinger, Jiangping He, Li Sun, Wenjuan Li, Qiang Zhuang, Guoqing Tong, Jon Frampton, Jean-Baptiste Cazier, Jiekai Chen, Ralf Jauch, Miguel A Esteban, Andrew P Hutchins, Transposable element sequence fragments incorporated into coding and noncoding transcripts modulate the transcriptome of human pluripotent stem cells, Nucleic Acids Research, Volume 49, Issue 16, 20 September 2021, Pages 9132–9153, https://doi.org/10.1093/nar/gkab710

[2] Platt, R.N., Vandewege, M.W. & Ray, D.A. Mammalian transposable elements and their impacts on genome evolution. Chromosome Res 26, 25–43 (2018). https://doi.org/10.1007/s10577-017-9570-z

lx-1011 avatar Oct 25 '21 11:10 lx-1011

Dear Li Xin,

Sorry for the delayed response. If you compare TE categories side by side, you may find that many of them differ quite a lot. In my opinion, the major discrepancy comes from EDTA's failure to identify SINEs and LINEs, which may have inflated the TIR category (i.e., CACTA and Mutator).

It's a good sign that RepeatMasker can identify LINEs. A better way to combine the two is to find out which LINE sequences in Repbase were used for the annotation, obtain those library sequences from Repbase or elsewhere (e.g., NCBI), format their names into the RepeatMasker format (for an example, see EDTA/database/rice6.9.5.liban.nonLTR), and feed them to EDTA via --curatedlib. EDTA should then perform much better. If you know of any pig TEs, they don't have to be comprehensive; giving them to EDTA via --curatedlib is also a good idea.
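A minimal sketch of the name-formatting step, assuming every record in one input file shares a single classification such as `LINE/L1` and that the target header form is the RepeatMasker-style `>name#Class/Superfamily`. The function names are hypothetical and not part of EDTA.

```python
def to_repeatmasker_header(header_line: str, classification: str) -> str:
    """Turn a FASTA header into the form >name#Class/Superfamily."""
    tokens = header_line.lstrip(">").split()
    if not tokens:
        raise ValueError("FASTA header has no sequence name")
    # Keep only the first token and drop any classification already attached.
    name = tokens[0].split("#")[0]
    return f">{name}#{classification}"


def rename_fasta(lines, classification):
    """Yield FASTA lines with every header rewritten; blank lines are dropped."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            yield to_repeatmasker_header(line, classification)
        else:
            yield line


if __name__ == "__main__":
    example = [">12:100-900 match=RM", "ACGTACGT"]
    print("\n".join(rename_fasta(example, "LINE/L1")))
```

The renamed records, written to one file, are what would be passed to `--curatedlib`.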

Best, Shujun

oushujun avatar Nov 10 '21 00:11 oushujun

Dear @oushujun

Thanks for your response. I have tried to carry out your suggestion and still have some questions:

  1. I tried to get the LINE sequences through their positions in reference.fa (from the RepeatMasker result based on the Dfam database), but the names don't meet the required format. [image]

  2. I then formatted their names like those in rice6.9.5.liban.nonLTR, as chr:pos-end#LINE/L1 match=RM, and wrote them to LINE.fa. [image]

  3. I ran EDTA.pl again:
     perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib ../03.RepeatModeler_RepeatMasker/LINE/LINE.fa

###log file
2021-11-18 15:35:06,300 -INFO- Summary of classifications:
Order     Superfamily    # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR       Bel-Pao        13              0                     0            0
LTR       Copia          175             64                    13           0
LTR       Gypsy          152             111                   13           0
LTR       Retrovirus     16              0                     0            0
LTR       mixture        1               0                     0            0
DIRS      unknown        7               0                     0            0
LINE      unknown        1116            0                     0            0
TIR       MuDR_Mutator   2               0                     0            0
TIR       PIF_Harbinger  1               0                     0            0
TIR       PiggyBac       2               0                     0            0
TIR       Tc1_Mariner    11              0                     0            0
TIR       hAT            32              0                     0            0
Helitron  unknown        4               0                     0            0
Maverick  unknown        252             0                     0            0
2021-11-18 15:35:06,304 -INFO- Pipeline done.
2021-11-18 15:35:06,305 -INFO- cleaning the temporary directory ./tmp
Remove CDS-related sequences in the EDTA library:

Thu Nov 18 15:41:00 CST 2021 Combine the high-quality TE library LINE.fa with the EDTA library:

(EDTA) cche@sg04 15:51:57 ~/lixin/02_sus_pop/06annotation/02TE_annotation/tt_EDTA_LINE $

###No error was reported; the run was only interrupted. I have tried it twice, and the result is the same.

  4. I tried using the first four lines of rice6.9.5.liban.nonLTR as the curatedlib and ran again:
     perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib rice6.9.5.liban.nonLTR

####log file
2021-11-21 00:49:51,972 -INFO- Summary of classifications:
Order     Superfamily    # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR       Bel-Pao        13              0                     0            0
LTR       Copia          175             64                    13           0
LTR       Gypsy          152             111                   13           0
LTR       Retrovirus     16              0                     0            0
LTR       mixture        1               0                     0            0
DIRS      unknown        7               0                     0            0
LINE      unknown        1116            0                     0            0
TIR       MuDR_Mutator   2               0                     0            0
TIR       PIF_Harbinger  1               0                     0            0
TIR       PiggyBac       2               0                     0            0
TIR       Tc1_Mariner    11              0                     0            0
TIR       hAT            32              0                     0            0
Helitron  unknown        4               0                     0            0
Maverick  unknown        252             0                     0            0
2021-11-21 00:49:51,976 -INFO- Pipeline done.
2021-11-21 00:49:51,976 -INFO- cleaning the temporary directory ./tmp
Remove CDS-related sequences in the EDTA library:

Sun Nov 21 00:57:17 CST 2021 Combine the high-quality TE library rice6.9.5.liban.nonLTR with the EDTA library:

**Input file "Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa.mod.EDTA.TElib.fa.masked" not found!**

  5. That's a good idea! I'd appreciate your suggestions!

Thanks and wish you all the best Li Xin

lx-1011 avatar Nov 21 '21 05:11 lx-1011

Hi Li Xin,

You should select only the high-copy LINE annotations from the RepeatMasker output and generate non-redundant sequences from them. You may manually select the ones that you think are representative, or build a consensus to obtain one representative sequence for each family. Please DON'T give all RepeatMasker sequences to EDTA.
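One way to find the high-copy families is to count annotations per family in the RepeatMasker `.out` file. The sketch below assumes the standard whitespace-delimited `.out` layout, with the repeat name in column 10 and the class/family in column 11; the function names and the `min_copies` threshold of 100 are illustrative assumptions, not values recommended by EDTA.

```python
from collections import Counter


def count_family_copies(out_lines, classes=("LINE", "SINE")):
    """Count RepeatMasker annotations per (repeat name, class/family)."""
    counts = Counter()
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer SW score and have at least 14
        # columns; this skips the header lines and any blank lines.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        repeat_name, repeat_class = fields[9], fields[10]
        if repeat_class.split("/")[0] in classes:
            counts[(repeat_name, repeat_class)] += 1
    return counts


def high_copy_families(counts, min_copies=100):
    """Return families with at least min_copies annotations, most frequent first."""
    return [family for family, n in counts.most_common() if n >= min_copies]
```

Only the families returned here would then need a representative or consensus sequence, which keeps the curated library small.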

The name formatting looks good to me, but I don't understand what you mean by interruption. Please attach the full reports so that I can better judge what the issue may be.

Best, Shujun

oushujun avatar Nov 21 '21 18:11 oushujun

Hi @oushujun, thanks for your response. I have run the test file successfully following your suggestion, and I show the details below. I then ran the whole genome through EDTA with LINE_SINE.data.fa, which contains about 177246 sequences identified by RepeatMasker (173080 LINEs, 4166 SINEs). That run has been going from Nov 23 until now.

  1. That is nearly 56 days. I am not sure whether this is normal. Is there any way to shorten the run time?
  2. Annotation is running now, and I find only LINEs, no SINEs, in $.EDTA.TEanno.sum. Is that because the task has not completed? [image]

###01input [image]

###02LINE_SINE.data.fa [image]

###03current proceeding [image]

Test file in detail (LINE10.fa obtained from RepeatMasker):
#01 perl ~/lixin/software/EDTA/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chromosome.12.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --sensitive 1 --anno 1 --evaluate 1 --threads 20 --curatedlib LINE10.fa
#02 LINE10.fa (5 lines) [image]
#03 output (nearly 25h) [image] [image]

lx-1011 avatar Jan 18 '22 02:01 lx-1011

Hi,

Apparently, you are providing all SINE/LINE annotations to EDTA. You should not do that. Please provide only exemplary sequences (that is, non-redundant library sequences) to EDTA. Providing every annotation makes the run very slow (as you mentioned, 56 days), and the resulting annotation is just not right.

Shujun

oushujun avatar Jan 18 '22 23:01 oushujun

Hi @oushujun, thanks for your response.

  1. We filtered the RepeatMasker result (SW score 300, length 80, div 80). Overlapping regions accounted for 4% of the filtered result, and most overlaps were only a few bp (mostly 1-10 bp). We then ran EDTA with the filtered RM database.
  2. Another question is whether the output of EDTA could be used as the input database for RepeatMasker, so that the RM and EDTA results can then be combined.

Li Xin

lx-1011 avatar Jan 19 '22 04:01 lx-1011

Hi Li Xin,

Is your RM database redundant or not? You should provide only non-redundant sequences to EDTA. This means you need to use one sequence to represent the entire family it belongs to, and provide a collection of these representative sequences to EDTA. EDTA will use these sequences for homology-based annotation of other similar sequences, using the RepeatMasker that is integrated into EDTA. If you provide redundant sequences, EDTA will use them to annotate your genome repetitively, which is super slow and not meaningful.
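To illustrate "one sequence per family", here is a minimal sketch that keeps the longest copy of each family. Choosing the longest copy is a crude stand-in for a curated or consensus representative, not how EDTA or RepeatModeler builds its library, and the function name and input tuple layout are hypothetical.

```python
def pick_representatives(copies):
    """Keep the longest copy per family.

    copies: iterable of (family, seq_id, sequence) tuples.
    Returns {family: (seq_id, sequence)}; on ties the first copy seen wins.
    """
    best = {}
    for family, seq_id, sequence in copies:
        current = best.get(family)
        if current is None or len(sequence) > len(current[1]):
            best[family] = (seq_id, sequence)
    return best


if __name__ == "__main__":
    copies = [
        ("famA", "12:100-108", "ACGTACGT"),
        ("famA", "12:500-520", "ACGTACGTACGTACGTACGT"),
        ("famB", "12:900-904", "GGCC"),
    ]
    for family, (seq_id, sequence) in pick_representatives(copies).items():
        print(family, seq_id, len(sequence))
```

However the representative is chosen, the point is that the library grows with the number of families rather than with the number of genomic copies.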

The homology result will then be combined with structural results as the final output of EDTA. So there is no need to perform RepeatMasker annotation again.

Shujun

oushujun avatar Jan 22 '22 04:01 oushujun

Hi Shujun, I ran it again and the process is running now. The curatedlib has been filtered with CD-HIT, but it still takes nearly 34 days.

[image]

perl ~/lixin/software/EDTA-2.0.0/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chr.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --anno 1 --evaluate 1 --threads 10 --curatedlib RModeler_pig_rm_merged.rmDup.fa

RModeler_pig_rm_merged.rmDup.fa is derived from Libraries/RepeatMaskerLib.h5. [image]

Thanks and wish you all the best Li Xin

lx-1011 avatar Mar 15 '22 07:03 lx-1011

Hi Li Xin,

It should not take this long. The pig genome is not that big, so something is not right with the run. If a job runs longer than a week, you should start looking for issues.

I think the issue is the --curatedlib you provided to EDTA. How large is it? Judging from the file name RModeler_pig_rm_merged.rmDup.fa, was the library generated by RepeatModeler initially, then used with RepeatMasker to mask the pig genome, after which you extracted the masked sequences and removed duplicates with CD-HIT? If this is the case, you are doing it wrong.

Both EDTA and RepeatModeler can generate a non-redundant TE library. What you want to do is use the SINE/LINE elements in the RepeatModeler library to boost the annotation of EDTA. So you may extract the SINE/LINE sequences from the RepeatModeler library, format the sequence names, and provide them to EDTA via --curatedlib.
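The extraction step could look like the sketch below. It assumes the RepeatModeler library is a FASTA file whose headers have the form `>id#Class/Superfamily`, optionally followed by a description; the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines; header excludes '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_line_sine(lines, wanted=("LINE", "SINE")):
    """Yield (seq_id, sequence) for records classified as LINE or SINE."""
    for header, sequence in read_fasta(lines):
        tokens = header.split()
        if not tokens:
            continue
        seq_id = tokens[0]  # drop the free-text description after the ID
        _, sep, classification = seq_id.partition("#")
        if sep and classification.split("/")[0] in wanted:
            yield seq_id, sequence


def write_fasta(records, path):
    """Write (seq_id, sequence) records to a FASTA file."""
    with open(path, "w") as handle:
        for seq_id, sequence in records:
            handle.write(f">{seq_id}\n{sequence}\n")
```

Because the library is already non-redundant, the extracted subset should contain at most one entry per family, which is the size of input that `--curatedlib` is meant to receive.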

You may also want to read the EDTA paper for how it works.

Shujun

oushujun avatar Mar 15 '22 17:03 oushujun

any luck?

oushujun avatar May 24 '22 03:05 oushujun

Hi Shujun, thanks for your response. I have run it successfully.

Whole-genome TE annotation (total TE: 35.04%): Sus_scrofa.Sscrofa11.1.dna.chr.fa.mod.EDTA.TEanno.gff3

[image]

However, the results show that the percentage of annotated TEs is lower than expected, and EDTA still can't identify SINEs.

perl ~/lixin/software/EDTA-2.0.0/EDTA.pl --genome ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.dna.chr.fa --species others --step all --cds ~/lixin/00_reference_genome/ensembl/Sus_scrofa.Sscrofa11.1.cds.all.fa --overwrite 1 --anno 1 --evaluate 1 --threads 10 --curatedlib ../line_sine.fa

line_sine.fa, filtered with CD-HIT, includes 233 LINEs and 36 SINEs.

lx-1011 avatar May 24 '22 05:05 lx-1011

Thanks for the update. Can you articulate which superfamily or class of TEs is lower than expected? Your result suggests that the 36 SINEs provided are not annotating any SINE elements in your genome. You may use AnnoSINE to generate the SINE library.

Shujun

oushujun avatar May 24 '22 19:05 oushujun