Introduction

This repository contains three historical versions of curated maize TE libraries derived from the Maize TE Consortium (MTEC). I combined the three and further cleaned the combined library using the commands and curation steps below. If you are looking for a comprehensive, high-quality maize TE library, look no further: this is the one (the file in the root directory, usually named like "maizeTE02052020").

Files

maizeTE10102014 was downloaded from the MTEC official website (http://maizesequence.org). The website is gone, but I managed to get a mirror of the first page. Please refer to the file history/Maizedatabase_mirror.pdf for more information about the MTEC project.
TE_12-Feb-2015_15-35.fa was shared by Nicolas Blavet from https://github.com/mcstitzer/maize_v4_TE_annotation/issues/9.
Wessler-Bennetzen_2.fasta, shared by Kapeel Chougule, was used to annotate the initial B73 genome (Schnable et al. 2009). I believe this is an earlier version of the MTEC library.
nonTE.repeat.fa, shared by Jianing Liu, contains 5 non-TE repeats in maize (knob180, knob TR-1, rDNA spacer, subtelomere 4-12-1, and CentC).

Combine the three MTEC libraries + nonTE repeats

1. Reformat sequence IDs

for i in history/Wessler-Bennetzen_2.fasta history/maizeTE10102014 history/TE_12-Feb-2015_15-35.fa; do perl -nle 's/\s+$//g; $_=(split)[0]; s/\-/_/g; print $_' $i > $i.mod; done
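
To make the effect of this step easier to see, here is a minimal sketch of the same three operations (strip trailing whitespace, keep the first field, turn hyphens into underscores) written in awk instead of Perl. The helper name `reformat_ids` and the description text in the sample header are made up for illustration; this is not the command used to build the library.

```shell
# Sketch only: an awk equivalent of the Perl one-liner above, wrapped in a
# hypothetical helper so it can be tried on a single record.
reformat_ids() {
  awk '{
    sub(/[ \t\r]+$/, "")   # strip trailing whitespace
    $0 = $1                # keep only the first whitespace-delimited field
    gsub(/-/, "_")         # hyphens become underscores
    print
  }'
}

# The description after the ID and the sequence are invented for the example.
printf '>RLX_teki_AC202867-7492 example description\nACGTACGT\n' | reformat_ids
```

Sequence lines normally contain no whitespace or hyphens, so only the header lines change.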

2. Combine sequences with unique IDs

perl bin/output_by_list.pl 1 <(cat history/*.mod) 1 <(cat history/*.mod|grep \>|sort -u) -FA > history/maizeTE11212019.ori
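
The goal of this step is one record per sequence ID. As a rough illustration of that idea, the sketch below keeps the first record seen for each ID and skips later copies. Which copy bin/output_by_list.pl actually keeps is not stated here, so "first wins" is an assumption, and `dedup_fasta` is a hypothetical helper name.

```shell
# Sketch only: keep the first FASTA record for each ID, drop later duplicates.
dedup_fasta() {
  awk '/^>/ { keep = !seen[$1]++ }   # decide once per header line
       keep'                         # print header and sequence lines while keep is true
}

# Toy records: ID "A" appears twice, only its first copy is printed.
printf '>A\nAC\n>B\nGG\n>A\nTT\n' | dedup_fasta
```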

3. Split the library into consensus and others

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori 1 <(grep consen history/maizeTE11212019.ori) -FA > history/maizeTE11212019.ori.consensus

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori 1 <(grep consen history/maizeTE11212019.ori) -FA -ex > history/maizeTE11212019.ori.others
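
The two commands above route each record by whether its header contains "consen". A minimal single-pass sketch of the same split is shown below, assuming plain FASTA input. The helper name `split_consensus`, the temporary paths, and the toy sequences are all hypothetical.

```shell
# Sketch only: send records whose header contains "consen" to one file and
# everything else to another.
split_consensus() {
  # $1: input FASTA, $2: output for consensus entries, $3: output for the rest
  : > "$2"; : > "$3"
  awk -v cons="$2" -v others="$3" '
    /^>/ { out = ($0 ~ /consen/) ? cons : others }   # pick the target per record
    out != "" { print > out }
  ' "$1"
}

tmp=$(mktemp -d)
printf '>DTA_ZM00171_consensus\nACGT\n>RLC_pute_AC197188_5467\nTTGA\n' > "$tmp/lib.fa"
split_consensus "$tmp/lib.fa" "$tmp/consensus.fa" "$tmp/others.fa"
cat "$tmp/consensus.fa"
```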

4. Remove TEs in others that are represented by consensus TEs

RepeatMasker -pa 36 -div 40 -lib history/maizeTE11212019.ori.consensus -cutoff 225 history/maizeTE11212019.ori.others

perl bin/make_masked.pl -rmout history/maizeTE11212019.ori.others.out -genome history/maizeTE11212019.ori.others -maxdiv 20 -minscore 200 -minlen 80 -t 30

perl bin/cleanup_tandem.pl -nc 1000 -nr 0.5 -minlen 80 -cleanN 1 -cleanT 1 -trf 0 -f history/maizeTE11212019.ori.others.new.masked > history/maizeTE11212019.ori.others.new.masked.cln

cat history/maizeTE11212019.ori.consensus history/maizeTE11212019.ori.others.new.masked.cln > history/maizeTE11212019.ori2

5. Remove redundant sequences

perl bin/cleanup_nested.pl -in history/maizeTE11212019.ori2 -cov 0.98 -minlen 80 -miniden 95 -iter 2 -t 36

6. Remove nonTE repeats and tandem repeats

RepeatMasker -pa 36 -div 40 -no_is -norna -nolow -lib history/nonTE.repeat.fa -cutoff 225 history/maizeTE11212019.ori2.cln

perl bin/cleanup_tandem.pl -nc 1000 -nr 0.5 -minlen 80 -cleanN 1 -cleanT 1 -trf 1 -f history/maizeTE11212019.ori2.cln.masked > history/maizeTE11212019.ori2.cln2

cat history/nonTE.repeat.fa history/maizeTE11212019.ori2.cln2 > history/maizeTE11212019.ori3

Improve the combined library

1. Reclassify unknown TEs

python2 TEsorter.py history/maizeTE11212019.ori3 -p 36

2. Find misclassified entries

The file history/maizeTE11212019.ori3.rexdb.cls.tsv contains the new classifications of the library. Most of them are consistent with the old classifications. The main improvement is in the LTR/unknown entries.

perl -nle '($info, $cla)=(split)[0,2]; my $oldcla=$1 if $info=~/^([A-Z]+)_/; $cla=~s/EnSpm_CACTA/DTC/; $cla=~s/hAT/DTA/; $cla=~s/PIF_Harbinger/DTH/; $cla=~s/MuDR_Mutator/DTM/; $cla=~s/Tc1_Mariner/DTT/; $cla=~s/Gypsy/RLG/; $cla=~s/Copia/RLC/; print "$oldcla\t$cla\t$info" if $cla ne $oldcla' history/maizeTE11212019.ori3.rexdb.cls.tsv |less
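
The one-liner above compares the three-letter code at the start of each ID with the superfamily in the third column, after translating superfamily names into the same three-letter codes. The sketch below replays that logic in awk on two made-up rows, assuming the ID is in column 1 and the superfamily in column 3 as the one-liner does. `compare_classes` is a hypothetical helper name and the classifications in the sample rows are invented.

```shell
# Sketch only: print "old code, new code, ID" when the two disagree.
compare_classes() {
  awk '{
    info = $1; cla = $3
    old = ""
    # leading run of capital letters before the first underscore
    if (match(info, /^[A-Z]+_/)) old = substr(info, 1, RLENGTH - 1)
    sub(/EnSpm_CACTA/, "DTC", cla); sub(/hAT/, "DTA", cla)
    sub(/PIF_Harbinger/, "DTH", cla); sub(/MuDR_Mutator/, "DTM", cla)
    sub(/Tc1_Mariner/, "DTT", cla)
    sub(/Gypsy/, "RLG", cla); sub(/Copia/, "RLC", cla)
    if (cla != old) print old "\t" cla "\t" info
  }'
}

# Invented rows: the first disagrees (RLX vs Copia), the second agrees and is skipped.
printf 'RLX_pute_AC197188_5467\tLTR\tCopia\nRLG_ajajog_AC191578_3186\tLTR\tGypsy\n' | compare_classes
```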

3. Some LTRs appear to have the same name but different classifications (RLG/RLC/RLX)

4. Manually check misclassified sequences (`list.fa`).

These entries are listed in history/removal.list and removed from the library:

Seq_ID	Removal reason
RLC_chr3_D_28761151	rDNA-contained
DTM_Zm08959_AC199876_1	LTRcoding-contained
DTM_Zm22805IC_AC207689_1	LINE-contained
RIX_nugimu_AC203843_0	Duplicted_with_RIL_nugimu_AC203843_0
RLX_fageri_AC204875_8470	misclassified_as_LINE
DTA_ZM00171_consensus	misclassified_as_CACTA
DTA_ZM00205_consensus	misclassified_as_CACTA
DTA_ZM00284_consensus	misclassified_as_CACTA
RLX_teki_AC202867-7492	rDNA-contained
RLG_ajajog_AC191578_3186	A_RLG_nested_in_RLC_ajajog_AC191578_3186
RLC_iwim_AC203300_7761	misclassified_RLG_duplicated
RLC_kupu_AC216069_13264	misclassified_RLG_duplicated
RLX_pute_AC197188_5467	duplicated_RLC_pute_AC197188_5467
RLX_votaed_AC215881_13209	duplicated_RLC_votaed_AC215881_13209
RLC_votaed_AC215881_13209	5-6_LTR_nested_together
RLX_bobeg_AC193485_3670	5_LTR_nested_together

perl bin/output_by_list.pl 1 history/maizeTE11212019.ori3 1 history/removal.list -FA -ex > history/maizeTE11212019.ori3.cln
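
For readers without bin/output_by_list.pl at hand, the exclusion can be pictured as follows: read the IDs in the first column of the removal list, then skip every FASTA record whose ID is in that set. This is a sketch of the idea under that assumption, not the script itself. `remove_listed`, the temporary paths, and the toy sequences are hypothetical.

```shell
# Sketch only: drop FASTA records whose ID appears in column 1 of a list.
remove_listed() {
  # $1: removal list (ID in column 1), $2: FASTA library
  awk 'NR == FNR { drop[">" $1] = 1; next }   # first file: collect IDs
       /^>/ { keep = !($1 in drop) }          # second file: decide per record
       keep' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'Seq_ID\tRemoval reason\nRLC_chr3_D_28761151\trDNA-contained\n' > "$tmp/removal.list"
printf '>RLC_chr3_D_28761151\nACGT\n>RLC_pute_AC197188_5467\nTTGA\n' > "$tmp/lib.fa"
remove_listed "$tmp/removal.list" "$tmp/lib.fa"
```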

5. PPP_PPO_AC185414 is changed to DTH_PPO_AC185414 manually

6. Update LTR classifications

perl -nle '($info, $cla)=(split)[0,2]; my $oldcla=$1 if $info=~/^([A-Z]+)_/; $cla=~s/EnSpm_CACTA/DTC/; $cla=~s/hAT/DTA/; $cla=~s/PIF_Harbinger/DTH/; $cla=~s/MuDR_Mutator/DTM/; $cla=~s/Tc1_Mariner/DTT/; $cla=~s/Gypsy/RLG/; $cla=~s/Copia/RLC/; next unless /LTR/; my $info_new=$info; $info_new=~s/$oldcla/$cla/; print "$info|$info_new" if $cla ne $oldcla' history/maizeTE11212019.ori3.rexdb.cls.tsv > history/maizeTE11212019.ori3.rexdb.cls.tsv.LTR

for i in $(cat history/maizeTE11212019.ori3.rexdb.cls.tsv.LTR); do perl -i -slane 'my ($old, $new)=(split /\|/, $info); s/$old/$new/; print $_' -- -info="$i" history/maizeTE11212019.ori3.cln; done
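
The loop rewrites the library once per "old|new" pair. A single-pass alternative is sketched below: load all pairs into a lookup table, then replace any header that matches an old ID exactly. Exact header matching is a deliberate difference from the regex substitution in the loop, and it assumes headers contain only the ID (true after the reformatting step). `apply_renames`, the temporary paths, and the sample IDs are hypothetical.

```shell
# Sketch only: apply every "old|new" rename in one pass over the FASTA file.
apply_renames() {
  # $1: mapping file with one "old|new" pair per line, $2: FASTA library
  awk -F'|' 'NR == FNR { map[">" $1] = ">" $2; next }   # load the pairs
             ($0 in map) { print map[$0]; next }        # rename matching headers
             { print }' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'RLX_example_AC000000_1|RLG_example_AC000000_1\n' > "$tmp/renames.txt"
printf '>RLX_example_AC000000_1\nACGT\n>DTA_other_AC000000_2\nTTGA\n' > "$tmp/lib.fa"
apply_renames "$tmp/renames.txt" "$tmp/lib.fa"
```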

7. Convert sequence names to RepeatMasker format

perl -nle 'my $id=(split)[0]; $id=~s/RLC_(.*)/$1#LTR\/Copia/; $id=~s/RLG_(.*)/$1#LTR\/Gypsy/; $id=~s/RLX_(.*)/$1#LTR\/unknown/; $id=~s/DHH_(.*)/$1#DNA\/Helitron/; $id=~s/DTA_(.*)/$1#DNA\/DTA/; $id=~s/DTC_(.*)/$1#DNA\/DTC/; $id=~s/DTH_(.*)/$1#DNA\/DTH/; $id=~s/DTM_(.*)/$1#DNA\/DTM/; $id=~s/DTT_(.*)/$1#DNA\/DTT/; $id=~s/(RIT_.*)/$1#LINE\/RTE/; $id=~s/(RIL_.*)/$1#LINE\/L1/; $id=~s/(RIX_.*)/$1#LINE\/unknown/; $id=~s/(ZM_CACTA_noncoding.*)/$1#MITE\/DTC/; $id=~s/(ZM_Stowaway.*)/$1#DNA\/DTT/; $id=~s/(ZM_Tourist.*)/$1#DNA\/DTH/; $id=~s/(ZM_hAT_noncoding.*)/$1#MITE\/DTA/; $id=~s/(RST_.*)/$1#SINE\/tRNA/; print $id' history/maizeTE11212019.ori3.cln > history/maizeTE11222019.ori
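
Note that the rules above come in two flavors: LTR and DNA transposon prefixes (RLC_, RLG_, DTA_, ...) are stripped from the name, while LINE, SINE, and ZM_ prefixes are kept and only the "#class/family" suffix is appended. The sketch below shows three of the rules as sed expressions so the difference is visible on sample IDs. `to_repeatmasker` is a hypothetical helper and covers only these three rules.

```shell
# Sketch only: three of the renaming rules, rewritten as sed expressions.
to_repeatmasker() {
  sed -e 's|RLC_\(.*\)|\1#LTR/Copia|' \
      -e 's|RLG_\(.*\)|\1#LTR/Gypsy|' \
      -e 's|\(RIL_.*\)|\1#LINE/L1|'
}

# The RLC_ prefix is dropped, the RIL_ prefix is kept.
printf '>RLC_pute_AC197188_5467\n>RIL_nugimu_AC203843_0\n' | to_repeatmasker
```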

8. Rename short TIR (<= 600bp) to MITE

perl bin/rename_MITE.pl history/maizeTE11222019.ori > history/maizeTE11222019.ori.rename
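
The internals of bin/rename_MITE.pl are not shown here, so the following is only a guess at the kind of rule involved: measure each sequence and, for TIR classes (DNA/DTA, DNA/DTC, ...) at or below a length cutoff, replace "DNA/" with "MITE/" in the header. The helper name `rename_short_tir`, the header pattern, and the one-line sequence output are assumptions of this sketch.

```shell
# Sketch only: relabel short TIR elements as MITEs based on sequence length.
rename_short_tir() {
  # $1: maximum length (bp) still treated as a MITE; FASTA is read from stdin
  awk -v max="$1" '
    function emit() {
      if (hdr == "") return
      # only TIR classes such as DNA/DTA are relabeled, and only if short enough
      if (length(seq) <= max && hdr ~ /#DNA\/DT[A-Z]$/) sub(/#DNA\//, "#MITE/", hdr)
      print hdr
      print seq            # sequence is written unwrapped on one line
    }
    /^>/ { emit(); hdr = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  '
}

# A cutoff of 8 bp keeps the toy example short; the step above uses 600 bp.
printf '>a#DNA/DTA\nACGTACGT\n>b#DNA/DTA\nACGTACGTACGT\n' | rename_short_tir 8
```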

Remove gene sequences

1. Mask gene cds

RepeatMasker -pa 36 -q -no_is -norna -nolow -div 40 -lib history/Zea_mays.AGPv4.cds.all.noTE.fa.mod.cln -cutoff 500 history/maizeTE11222019.ori.rename

perl bin/cleanup_tandem.pl -misschar n -Nscreen 1 -nc 1000 -nr 0.3 -minlen 80 -maxlen 5000000 -cleanN 1 -cleanT 1 -trf 0 -f history/maizeTE11222019.ori.rename.masked > history/maizeTE11222019.ori.rename.nogene

2. Finalize

Manually add the sequence CL569186.1#subtelomere/4-12-1 back to history/maizeTE11222019.ori.rename.nogene

cp history/maizeTE11222019.ori.rename.nogene maizeTE11222019

Updates

01/30/2020

Added four CRM sequences (CRM1-4) from Sharma and Presting (2008), contributed by Na Wang.

02/03/2020

Added three-letter class codes in front of all consensus sequence IDs. IDs like "ZM00034_consensus" were inherited from the 2014 version of MTEC, so the rest of the ID is kept unchanged.

perl -nle 's/>(.*)#(.*)\/(.*)/>$3_$1#$2\/$3/; print $_' maizeTE01302020 > maizeTE02032020

02/05/2020

Fixed naming

perl -nle 's/>(.*)#(.*)\/(.*)/>$3_$1#$2\/$3/ if /consensus/; s/>(.*)/>$1#LTR\/CRM/ if /CRM/; print $_' history/maizeTE01302020 > maizeTE02052020
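
To show what the final naming looks like, the sketch below applies the same two rules with sed: consensus entries get their family code copied to the front of the ID, and CRM entries get an "#LTR/CRM" suffix. The sample headers are illustrative only: the DNA/DTA class for ZM00034_consensus and the bare "CRM1" header are assumptions, and `fix_names` is a hypothetical helper.

```shell
# Sketch only: sed version of the two naming rules from the command above.
fix_names() {
  sed -e '/consensus/s|>\(.*\)#\(.*\)/\(.*\)|>\3_\1#\2/\3|' \
      -e '/CRM/s|>\(.*\)|>\1#LTR/CRM|'
}

printf '>ZM00034_consensus#DNA/DTA\n>CRM1\n' | fix_names
```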
