RepeatMasker
Differences in elements between RepeatMasker versions
Hi,
We are masking a teleost genome and have done so with both an older (4.0.7) and the newest (4.1.1) version of RepeatMasker. We have noticed that roughly the same percentage of the genome is masked by both versions, but the percentage of unclassified elements has nearly tripled from the old version to the new. Additionally, the types of elements being classified are totally different (e.g. TcMar-Tigger is listed by the older version but absent from the new, while PiggyBac is in the new but not the old). I'm pretty naive when it comes to the repetitive element world, so there is likely an easy explanation.
Thanks for any explanation!
That is a bit surprising, but might be explained by changes in the libraries of known repeats. What have you been using for the -lib or -species flag, and was it the same for both runs? Did you also have RepBase RepeatMasker Edition installed for both runs, and which version(s) of that?
We created custom libraries and used the -lib flag. For each version, we ran RepeatModeler and combined the resulting consensus library with a teleost-specific library, retrieved using queryRepeatDatabase.pl for version 4.0.7 and famdb.py for version 4.1.1. This was done separately with both versions. I think the differences we are seeing come from the Dfam libraries, which differ between the two versions of RepeatMasker.
We did notice a large increase in the number of repeats in the new teleost libraries. We did not use RepBase RepeatMasker Edition.
I think in that case this is a duplicate of #88 - the equivalent famdb.py command for queryTaxonomyDatabase.pl needs --include-class-in-name.
Please reopen this issue if you see significant differences after adding that option!
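One quick way to confirm whether a library was exported with class information is to look for the `#class` suffix in each FASTA record name. The sketch below assumes the usual RepeatMasker library header convention (`>name#class/family`); the function name is hypothetical and not part of RepeatMasker.

```python
def headers_missing_class(fasta_text):
    """Return the names of FASTA records whose name has no '#class' suffix.

    Assumes RepeatMasker-style headers such as '>name#DNA/TcMar-Tigger'.
    """
    missing = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        # The record name is the first whitespace-delimited token after '>'.
        name = tokens[0] if tokens else ""
        if "#" not in name:
            missing.append(name)
    return missing


example = ">fam1#DNA/TcMar-Tigger\nACGT\n>fam2\nACGT\n"
print(headers_missing_class(example))
```

An empty result suggests the export already carried class names, in which case the missing option is unlikely to be the cause.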
Sorry that I didn't clarify this, but yes, we used --include-class-in-name while extracting the specific library.
Using "equivalent" libraries, these are the output tables. You can see that the % masked is about the same; however, the elements listed are very different.
maker4.1.1_pttt_v1.1.fasta.tbl.txt maker4.70.7_pttt_1.1.fasta.tbl.txt
RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127
This might be relevant: the classifications from RepeatModeler (specifically the RepeatClassifier program) are informed in part by the configured RepeatMasker library, including RepBase RepeatMasker Edition if it is installed in that copy of RepeatMasker.
I think this is exactly what is happening. The older versions of RepeatModeler and RepeatMasker query an older Dfam library (maybe I'm interpreting the webpage wrong, but it looks like the Dfam library included in RepeatMasker 4.1.1 is an updated version, Dfam 3.2).
Sorry, I am quickly getting lost in all the possible changes between the different versions. Could you help verify and correct this summary so that all the changes are listed in one place?
Consensus Library 1 ("old") was generated with:
- RepeatMasker version: 4.0.7
- RepBase RepeatMasker Edition version: RepBase-20170127
- RepeatModeler version:
- Total number of families:
Consensus Library 2 ("new", more "Unclassified") was generated with:
- RepeatMasker version: 4.1.1
- RepBase RepeatMasker Edition version: Not installed
- RepeatModeler version:
- Total number of families: (increased?)
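To fill in the family counts and see where the classifications diverge, the two consensus libraries can be tallied by class. This is only a sketch: it assumes RepeatMasker-style FASTA headers (`>name#class/family`, with `Unknown` for unclassified families), and the function names are hypothetical.

```python
from collections import Counter


def class_counts(fasta_text):
    """Count families per class from '>name#class' FASTA headers."""
    counts = Counter()
    for line in fasta_text.splitlines():
        tokens = line[1:].split() if line.startswith(">") else []
        if not tokens:
            continue  # sequence line or empty header
        _, _, cls = tokens[0].partition("#")
        # Headers without a '#class' suffix are reported separately.
        counts[cls or "(no class in name)"] += 1
    return counts


def compare_libraries(old_text, new_text):
    """Map each class to (count in old library, count in new library)."""
    old, new = class_counts(old_text), class_counts(new_text)
    return {c: (old.get(c, 0), new.get(c, 0)) for c in sorted(set(old) | set(new))}


old_lib = ">f1#DNA/TcMar-Tigger\nAC\n>f2#Unknown\nGT\n"
new_lib = ">g1#DNA/PiggyBac\nAC\n>g2#Unknown\nGT\n>g3#Unknown\nTT\n"
for cls, (n_old, n_new) in compare_libraries(old_lib, new_lib).items():
    print(f"{cls}\told={n_old}\tnew={n_new}")
```

The total number of families in a library is the sum of the counts, e.g. `sum(class_counts(new_lib).values())`.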
I have two ideas that might explain the difference in classification, but I am not sure about the change in number of discovered families:
- We recently identified a bug in RepeatMasker 4.1.1 which affects classifications from RepeatClassifier that are based on similarity to TEs. (Classifications based on homology with known protein sequences are unaffected.) The bug can be rectified by running this command inside the RepeatMasker directory:
  ./famdb.py -i ./Libraries/RepeatMaskerLib.h5 families --descendants 1 --curated --format fasta_name --include-class-in-name > ./Libraries/RepeatMasker.lib
  and then re-running RepeatClassifier on the consensus library (re-running all of RepeatModeler is not necessary).
- RepBase RepeatMasker Edition, which appears to be installed in the older but not the newer version. This change would "lose" any classifications based on similarity to a consensus sequence in RepBase. To test this change alone, you can compare the original output from RepeatModeler+RepeatMasker 4.0.7+RepBase RM Edition vs the output re-run through RepeatClassifier with RepeatModeler+RepeatMasker 4.0.7 without RepBase RM Edition.
Hopefully either or both of these differences explain most of the changes in output.
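For anyone scripting the first fix, the library rebuild can be wrapped as below. This is a sketch only: the famdb.py arguments are copied from the command above, while the function names and the `repeatmasker_dir` parameter are hypothetical. RepeatClassifier still has to be re-run on the consensus library afterwards.

```python
import subprocess
from pathlib import Path


def famdb_export_command(h5="./Libraries/RepeatMaskerLib.h5"):
    """Build the famdb.py argument list, mirroring the command quoted above."""
    return [
        "./famdb.py", "-i", h5, "families",
        "--descendants", "1", "--curated",
        "--format", "fasta_name", "--include-class-in-name",
    ]


def rebuild_library(repeatmasker_dir, out_name="Libraries/RepeatMasker.lib"):
    """Run famdb.py inside repeatmasker_dir and write its stdout to the library file."""
    out_path = Path(repeatmasker_dir) / out_name
    with open(out_path, "w") as out:
        # check=True raises CalledProcessError if famdb.py exits with an error.
        subprocess.run(famdb_export_command(), cwd=repeatmasker_dir,
                       stdout=out, check=True)
    return out_path


print(" ".join(famdb_export_command()))
```

As with the shell redirect, the output file is truncated before famdb.py runs, so it may be worth backing up the existing RepeatMasker.lib first.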
Consensus Library 1 ("old") was generated with:
- RepeatMasker version: 4.0.7
- RepBase RepeatMasker Edition version: RepBase-20170127
- RepeatModeler version: 1.0.11
- Total number of families: 11 families
Consensus Library 2 ("new", more "Unclassified") was generated with:
- RepeatMasker version: 4.1.1
- RepBase RepeatMasker Edition version: Not installed
- RepeatModeler version: 2.0.1
- Total number of families: (increased?) ~18, but totally different from those found with the previous version of RepeatMasker.
Sorry for the long wait.
By my estimate, the latest version of RepBase RepeatMasker Edition (which may be different from 20170127) includes about 2000 more repeat families than Dfam (counting only "curated" / DF records), within the taxon Teleostei and its descendants. This might make a lot of difference; running RepeatClassifier standalone twice on the same library but configured with different RepeatMasker installations would confirm this possibility.
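Once the same consensus library has been classified under both installations, the two outputs can be compared family by family. The sketch below assumes both files keep the same family names and use `>name#class` headers; the function names are hypothetical.

```python
def family_classes(fasta_text):
    """Map each family name to its class, parsed from '>name#class' headers."""
    classes = {}
    for line in fasta_text.splitlines():
        tokens = line[1:].split() if line.startswith(">") else []
        if not tokens:
            continue  # sequence line or empty header
        family, _, cls = tokens[0].partition("#")
        classes[family] = cls or "Unknown"
    return classes


def changed_classifications(run_a_text, run_b_text):
    """Return {family: (class in run A, class in run B)} for families that differ."""
    a, b = family_classes(run_a_text), family_classes(run_b_text)
    shared = sorted(a.keys() & b.keys())
    return {fam: (a[fam], b[fam]) for fam in shared if a[fam] != b[fam]}


with_repbase = ">rnd-1_family-1#DNA/TcMar-Tigger\nAC\n>rnd-1_family-2#LINE/L2\nGT\n"
without_repbase = ">rnd-1_family-1#Unknown\nAC\n>rnd-1_family-2#LINE/L2\nGT\n"
print(changed_classifications(with_repbase, without_repbase))
```

Families that move from a named class to `Unknown` between the two runs would be the ones whose classification depended on the library that is missing from the second installation.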
I did not consider it before, since I was treating it as a RepeatMasker issue, but it is also likely that the libraries are different in part because RepeatModeler uses a sampling approach and discovered different sets of elements in each run. This can be controlled by setting RepeatModeler's -srand parameter to the same value as a seed produced by a prior run.