Differences in elements between repeatmasker versions

Open fishes2catch opened this issue 4 years ago • 10 comments

Hi,

We are masking a teleost genome and have done so with both an older (4.0.7) and the newest (4.1.1) version of RepeatMasker. We have noticed that roughly the same percentage of the genome is masked by both versions; however, the percentage of unclassified elements has nearly tripled from the old version to the new. Additionally, the types of elements being classified are totally different (e.g. TcMar-Tigger was listed by the older version but is absent in the new, and PiggyBac is in the new but not the old). I'm pretty naive when it comes to the repetitive element world, so there is likely an easy explanation.

Thanks for any explanation!

fishes2catch avatar Dec 01 '20 01:12 fishes2catch

That is a bit surprising, but might be explained by changes in the libraries of known repeats. What have you been using for the -lib or -species flag, and was it the same for both runs? Did you also have RepBase RepeatMasker Edition installed for both runs, and which version(s) of that?

jebrosen avatar Dec 01 '20 01:12 jebrosen

We created custom libraries and used the -lib flag. For each library, we ran RepeatModeler and combined the resulting consensus library with a teleost-specific library, which was retrieved with queryRepeatDatabase.pl in version 4.0.7 and with famdb.py in version 4.1.1. This was done separately with both versions. I think the differences we are seeing come from the differences between the Dfam libraries shipped with the two versions of RepeatMasker.

We did notice a large increase in the number of repeats in the new teleost libraries. We did not use RepBase RepeatMasker Edition.

fishes2catch avatar Dec 01 '20 15:12 fishes2catch

I think in that case this is a duplicate of #88 - the equivalent famdb.py command for queryTaxonomyDatabase.pl needs --include-class-in-name.

Please reopen this issue if you see significant differences after adding that option!
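As a quick way to confirm whether an extracted library actually carries class information, here is a minimal Python sketch. It assumes the usual `>name#Class/Subclass` header convention used in RepeatMasker-style FASTA libraries; the function name and the example records are hypothetical and not part of RepeatMasker.

```python
def headers_missing_class(fasta_text):
    """Return the names of FASTA records whose ID has no '#Class' suffix."""
    missing = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        # The record ID is the first whitespace-delimited token after '>'.
        tokens = line[1:].split()
        record_id = tokens[0] if tokens else ""
        name, sep, repeat_class = record_id.partition("#")
        if not sep or not repeat_class:
            missing.append(name)
    return missing


# Hypothetical records: the second one lacks a class, as it would if the
# library had been exported without --include-class-in-name.
example = ">rnd-1_family-1#DNA/TcMar-Tigger\nACGT\n>MER1A\nACGT\n"
print(headers_missing_class(example))
```

For a real library, read the file contents first (for example with `open(path).read()`) and pass the text in; an empty result suggests every record has a class in its name.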

jebrosen avatar Dec 01 '20 17:12 jebrosen

Sorry that I didn't clarify this, but yes, we used --include-class-in-name while extracting the specific library.

fishes2catch avatar Dec 01 '20 17:12 fishes2catch

These are the output tables from runs using "equivalent" libraries. You can see that the % masked is about the same; however, the elements listed are very different.

maker4.1.1_pttt_v1.1.fasta.tbl.txt maker4.70.7_pttt_1.1.fasta.tbl.txt

fishes2catch avatar Dec 02 '20 01:12 fishes2catch

RepeatMasker Combined Database: Dfam_Consensus-20170127, RepBase-20170127

This might be relevant: the classifications from RepeatModeler (specifically the RepeatClassifier program) are informed in part by the configured RepeatMasker library, including RepBase RepeatMasker edition if it is installed in that copy of RepeatMasker.

jebrosen avatar Dec 04 '20 22:12 jebrosen

I think this is exactly what is happening. The older versions of RepeatModeler and RepeatMasker query an older Dfam library (maybe I'm interpreting the webpage wrong, but it looks like the Dfam library included in RepeatMasker 4.1.1 is an updated version, Dfam 3.2).

fishes2catch avatar Dec 07 '20 20:12 fishes2catch

Sorry, I am quickly getting lost in all the possible changes between the different versions. Could you help verify and correct this summary so that all the changes are listed in one place?


Consensus Library 1 ("old") was generated with:

  • RepeatMasker version: 4.0.7
  • RepBase RepeatMasker Edition version: RepBase-20170127
  • RepeatModeler version:
  • Total number of families:

Consensus Library 2 ("new", more "Unclassified") was generated with:

  • RepeatMasker version: 4.1.1
  • RepBase RepeatMasker Edition version: Not installed
  • RepeatModeler version:
  • Total number of families: (increased?)

I have two ideas that might explain the difference in classification, but I am not sure about the change in number of discovered families:

  • We recently identified a bug in RepeatMasker 4.1.1 which affects classifications from RepeatClassifier that are based on similarity to TEs. (Classifications based on homology with known protein sequences are unaffected.) The bug can be rectified by running this command inside the RepeatMasker directory:

    ./famdb.py -i ./Libraries/RepeatMaskerLib.h5 families --descendants 1 --curated --format fasta_name --include-class-in-name > ./Libraries/RepeatMasker.lib

    and then re-running RepeatClassifier on the consensus library (re-running all of RepeatModeler is not necessary).

  • RepBase RepeatMasker Edition, which appears to be installed in the older but not the newer version. This change would "lose" any classifications based on similarity to a consensus sequence in RepBase. To test this change alone, you can compare the original output from RepeatModeler+RepeatMasker 4.0.7+RepBase RM Edition vs the output re-run through RepeatClassifier with RepeatModeler+RepeatMasker 4.0.7 without RepBase RM Edition.

Hopefully one or both of these differences explains most of the changes in output.

jebrosen avatar Dec 07 '20 22:12 jebrosen

Consensus Library 1 ("old") was generated with:

  • RepeatMasker version: 4.0.7
  • RepBase RepeatMasker Edition version: RepBase-20170127
  • RepeatModeler version: 1.0.11
  • Total number of families: 11 families

Consensus Library 2 ("new", more "Unclassified") was generated with:

  • RepeatMasker version: 4.1.1
  • RepBase RepeatMasker Edition version: Not installed
  • RepeatModeler version: 2.0.1
  • Total number of families: ~18, but totally different from those found with the previous version of RepeatMasker

fishes2catch avatar Dec 22 '20 18:12 fishes2catch

Sorry for the long wait.

By my estimate, the latest version of RepBase RepeatMasker Edition (which may be different from 20170127) includes about 2000 more repeat families than Dfam (counting only "curated" / DF records), within the taxon Teleostei and its descendants. This might make a lot of difference; running RepeatClassifier standalone twice on the same library but configured with different RepeatMasker installations would confirm this possibility.
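To compare the two RepeatClassifier outputs suggested above, a small Python sketch like the following could tally the classes in each library side by side. It assumes each FASTA header has the form `>name#Class/Subclass`; the function names, the "(no class)" label, and the example records are illustrative assumptions, not part of RepeatMasker or RepeatModeler.

```python
from collections import Counter


def class_counts(fasta_text):
    """Count records per class, read from the '#Class' part of each header."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        tokens = line[1:].split()
        record_id = tokens[0] if tokens else ""
        _, sep, repeat_class = record_id.partition("#")
        # Records without any class suffix are grouped under a hypothetical label.
        counts[repeat_class if sep and repeat_class else "(no class)"] += 1
    return counts


def compare_libraries(old_text, new_text):
    """Map each class to (count in old library, count in new library)."""
    old, new = class_counts(old_text), class_counts(new_text)
    return {c: (old[c], new[c]) for c in sorted(set(old) | set(new))}


# Hypothetical libraries, for illustration only.
old_lib = ">f1#DNA/TcMar-Tigger\nAC\n>f2#Unknown\nGT\n"
new_lib = ">f1#DNA/PiggyBac\nAC\n>f2#Unknown\nGT\n>f3#Unknown\nTT\n"
for repeat_class, (n_old, n_new) in compare_libraries(old_lib, new_lib).items():
    print(f"{repeat_class}\t{n_old}\t{n_new}")
```

Classes that appear in only one library show a zero in the other column, which makes gains in "Unknown" and swapped family names easy to spot.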

I did not consider it before, since I was treating it as a RepeatMasker issue, but it is also likely that the libraries are different in part because RepeatModeler uses a sampling approach and discovered different sets of elements in each run. This can be controlled by setting RepeatModeler's -srand parameter to the same value as a seed produced by a prior run.
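For reproducible reruns, one option is to build the RepeatModeler command line in a script so that the seed is always recorded alongside the database name. This is only a sketch: the helper name, the database name `pttt_db`, and the seed value are made up, and only the `-database` and `-srand` options are used.

```python
import shlex


def repeatmodeler_command(database, seed):
    """Build an argument list that reruns RepeatModeler with a fixed seed."""
    # int() rejects non-numeric seeds early; the seed should be the value
    # reported by the prior run you want to reproduce.
    return ["RepeatModeler", "-database", database, "-srand", str(int(seed))]


# Hypothetical database name and seed, for illustration only.
print(shlex.join(repeatmodeler_command("pttt_db", 1607385600)))
```

The returned list can be passed to `subprocess.run` directly; printing it with `shlex.join` gives a line that can be pasted into a shell or kept in a lab notebook.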

jebrosen avatar Feb 08 '21 19:02 jebrosen