RepeatMasker output .tbl file identifies no library elements when using Dfam library

What do you want to know?

Hello, I want to identify and mask repeats in a non-reference Brassicaceae genome. I was able to view the list of taxa for which there are libraries in the Dfam database, and besides Arabidopsis (the closest relative to my input genome), there are libraries for tribal and family ancestors, as well as many more up the hierarchy. I ran RepeatMasker first specifying the Brassicaceae library and then the Camelineae (tribal) library. Whichever one I specify, the output .tbl file identifies no library elements at all, only simple repeats.

[Screenshot from 2023-08-21 21-33-01]

Helpful context

I used the command

RepeatMasker -species brassicaceae -xsmall CO46.fna

And alternatively:

-species Camelineae

Previously I also ran it with

-species arabidopsis

but after that I began to think I should use only an ancestral taxon. In any case, all three library choices produced an identical output table with no library matches.

Is there a particular genome assembly or organism your question is about? If possible, please provide a link to a publicly available assembly and/or a species name.

The input organism is not available publicly and has no reference assembly. It is an unreferenced biotype of the reference species Camelina sativa.

Have you installed RepBase RepeatMasker Edition for RepeatMasker?

I'm not sure this question applies? I am using a server installation provided by my organization. The RepeatMasker version is 4.1.5. To view available Dfam libraries, I downloaded the same RepeatMasker version as the installation and ran

python famdb.py -i Libraries/Dfam.h5 lineage brassicaceae -ad

That command output the taxonomic hierarchy through Camelineae and its descendants.

I want to emphasize that I am not attempting to use a custom library; I am requesting search, classification, and masking against supported libraries. Can you please tell me what I am doing wrong?

Thank you, Barbara Dobrin

Aug 22 '23 03:08 bhdobrin

Barbara,

Greetings! The command:

$ python famdb.py -i Libraries/Dfam.h5 lineage brassicaceae -ad

does indeed show the full lineage. However, with these settings there is a '[0]' after each taxon name (with the exception of the root), indicating that there are no records for the taxa you have indicated. A Dfam browse search by taxon, including uncurated records (with both ancestors and descendants checked), also yields no entries.

Aug 22 '23 15:08 JMStorer

Thank you for the very prompt reply.

Since I posted my comment, I ran the famdb.py command against my organization's RepeatMasker installation (previously I ran it on my local copy), and the output is the same: all Dfam.h5 libraries are empty except root.

Can you explain what I am to make of this? My understanding was that the RepeatMasker installation includes repeat libraries specific to many taxa, including plants. But your email, and my own inspection, indicate that your default Dfam database includes no Eukaryote libraries, let alone plants. Are users expected to import a complete version of Dfam? Also, in the libraries folder, I see another library, called "RepeatMaskerLib.h5". Should I specify this library instead? If no Dfam database version includes plants, and "RepeatMaskerLib" doesn't provide an alternative, should I use a RepBase library? Does any native (i.e., preinstalled) RepeatMasker library include plants, or eukaryotes?

I could run RepeatMasker specifying the species "root", but for my taxon, I do not know whether there is utility in searching a prokaryote database.

Thank you very much, B.Dobrin

Aug 22 '23 21:08 bhdobrin

Responding again to JMStorer. I am sorry, are you one of the repository owners? Are you another user? Were you trying to advise me on a course of action, or did you simply mean to observe that the default Dfam database is empty of libraries below the root level?

Aug 22 '23 23:08 bhdobrin

I am another user! I used to be a Dfam curator, and when I saw your post, I wanted to help figure out the cause of your issue. I apologize if this was not the case. Unless the Repbase repository is also installed as part of RepeatMasker, you will likely not be able to detect more than what you already have. You can also run famdb.py on the RepeatMaskerLib.h5 to see if there are different entries there.

Aug 23 '23 00:08 JMStorer

Responding to JMStorer: Thank you. Running famdb.py on RepeatMaskerLib.h5 shows that that database, too, is empty below the root level. (The output looks identical to the Dfam.h5 output.) However, the main RepeatMasker directory includes many files called RepeatMaskerLib, including a database and a flat file.

[Screenshot from 2023-08-24 09-39-03]

The flat file is not empty of taxa (I see many eukaryotic taxa.) What is the "RepeatMaskerLib" collection, as in where did it come from? How would one direct RepeatMasker to search that library and not Dfam? There is no information about this in the help file.

Also, why is there a RepBase.embl file in the main directory? How would one use that file with RepeatMasker? Are there any instructions? I cannot find anything about this in the help file.

[Screenshot from 2023-08-24 09-40-38]

I asked my organization to inspect our installed Dfam database, and they determined they had neglected to install the supplement (full Dfam library) and will do so. I would prefer to use RepBase, however. My organization is not allowed to install RepBase for general use, but if I have a copy, they can configure RepeatMasker to use my copy or else help me execute RepeatMasker such that it searches my copy.

However, it is not at all clear how to actually use RepBase with RepeatMasker. The help materials contain contradictory information about installing and/or using RepBase. The INSTALL file instructs to install the RepBase RepeatMaskerEdition (the implication is that the program must then be configured with configure.pl to use RepeatMaskerEdition.) Contradicting this, the help file instructs not to use RepeatMaskerEdition but to use the RepBase fasta version with the -lib option (the option for custom libraries.)

But it is still not clear how to point the -lib flag to RepBase. I have a year-old copy of the RepBase flat fasta library. It appears to consist entirely of a set of taxon-specific fasta files inside a directory. Assuming my copy isn't missing anything, all of the files are in fasta format. I tried pointing -dir to one of the flat fasta files. The program complains that the hmm engine won't work with that file. But which engine am I supposed to use? There is nothing in the help file about this. I tried the -libdir option, pointing it to the full directory with all the fasta files. The program fails after about 600 lines, complaining that it "needs a path for a famdb file!" Whatever the program is expecting, it is not finding it in the RepBase fasta library.

The help file also does not provide any general information as to the expected format of a "custom library", other than to say the format should be fasta. But as we have seen, the program does not work if you give it a fasta file. I understand that RepeatModeler libraries will probably work, but some information about format requirements would be helpful. There are many sources of repeat libraries.

Is there anyone who knows how to run this program with RepBase?

Thanks for replying to my questions.

Aug 24 '23 17:08 bhdobrin

Hi @bhdobrin, let me see if I can help. There is quite a bit here, so let me start from the top. In the current release of Dfam (3.7) we do not have any TE families specific to Brassicaceae. (I am ignoring the nine cloning vector sequences that are used when looking for sequencing contamination.) For instance, the output of famdb.py in a RepeatMasker installation with the full Dfam 3.7 (curated + uncurated) is below. The number in brackets is the number of families that have been assigned to that taxon.

% cd RepeatMasker
% ./famdb.py lineage Brassicaceae -ad
1 root [9]                       <-----The nine families mentioned above.
└─131567 cellular organisms [0]
  └─2759 Eukaryota [0]
    └─33090 Viridiplantae [0]
      └─35493 Streptophyta [0]
        └─131221 Streptophytina [0]
          └─3193 Embryophyta [0]
            └─58023 Tracheophyta [0]
              └─78536 Euphyllophyta [0]
                └─58024 Spermatophyta [0]
                  └─3398 Magnoliopsida [0]
                    └─1437183 Mesangiospermae [0]
                      └─71240 eudicotyledons [0]
                        └─91827 Gunneridae [0]
                          └─1437201 Pentapetalae [0]
                            └─71275 rosids [0]
                              └─91836 malvids [0]
                                └─3699 Brassicales [0]
                                  └─3700 Brassicaceae [0]
                                    ├─980083 Camelineae [0]
                                    │ ├─3701 Arabidopsis [0]

Now, if I run the same query on a RepeatMasker installation with Dfam 3.7 + RepBase RepeatMasker Edition 20181026 I get:

% ./famdb.py lineage Brassicaceae -ad
1 root [9]
└─131567 cellular organisms [0]
  └─2759 Eukaryota [0]
    └─33090 Viridiplantae [2]
      └─35493 Streptophyta [0]
        └─131221 Streptophytina [0]
          └─3193 Embryophyta [25]
            └─58023 Tracheophyta [0]
              └─78536 Euphyllophyta [0]
                └─58024 Spermatophyta [0]
                  └─3398 Magnoliopsida [0]
                    └─1437183 Mesangiospermae [0]
                      └─71240 eudicotyledons [0]
                        └─91827 Gunneridae [0]
                          └─1437201 Pentapetalae [0]
                            └─71275 rosids [0]
                              └─91836 malvids [0]
                                └─3699 Brassicales [0]
                                  └─3700 Brassicaceae [0]
                                    ├─980083 Camelineae [0]
                                    │ ├─3701 Arabidopsis [454]
                                    │ │ ├─3702 Arabidopsis thaliana [68]
...
                                    │ │ ├─59689 Arabidopsis lyrata [167]
                                    │ │ │ ├─81972 Arabidopsis lyrata subsp. lyrata [268]
...
                                    │ │ │ ├─63677 Arabidopsis halleri subsp. gemmifera [1]
...
                                    │   └─3719 Capsella bursa-pastoris [1]
                                    │ └─50451 Arabis [1]
                                    │   ├─3708 Brassica napus [2]
                                    │   ├─3710 Brassica nigra [1]
                                    │   ├─3711 Brassica rapa [381]
...
                                    │   ├─3712 Brassica oleracea [23]
                                    │   └─52824 Brassica carinata [1]
...
                                        └─98039 Schrenkiella parvula [83]

So at this time your best bet is to use an installation with RepBase RepeatMaskerEdition installed.
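To check quickly whether an installed library contains anything for a lineage, the tree output can be filtered for taxa with a non-zero family count. The following is only a minimal sketch: it assumes the lineage output has been saved as text in the layout shown above (taxon ID, name, count in brackets), and the function name `taxa_with_families` is hypothetical, not part of famdb.py.

```python
import re

# One line of `famdb.py lineage` tree output as shown in this thread:
# optional tree-drawing prefix, NCBI taxon ID, taxon name, family count in brackets.
LINE_RE = re.compile(r"^[\s│├└─]*(\d+)\s+(.+?)\s+\[(\d+)\]")


def taxa_with_families(lineage_text):
    """Return (taxon_id, name, count) for every taxon with at least one family."""
    hits = []
    for line in lineage_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip "..." elisions, blank lines, and anything unrecognized
        taxon_id, name, count = match.groups()
        if int(count) > 0:
            hits.append((int(taxon_id), name, int(count)))
    return hits


if __name__ == "__main__":
    # A few lines copied from the output above, for illustration.
    sample = (
        "1 root [9]\n"
        "└─131567 cellular organisms [0]\n"
        "  └─2759 Eukaryota [0]\n"
        "    └─33090 Viridiplantae [2]\n"
        "│ ├─3701 Arabidopsis [454]\n"
    )
    for taxon_id, name, count in taxa_with_families(sample):
        print(f"{taxon_id}\t{name}\t{count}")
```

If the only line reported is `root`, as in the first listing, the installation has no families for the lineage and `-species` searches will find only simple repeats.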

Now for the installation questions. I am not sure where you found a help document that indicated you shouldn't use RepBase RepeatMaskerEdition (if you have a license for it); if you point us to it, I will make sure to remedy it. RepeatMasker has always supported the use of custom libraries through the "-lib" option. This option accepts a multi-record FASTA file as input and assumes that the user is providing a curated set of TE consensi that are specific to the organism being annotated. For instance, you wouldn't want to search Arabidopsis with the entire RepBase TE library, as this will produce many false positive matches to TE sequences from non-related organisms.

The RepBase RepeatMaskerEdition is a special release of RepBase that allowed a user to automatically annotate with the subset of TE families that are specific to the organism/clade specified using the "-species" option to RepeatMasker. This version of RepBase had to be installed specially using the "./configure" utility provided with RepeatMasker. We stopped producing this version of RepBase years ago when GIRI changed its licensing terms. So you have two options. You could install RepBase RepeatMaskerEdition using the license you have; then you could simply search using "./RepeatMasker -species Brassicaceae my_genome.fa". Or you could take the standard RepBase export (FASTA format) that it appears you have, identify by hand the TEs relevant to your search, and place that subset in a single FASTA library file; in that case you would search using "./RepeatMasker -lib my_library.fa my_genome.fa".
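For the second option, the hand-picked per-taxon FASTA files have to end up in one library file. A minimal sketch of that merge step follows. The function name `merge_fasta_files` and the file names are hypothetical, and it assumes the record ID is the first whitespace-delimited token of each header; when two files share an ID, only the first copy is kept.

```python
import tempfile
from pathlib import Path


def merge_fasta_files(input_paths, output_path):
    """Concatenate FASTA files into one library, keeping the first copy of each record ID.

    Returns the number of records written.
    """
    seen_ids = set()
    written = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            keep = False  # whether the record currently being read is kept
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        record_id = fields[0] if fields else ""
                        keep = record_id not in seen_ids
                        if keep:
                            seen_ids.add(record_id)
                            written += 1
                    if keep and line.strip():
                        # Normalize line endings so records never run together.
                        out.write(line.rstrip("\n") + "\n")
    return written


if __name__ == "__main__":
    # Hypothetical file names and sequences, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "taxon_a.fa"
        second = Path(tmp) / "taxon_b.fa"
        first.write_text(">FAM1\nACGTACGT\n>FAM2\nGGGGCCCC\n")
        second.write_text(">FAM2\nGGGGCCCC\n>FAM3\nTTTTAAAA\n")
        library = Path(tmp) / "my_library.fa"
        print(merge_fasta_files([first, second], library), "records written")
```

The resulting single file is what would be passed to "-lib" in place of my_library.fa.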

Here are answers to your other questions:

What is the "RepeatMaskerLib" collection?
This is a file auto-generated by the RepeatMasker "./configure" utility and is not meant to be used directly.

How would one direct RepeatMasker to search that library and not Dfam?
You wouldn't. This file is not meant to be used directly by users.

I tried pointing -dir to one of the flat fasta files.
The RepeatMasker "-dir" option specifies the directory in which to store the output files. It has no effect on the use of libraries with RepeatMasker.

The program complains that the hmm engine won't work with that file. But which engine am I supposed to use?
If you are using consensus sequences, your engine options are "crossmatch" and "ncbi" (for RMBlastN). If you are using profile Hidden Markov Models, then you would use "hmmer" as the engine. It appears that your installation defaults to using "hmmer" as the search engine. To switch to a consensus-based search engine, simply use "-engine ncbi" or "-engine crossmatch".

I tried the -libdir option, pointing it to the full directory with all the fasta files.
This option is provided for the Conda package maintainers and under normal circumstances is not used. If you want to use a custom library, simply create a FASTA file with all your families in one file and use "-lib my_library.fa".

Also, why is there a RepBase.embl file in the main directory?
Actually, if you look closely, the file is named "RepBaseEMBL.pm". This is part of the RepeatMasker codebase and not a data file.

The help file also does not provide any general information as to the expected format of a "custom library", other than to say the format should be fasta. But as we have seen, the program does not work if you give it a fasta file.
I am not aware of a problem running RepeatMasker with a custom library. If you have two FASTA files, one containing your TE consensi and the other containing genomic sequences, you should be able to annotate the genomic sequences using "./RepeatMasker -engine ncbi -lib my_library.fa my_genome.fa". If you get an error message from this, please post it here.
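When the run is launched from a script, the command quoted above can be assembled so that an HMM engine is never paired with a consensus FASTA library by accident. This is a sketch only: the helper name `build_repeatmasker_command` is hypothetical, the set of allowed engines is limited to the two named in this reply, and the executable is assumed to be on the PATH as "RepeatMasker".

```python
import shlex

# Engines named in the reply above as suitable for consensus (FASTA) libraries.
CONSENSUS_ENGINES = ("ncbi", "crossmatch")


def build_repeatmasker_command(library, genome, engine="ncbi"):
    """Return the argument list for a custom-library run with a consensus engine."""
    if engine not in CONSENSUS_ENGINES:
        raise ValueError(
            f"engine {engine!r} is not a consensus engine; choose one of {CONSENSUS_ENGINES}"
        )
    return ["RepeatMasker", "-engine", engine, "-lib", str(library), str(genome)]


if __name__ == "__main__":
    command = build_repeatmasker_command("my_library.fa", "my_genome.fa")
    # Print the command for inspection; pass the list to subprocess.run to execute it.
    print(shlex.join(command))
```

Keeping the arguments as a list avoids shell quoting problems if a path contains spaces.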

Aug 24 '23 19:08 rmhubley

Hi Dr. Hubley, Thank you for replying. I suppose I misread this paragraph:

[Screenshot from 2023-08-23 16-02-14]

It is possible to read "for newer versions of RepBase" as "At present", and to read into the second sentence that the Edition maintained through 2019 should no longer be used. The paragraph does not say that, and I admit I misread.

I did not describe quite correctly how I tried to use a fasta file as my repeat database. In the message above I meant to type "-lib" not "-dir". I was not aware the software supports a "-dir" flag and did not attempt to use it. When I used the command ".[...] -lib my_fasta_library.fasta [....]" I received the error about the search engine because, as you noted, the software tried to use the default hmmer engine. I had no knowledge of the ncbi option.

If I could ask two questions: I started RepeatMasker earlier (it is still running) with RepBase's A. thaliana-specific fasta file as my repeat library. Will the program build a .tbl classification file on its own, or should I generate one with buildSummary.pl? If I must use buildSummary, may I ask what would be appropriate arguments? The help instructs to use "-species my_taxon", but the species option is specific to non-fasta builds and I suspect I should instead use -lib and the name of my library.

I have the same question in the case where I merge the thaliana sister species lyrata (and maybe others in Brassicaceae) with thaliana sequences to make a "custom" library. For this library, will the program generate a summary table or should I build one with buildSummary.pl? (I expect the script will behave the same with the RepBase original file and modified file but thought I would ask.)

Thank you for your help. -B. Dobrin

Aug 25 '23 02:08 bhdobrin

@bhdobrin hi, has your problem been solved yet? I am having the same problem as you even after updating repbase database.

Nov 01 '23 01:11 zc1992zhoucong

Yes, for me the keys to make it work were:

1) use the -engine ncbi flag as in Dr. Hubley's recommendation above. This is apparently the engine suited to a flat file fasta library, or was for mine, and

2) format my custom library headers as described in the help file, where it says:

The recommended format for IDs in a custom library is:

repeatname#class/subclass or simply repeatname#class

I used a subset of the RepBase files (merged into a single file) and accepted their classifications, which differ from the RepeatMasker native classifications (or they may get them from Dfam, I'm not sure). I made the taxon the subclass, which worked fine for our purposes.

3) after the run, use buildSummary.pl on the *.out file to get the summary table. This summarizes from the classifications you have given it. If instead you wanted RepeatMasker to make the summary table during the run without running buildSummary, you'd have to rename all your classifiers to match their scheme. With buildSummary, you keep the scheme in your library.

I think the disconnect here is people don't know to use the ncbi engine and aren't aware the headers have to exactly match in format and content to get the table they're expecting at the run's end. I hope all of this works for you.
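The header step can be scripted. The sketch below rewrites headers into the repeatname#class/subclass form quoted above, using the taxon as the subclass as described. It assumes, and this should be checked against your own files, that each input header is tab-separated as name, class, taxon; the function names are hypothetical, and "Unknown" is only a placeholder class for headers that lack one.

```python
def repeatmasker_header(header_line):
    """Convert '>name<TAB>class<TAB>taxon' into '>name#class/taxon'.

    The input layout is an assumption. Spaces are replaced with underscores
    so the ID stays a single whitespace-free token.
    """
    fields = [f.strip() for f in header_line.rstrip("\n").lstrip(">").split("\t")]
    if not fields[0]:
        raise ValueError("FASTA header has no repeat name")
    name = fields[0].replace(" ", "_")
    # Placeholder class when the header carries none (an assumption).
    repeat_class = fields[1].replace(" ", "_") if len(fields) > 1 and fields[1] else "Unknown"
    if len(fields) > 2 and fields[2]:
        return f">{name}#{repeat_class}/{fields[2].replace(' ', '_')}"
    return f">{name}#{repeat_class}"


def reformat_library(lines):
    """Yield library lines with every header rewritten; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield repeatmasker_header(line) + "\n"
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = [">FAM1\tGypsy\tArabidopsis thaliana\n", "ACGTACGTACGT\n"]
    print("".join(reformat_library(example)), end="")
```

Because the classes are carried over unchanged from the source library, the summary would still come from buildSummary.pl on the *.out file, as in step 3.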

Nov 02 '23 17:11 bhdobrin