Question about Repeat Libraries format?

Open CSU-KangHu opened this issue 4 years ago • 3 comments

RepeatMasker uses Dfam and RepBase libraries to mask the input reference. I have some questions:

(1) RepeatMasker has a -lib parameter to use a custom library. Is this library in FASTA format? Does it have any specific format requirements for its header? In the previous version, we extracted library sequences by ./queryRepeatDatabase.pl -species Drosophila melanogaster. The header of the library is like >pbum-alpha2beta#satellite repbaseid: pbum-alpha2betaxx. Can you explain the meaning of each item? In the current version, we extracted library sequences by ./famdb.py families --format fasta_acc -ad --curated 'Drosophila melanogaster'. The format of the header is >DF0000772.5 name=LSU-rRNA_Hsa @Metazoa [S:55]. Can you explain the meanings? At the same time, why is the size of the library files taken out inconsistent? The library extracted by previous version is 7.2MB, while the current version is 730KB.

(2) RepeatMasker has a -species parameter to use its built-in library. Is this also a library that uses FASTA format? Is there any difference between the format of this library and that of a custom library?

(3) Dfam library is H5 format, and RepBase library is EMBL format. Are there any differences between the two formats? Is there a standard repeat sequences library format?

CSU-KangHu avatar Jun 21 '21 03:06 CSU-KangHu

(1) RepeatMasker has a -lib parameter to use a custom library. Is this library in FASTA format?

Yes; the format is described further in the repeatmasker.help file (link: https://github.com/rmhubley/RepeatMasker/blob/a58f3130a4fedb7784171a539052277d2cccc690/repeatmasker.help#L227).

The -lib option can also be used with a custom .hmm file for the HMMER search engine.
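To make the header question concrete, here is a minimal Python sketch that splits a library header into an identifier and a free-text description, and then splits the identifier on `#` and `/`. It assumes the `name#class/subclass` identifier convention, which is my reading of the custom-library section of repeatmasker.help rather than something stated in this thread; `RepeatHeader` and `parse_header` are hypothetical names invented for the illustration. The two sample headers are the ones quoted in the question.

```python
from typing import NamedTuple, Optional


class RepeatHeader(NamedTuple):
    """Pieces of one FASTA header line (hypothetical helper type)."""
    name: str
    repeat_class: Optional[str]
    subclass: Optional[str]
    description: str


def parse_header(line: str) -> RepeatHeader:
    """Split a header such as '>name#class/subclass free text'.

    Assumption: the identifier is the first whitespace-delimited token,
    and any classification follows a '#' inside that token.
    """
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    parts = line[1:].split(maxsplit=1)
    if not parts:
        raise ValueError("header has no identifier")
    ident = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    # 'name#class/subclass' -> name, class, subclass (class parts optional)
    name, _, classification = ident.partition("#")
    repeat_class, _, subclass = classification.partition("/")
    return RepeatHeader(name, repeat_class or None, subclass or None, description)


old_style = parse_header(">pbum-alpha2beta#satellite repbaseid: pbum-alpha2betaxx")
new_style = parse_header(">DF0000772.5 name=LSU-rRNA_Hsa @Metazoa [S:55]")
print(old_style)
print(new_style)
```

Under this assumption, the first header carries a class (`satellite`) in its identifier, while the `fasta_acc` header has no `#` in its identifier, so no class can be recovered from that token.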

we extracted library sequences by ./famdb.py families --format fasta_acc -ad --curated 'Drosophila melanogaster'

There was a mistake in early versions of the instructions in repeatmasker.help; this command would be better: ./famdb.py families --format fasta_name -ad --curated --include-class-in-name 'Drosophila melanogaster' (note especially the added --include-class-in-name).

At the same time, why is the size of the library files taken out inconsistent? The library extracted by previous version is 7.2MB, while the current version is 730KB.

The two libraries are actually different: ./queryRepeatDatabase.pl -species Drosophila melanogaster is the same as ./queryRepeatDatabase.pl -species Drosophila. Quotes are needed to treat the two words as a single unit: ./queryRepeatDatabase.pl -species 'Drosophila melanogaster'.
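The quoting issue can be checked without running RepeatMasker at all. The sketch below uses Python's standard `shlex.split`, which tokenizes a string using POSIX shell quoting rules, to show what arguments the script receives in each case. It only illustrates word splitting; how the script treats the extra word is as described above, and `shell_words` is a hypothetical helper name.

```python
import shlex
from typing import List


def shell_words(command: str) -> List[str]:
    """Return the argument list a POSIX shell would pass for this command line."""
    return shlex.split(command)


unquoted = shell_words("./queryRepeatDatabase.pl -species Drosophila melanogaster")
quoted = shell_words("./queryRepeatDatabase.pl -species 'Drosophila melanogaster'")

print(unquoted)
# ['./queryRepeatDatabase.pl', '-species', 'Drosophila', 'melanogaster']
print(quoted)
# ['./queryRepeatDatabase.pl', '-species', 'Drosophila melanogaster']
```

Without quotes, `melanogaster` arrives as a separate argument, so the value of `-species` is just `Drosophila`.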

In a future update, famdb.py will let you skip the quotes because that is more convenient.

(2) RepeatMasker has a -species parameter to use its built-in library. Is this also a library that uses FASTA format? Is there any difference between the format of this library and that of a custom library?

The included library (Dfam) now uses the FamDB format, which supports consensus sequences and pHMMs in the same file and includes indexes for faster retrieval of entries for particular species/clades. Libraries are extracted from this file format and converted to either FASTA or HMM format depending on which search engine is being used; the FASTA format is similar to a custom library.

(3) Dfam library is H5 format, and RepBase library is EMBL format. Are there any differences between the two formats? Is there a standard repeat sequences library format?

FamDB (based on HDF5) differs from EMBL in that it also stores pHMMs in the same file. FamDB is also optimized for searching by species or clade (e.g. for RepeatMasker's -species option), reading only the data it needs instead of the entire data file. This is an important feature for the Dfam database, which is growing to include more species and a greater diversity of species.

jebrosen avatar Jun 22 '21 17:06 jebrosen

hi, I wonder if anyone knows where this file queryrepeatdatabase.pl is?

jingydz avatar Dec 04 '21 06:12 jingydz

@jingydz In previous versions of RepeatMasker, queryRepeatDatabase.pl was located in the RepeatMasker/util/ directory. RepeatMasker 4.1.1 and later use a new library file format; RepeatMasker/famdb.py can be used to access this library as queryRepeatDatabase.pl did in the past.

jebrosen avatar Dec 07 '21 16:12 jebrosen