RepeatMasker
Which data should I use for forecasting
Hello, I have a question about using the software. I ran RepeatModeler and obtained a <Database>-families.fa file. When I wanted to run RepeatMasker, I found that the -species option already provides data for the same species as mine. But I can't use -lib and -species at the same time. In this case, which one should I use?
Hello! I have extracted the repetitive sequences of my species from Dfam with famdb.py; the result is about 800K in size. I can combine the RepeatModeler predictions with these extracted repeats using cat, which is no problem. But I would also like to add the RepBase repetitive sequences to form a relatively complete FASTA file. How can I convert the EMBL format to the FASTA format we want to use? Or is there another way to do this?
Hi! I think some of your question about -lib and -species has been answered before: https://github.com/rmhubley/RepeatMasker/issues/5#issuecomment-392877654. Does this help?
EMBL contains the same information as FASTA and more, so converting it is pretty straightforward with a program or script. One example with BioPython is demonstrated on http://sequenceconversion.bugaco.com/converter/biology/sequences/embl_to_fasta.php. However, I don't have any specific programs I would recommend for this off the top of my head. If you do find a method that works well, I encourage you to leave a note here in case it helps the next person with the same question!
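For anyone landing here with the same question, below is a minimal sketch of such a script using only the Python standard library. It assumes the usual EMBL flat-file layout (an ID line, an SQ line followed by sequence lines, and // closing each entry) and keeps only the entry name in the FASTA header. The function names are made up for this example, and it has not been checked against every RepBase release, so treat it as a starting point. Biopython's Bio.SeqIO.convert is the more robust route if Biopython is installed.

```python
import re
from typing import Iterable, Iterator, Tuple


def embl_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (entry_name, sequence) pairs from EMBL flat-file lines."""
    name, chunks, in_seq = None, [], False
    for line in lines:
        if line.startswith("ID"):
            # The first token after "ID" is the entry name; drop a trailing ';'.
            name = line[2:].split()[0].rstrip(";")
            chunks, in_seq = [], False
        elif line.startswith("SQ"):
            # Everything between the SQ line and "//" is sequence data.
            in_seq = True
        elif line.startswith("//"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks, in_seq = None, [], False
        elif in_seq:
            # Sequence lines carry spaces and a trailing position counter;
            # keep only the letters.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))


def embl_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Return FASTA text for all entries, wrapped at `width` columns."""
    out = []
    for name, seq in embl_records(lines):
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(f"{row}\n" for row in out)


# Example usage (file names are hypothetical):
# with open("repbase.embl") as src, open("repbase.fa", "w") as dst:
#     dst.write(embl_to_fasta(src))
```

Note that this writes plain headers only. If you rely on RepeatMasker's classification in the output, you may need to add the repeat class to each header yourself.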
Thank you very much for your reply! I used addRepBase.pl to merge the RepBase database with the Dfam database, found my species (pig) through famdb.py, and then extracted the pig repeat sequences. The result is about 1M. I then used cat to combine it with the RepeatModeler predictions into a new library. Is that ok?
This is a difficult question because it depends on your particular project or goals.
The answer I linked before suggests two separate RepeatMasker runs: one with -species, and a second run on top of that masked genome with your own -lib. With the -species run separately first, preference is given to the already known repeat families in Dfam and RepBase.
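As a rough illustration of the two-run approach, here is a small Python sketch that builds both command lines and can run them in order. The file names, the output directory layout, and the helper names are placeholders invented for this example. It assumes RepeatMasker is on your PATH and that the first run writes its masked sequence as <genome>.masked inside the directory given with -dir.

```python
import subprocess
from pathlib import Path


def two_pass_commands(genome, species, library, outdir="rm_out"):
    """Build the two RepeatMasker command lines without running them."""
    genome = Path(genome)
    pass1_dir = Path(outdir) / "pass1_species"
    pass2_dir = Path(outdir) / "pass2_lib"
    # Assumption: the first run leaves "<genome name>.masked" in its -dir.
    masked = pass1_dir / (genome.name + ".masked")
    first = ["RepeatMasker", "-species", species,
             "-dir", str(pass1_dir), str(genome)]
    second = ["RepeatMasker", "-lib", str(library),
              "-dir", str(pass2_dir), str(masked)]
    return first, second


def run_two_pass(genome, species, library, outdir="rm_out"):
    """Run the -species pass, then the -lib pass on the masked output."""
    for cmd in two_pass_commands(genome, species, library, outdir):
        # Create the output directory named after the -dir flag.
        Path(cmd[cmd.index("-dir") + 1]).mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)


# Example usage (all names are placeholders):
# run_two_pass("genome.fa", "pig", "mydb-families.fa")
```

Because the second run only sees sequence the first run left unmasked, the RepeatModeler families are used for whatever the known families did not already cover.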
If you combine the RepBase RepeatMasker Edition sequences with the sequences from RepeatModeler, and use the combined library with -lib, you will get somewhat different results. For example, any repeats that RepeatModeler discovered could be duplicates of something that was already in the library, so you might end up with very similar biological elements annotated as being one of two or more different things. You will also lose some of the fine-tuning that -species provides, which is adjusted for each TE family's relative age and diversity.
Thank you very much for your reply. It went as you predicted: I did get bad results. So if I use -species with the RepBase library and get good results, can I leave out the part predicted by RepeatModeler?
I am not sure I understand your question. You could use the family models from RepeatModeler for a second round of masking with -lib as I described above, if you wanted to. But depending on the exact species and the coverage already in RepBase RepeatMasker Edition, this might not be necessary.
Thank you for your patient and constructive reply. I will try different methods again to see the specific differences between different methods!