ProcessRepeats .out file error?
Hello,
I'm using RepeatMasker version 4.1.0. According to the manual at https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html, the .out file from ProcessRepeats should look as shown below, with the values in the 12th column in parentheses (representing the number of bases):
But when I ran ProcessRepeats, the resulting .out file appears to have output and formatting errors in the number of bases and in the start and end positions within the repeat (points 1 and 2 below). The .out file may also contain an incorrect repeat (point 3 below). I've listed the specific issues below and attached an image of the .out file.
- For lines 1 to 3, the parentheses are in column 14, but for lines 4 and 5 they are in column 12.
- There is a negative value (-2) for the start position on line 3. When I converted the .out file to .gff and ran gff3validator, it gave me an error saying the negative value on the start position is not correct.
- In the .cat file generated from compiling different repeats, there is no repeat for position 18017103 18017136 on PseudoCM012306.1_chromosome_1 (line 3 below). Rather, there is only a partial match for position 18017103: PseudoCM01230 180171039 TAAAAGGGGATTGGTTTTTTCCTGCGGTTCCAGTCTGTTTGTTGGAAGAG 180171088. So why does the .out file have a repeat at 18017103 18017136 on PseudoCM012306.1_chromosome_1?
Are these problems the result of bugs? If so, what can I do to get correct outputs? I came across these issues when trying to use a species-specific repeat file in Maker (after converting the .out file into Maker's .gff3 format), but despite several weeks of formatting efforts, I've not been able to get it to work. It is a recurring problem that many RepeatMasker users are facing, but no forum has been able to give a working solution. Since other GFFs from cufflinks and tophat work fine, it seems to me that the original .out file from RepeatMasker may be the issue here.
Thanks.
- For lines 1 to 3, the parentheses are in column 14, but for lines 4 and 5 they are in column 12.
This is normal in RepeatMasker's .out format. Matches to the positive (+) strand of the repeat model are formatted as model_start model_end (model_remaining), and matches to the reverse complement (C) are formatted as (model_remaining) model_end model_start.
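For anyone writing their own converter, here is a minimal sketch of reading those three columns according to the strand. It assumes the usual whitespace-separated .out layout, with the strand in the 9th column and the repeat positions in columns 12 to 14. The function name and the example lines are illustrative only and are not taken from the attached file.

```python
def parse_model_columns(line):
    """Return model_start, model_end and model_remaining for one .out data line.

    Assumes a whitespace-separated RepeatMasker .out line with the strand in
    the 9th column and the repeat positions in columns 12 to 14.
    """
    fields = line.split()
    strand = fields[8]
    a, b, c = fields[11:14]
    if strand == "+":
        # Forward match: model_start model_end (model_remaining)
        start, end, remaining = a, b, c
    elif strand == "C":
        # Reverse-complement match: (model_remaining) model_end model_start
        remaining, end, start = a, b, c
    else:
        raise ValueError(f"unexpected strand value: {strand!r}")
    return {
        "strand": strand,
        "model_start": int(start),
        "model_end": int(end),
        "model_remaining": int(remaining.strip("()")),
    }


# Illustrative lines only: the same hit written once per strand.
plus_line = "1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) + MER7A DNA/MER2_type 103 336 (0) 1"
comp_line = "1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 336 103 1"

print(parse_model_columns(plus_line))
print(parse_model_columns(comp_line))
```

Both calls report the same start, end and remaining values, which is why the parentheses legitimately move between column 12 and column 14.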
There is a negative value (-2) for the start position on line 3.
In the .cat file generated from compiling different repeats, there is no repeat for position 18017103 18017136 on PseudoCM012306.1_chromosome_1 (line 3 below). So why does the .out file have a repeat at 18017103 18017136 on PseudoCM012306.1_chromosome_1?
One of the main differences between the .cat file and the .out file is postprocessing and coordinate adjustment for some groups and known elements. So there is probably an annotation in the .cat file that is close to that position but not exactly at it. Line 3 in particular does look wrong, since it has apparently been "overextended" past the 5' end of the repeat family, which might indicate a bug in RepeatMasker's postprocessing.
Are these problems the result of bugs? If so, what can I do to get correct outputs?
That particular output line does look like a bug. To help us reproduce and track down the cause of the problem in RepeatMasker, would you be able to send or attach the .cat file (preferably compressed) and the library file used with the -lib option?
As a temporary workaround, you could also replace small negative values in that column with 1 to make the output valid (though the annotation may then be off by a few base pairs).
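A rough sketch of that workaround in Python, to be run over the .out file before converting it to GFF. Everything here is an assumption rather than part of RepeatMasker: the function name, the `max_fix` threshold for what counts as a "small" negative value, and the example line are all hypothetical. Note that a corrected line is rejoined with single spaces, so its column alignment is lost.

```python
def clamp_negative_model_positions(line, max_fix=10):
    """Replace small negative repeat positions (columns 12 to 14) with 1.

    Header lines, blank lines and lines that need no change are returned
    untouched. max_fix is a hypothetical safety limit: a value below
    -max_fix raises an error instead of being silently rewritten.
    """
    fields = line.split()
    # Data lines start with the integer Smith-Waterman score.
    if len(fields) < 15 or not fields[0].isdigit():
        return line
    changed = False
    for i in (11, 12, 13):
        value = fields[i]
        if value.startswith("("):
            continue  # the parenthesised "remaining" column is left alone
        number = int(value)
        if number < 0:
            if number < -max_fix:
                raise ValueError(f"suspiciously large negative value: {value}")
            fields[i] = "1"
            changed = True
    return " ".join(fields) if changed else line


# Hypothetical data line with a negative start position in the repeat.
bad_line = "239 20.0 0.0 0.0 seq1 1000 1033 (5000) + rnd-1_family-1 Unknown -2 32 (500) 7"
print(clamp_negative_model_positions(bad_line))
```

Raising an error on large negative values is a deliberate choice in this sketch, so that a badly wrong coordinate gets looked at rather than quietly turned into 1.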
I came across these issues when trying to use a species-specific repeat file in Maker (after converting the .out file into Maker's .gff3 format), but despite several weeks of formatting efforts, I've not been able to get it to work. It is a recurring problem that many RepeatMasker users are facing, but no forum has been able to give a working solution.
I am sorry to hear you have not been able to get it to work. This is also the first time that I have heard of this issue with negative coordinates, so it's disconcerting to hear that many users are repeatedly facing this. I hope we will be able to find and fix the problem soon.
Hi Jeb,
What email should I send the files to?
Thanks
@githubgig You can send files to me at [email protected], or Robert at [email protected]. The files might be too big for an email attachment; if that is the case we can also download files via Google Drive, box.com, FTP, etc.
Thank you for sending those! I have a few ideas of why these negative coordinates might have been produced, and I will reach out again here and/or by email when we know more.