RepeatMasker
ProcessRepeats generates empty/blank "ID" values, which causes other errors, e.g. RM2Bed.py "invalid literal"
RM2Bed.py (v4.1.2) fails with an "invalid literal" error on RepeatMasker output for Mmul10.fa (https://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/bigZips/rheMac10.fa.gz) run with the "human" library. I have attached the .out file from a RepeatMasker run on chr2 of the Mmul10 genome; the run itself completed successfully, but its output triggers this issue.
The easiest way to reproduce this is with
RM2Bed.py 12-of-23.fa.out.gz
Python produces the following error.
ValueError: invalid literal for int() with base 10: '*'
RepeatMasker version 4.1.2 was installed with bioconda on the following operating system:
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Codename: Core
Thanks for reporting this problem. Something strange has happened in this output file; some IDs are missing. The first missing ID is on line 3197 of the (uncompressed) file; there seems to be a particularly tricky SVA element there. For a "quick and dirty" fix, I suggest hand-editing that and the nearby lines to put an ID in - perhaps 2673 to match the other nearby SVA fragments.
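To locate every affected line before hand-editing, a short script along these lines may help. It is only a sketch, not part of RepeatMasker: the helper names `open_maybe_gzip` and `find_missing_ids` are made up for illustration, and it assumes the usual whitespace-separated .out layout in which the ID is the 15th column, optionally followed by a trailing `*`.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def find_missing_ids(lines):
    """Yield (line_number, text) for data lines whose ID column is not an integer.

    Assumption: data lines start with an integer SW score and carry the ID
    in the 15th whitespace-separated field; header and blank lines are skipped.
    """
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        # Header and blank lines do not start with an integer score.
        if not fields or not fields[0].isdigit():
            continue
        # When the ID is missing, the 15th field is absent or holds the
        # trailing '*' marker instead of a number.
        if len(fields) < 15 or not fields[14].isdigit():
            yield lineno, line.rstrip("\n")
```

For example, iterating over `find_missing_ids(open_maybe_gzip("12-of-23.fa.out.gz"))` and printing each result should list the lines that need an ID added by hand.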
The cause of this problem was in RepeatMasker, specifically the ProcessRepeats program. It should never produce lines with "missing" IDs such as this. Do you have a corresponding .cat file for this output? If you can send it to us, it should help us to troubleshoot the issue more thoroughly.
Yep. You can find the .cat file here: https://eichlerlab.gs.washington.edu/public/wharvey/RM_test/12-of-23.fa.cat.gz. It is too large to attach here.
Thanks!
Thank you! I have successfully reproduced the problem with this input file, and I expect that it should help immensely in narrowing down the cause of the error.
This problem has been identified and should be fixed in the upcoming 4.1.3 release.