RepeatMasker
ProcessRepeats run time/progress
Hi, ProcessRepeats crashed on our computing cluster with an out-of-memory error, so I'm now running ProcessRepeats separately on the final cat file (3.9 GB). However, there is no indication of progress or of how long it will take, and it has been running for > 10 hours. How long does ProcessRepeats normally take, and is there any way to track its progress?
Thanks.
However, there is no indication of progress or how long it takes and it has been running for > 10 hours.
10 hours is a very long time without any output. You should see a progress indication similar to this:
processing output:
cycle 1 ..............................
cycle 2 ..............................
cycle 3 ................
Are you running ProcessRepeats directly, or did you redirect its output or run it on a job queue system or something else that might have intercepted the progress output?
I'm running it simply like this: ProcessRepeats genome.fasta.cat.gz
I'm not seeing any output.
Just running ProcessRepeats shows that the tool is ok:
ProcessRepeats
No cat file indicated
NAME
ProcessRepeats - Post process results from RepeatMasker and produce an
annotation file.
SYNOPSIS
ProcessRepeats [-options] <RepeatMasker *.cat file>
DESCRIPTION
The options are:
-h(elp)
Detailed help
-species <query species>
Post process RepeatMasker results run on sequence from this species.
Default is human.
-lib <libfile>
Skips most processing, does not produce a .tbl file unless the
custome library is in the ">name#class" format.
-nolow
Does not display simple repeats or low_complexity DNA in the
annotation.
-noint
Skips steps specific to interspersed repeats, saving lots of time.
-lcambig
Outputs ambiguous DNA transposon fragments using a lower case name.
All other repeats are listed in upper case. Ambiguous fragments
match multiple repeat elements and can only be called based on
flanking repeat information.
-u Creates an untouched annotation file besides the manipulated file.
-xm Creates an additional output file in cross_match format (for
parsing).
-ace
Creates an additional output file in ACeDB format.
-gff
Creates an additional Gene Feature Finding format.
-poly
Creates an output file listing only potentially polymorphic simple
repeats.
-no_id
Leaves out final column with unique number for each element (was
default).
-excln
Calculates repeat densities excluding long stretches of Ns in the
query.
-orf2
Results in sometimes negative coordinates for L1 elements; all L1
subfamilies are aligned over the ORF2 region, sometimes improving
interpretation of data.
-a Shows the alignments in a .align output file.
-maskSource <originalSeqenceFile>
Instructs ProcessRepeats to mask the sequence file using the
annotation.
-x Mask repeats with a lower case 'x'.
-xsmall
Mask repeats by making the sequence lowercase.
SEE ALSO
RepeatMasker, Crossmatch, Blast
COPYRIGHT
Copyright 2002-2012 Arian Smit, Robert Hubley, Institute for Systems
Biology
AUTHORS
Arian Smit <[email protected]>
Robert Hubley <[email protected]>
Since you're not seeing any output yet, I assume it is still parsing the input file. You can verify this with e.g. top; you should see both ProcessRepeats and gzip running. Unfortunately I expect it to be fairly difficult to improve this situation; it's not easy to "guess" how long the input file will end up being.
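In the meantime, one rough way to see how far the parsing stage has got is to check how much of the compressed .cat.gz file the gzip process has read so far. The sketch below is not part of RepeatMasker; it assumes a Linux system with /proc, that the gzip child holds the .cat.gz file open directly, and that you have permission to inspect that process. The function names are made up for illustration. The offset refers to the compressed file, so it only approximates progress through the input and says nothing about the later "cycle" steps.

```python
import os


def parse_pos(fdinfo_text):
    """Return the byte offset from the text of /proc/<pid>/fdinfo/<fd>."""
    for line in fdinfo_text.splitlines():
        if line.startswith("pos:"):
            return int(line.split()[1])
    raise ValueError("no 'pos:' line found")


def percent_done(offset, total_size):
    """Offset as a percentage of the file size (0.0 for an empty file)."""
    return 100.0 * offset / total_size if total_size else 0.0


def read_offsets(pid, target_path):
    """Return (fd, offset) pairs for descriptors of `pid` open on target_path.

    Linux-only: relies on /proc/<pid>/fd and /proc/<pid>/fdinfo.
    """
    target = os.path.realpath(target_path)
    fd_dir = f"/proc/{pid}/fd"
    results = []
    for fd in os.listdir(fd_dir):
        try:
            if os.readlink(os.path.join(fd_dir, fd)) != target:
                continue
            with open(f"/proc/{pid}/fdinfo/{fd}") as fh:
                results.append((int(fd), parse_pos(fh.read())))
        except OSError:
            # Descriptor was closed in the meantime, or is not readable.
            continue
    return results


def report(pid, target_path):
    """Print how much of target_path the process `pid` has read so far."""
    size = os.path.getsize(target_path)
    for fd, offset in read_offsets(pid, target_path):
        pct = percent_done(offset, size)
        print(f"fd {fd}: {offset} / {size} bytes ({pct:.1f}%)")
```

To use it, take the PID of the gzip process from top and call, for example, `report(12345, "genome.fasta.cat.gz")`, where 12345 is a placeholder PID. Calling it twice a few minutes apart gives a crude estimate of the read rate.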
@rmhubley do you happen to recall how long ProcessRepeats takes on, say, the human genome, which I see produced a ~8GB gzipped .cat file?