RepeatMasker ProcessRepeats run time/progress

Hi, ProcessRepeats crashed on our computing cluster with an out-of-memory error, so I'm now running ProcessRepeats separately on the final cat file (3.9 GB). However, there is no indication of progress or of how long it will take, and it has been running for more than 10 hours. How long does ProcessRepeats normally take, and is there any way to track its progress?

Thanks.

Jan 20 '21 23:01 gevro

However, there is no indication of progress or how long it takes and it has been running for > 10 hours.

10 hours is a very long time without any output. You should see a progress indication similar to this:

processing output: 
cycle 1 ..............................
cycle 2 ..............................
cycle 3 ................

Are you running ProcessRepeats directly, or did you redirect its output or run it on a job queue system or something else that might have intercepted the progress output?

Jan 21 '21 19:01 jebrosen

I'm running it simply like this: ProcessRepeats genome.fasta.cat.gz

I'm not seeing any output.

Running ProcessRepeats with no arguments shows that the tool itself works:

ProcessRepeats
No cat file indicated

NAME
    ProcessRepeats - Post process results from RepeatMasker and produce an
    annotation file.

SYNOPSIS
      ProcessRepeats [-options] <RepeatMasker *.cat file>

DESCRIPTION
    The options are:

    -h(elp)
        Detailed help

    -species <query species>
        Post process RepeatMasker results run on sequence from this species.
        Default is human.

    -lib <libfile>
        Skips most processing, does not produce a .tbl file unless the
        custome library is in the ">name#class" format.

    -nolow
        Does not display simple repeats or low_complexity DNA in the
        annotation.

    -noint
        Skips steps specific to interspersed repeats, saving lots of time.

    -lcambig
        Outputs ambiguous DNA transposon fragments using a lower case name.
        All other repeats are listed in upper case. Ambiguous fragments
        match multiple repeat elements and can only be called based on
        flanking repeat information.

    -u  Creates an untouched annotation file besides the manipulated file.

    -xm Creates an additional output file in cross_match format (for
        parsing).

    -ace
        Creates an additional output file in ACeDB format.

    -gff
        Creates an additional Gene Feature Finding format.

    -poly
        Creates an output file listing only potentially polymorphic simple
        repeats.

    -no_id
        Leaves out final column with unique number for each element (was
        default).

    -excln
        Calculates repeat densities excluding long stretches of Ns in the
        query.

    -orf2
        Results in sometimes negative coordinates for L1 elements; all L1
        subfamilies are aligned over the ORF2 region, sometimes improving
        interpretation of data.

    -a  Shows the alignments in a .align output file.

    -maskSource <originalSeqenceFile>
        Instructs ProcessRepeats to mask the sequence file using the
        annotation.

    -x  Mask repeats with a lower case 'x'.

    -xsmall
        Mask repeats by making the sequence lowercase.

SEE ALSO
        RepeatMasker, Crossmatch, Blast

COPYRIGHT
    Copyright 2002-2012 Arian Smit, Robert Hubley, Institute for Systems
    Biology

AUTHORS
    Arian Smit <[email protected]>

    Robert Hubley <[email protected]>

Jan 21 '21 19:01 gevro

Since you're not seeing any output yet, I assume it is still parsing the input file - you can verify this with e.g. top; you should see both ProcessRepeats and gzip running. Unfortunately I expect it to be fairly difficult to improve this situation; it's not easy to "guess" how long the input file will end up being.
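As a rough way to track the parsing stage, on Linux you can check how far the gzip process has read into the compressed .cat file. The sketch below is not part of RepeatMasker; it assumes a Linux /proc filesystem, that gzip opened the .cat.gz file directly (rather than reading it from a pipe), and that you own the process. The function names (parse_fdinfo_pos, read_fraction, progress_for_pid) are made up for illustration.

```python
import os
import sys


def parse_fdinfo_pos(fdinfo_text):
    """Return the byte offset from the 'pos:' line of /proc/<pid>/fdinfo/<fd>."""
    for line in fdinfo_text.splitlines():
        if line.startswith("pos:"):
            return int(line.split()[1])
    raise ValueError("no 'pos:' line found in fdinfo text")


def read_fraction(pos, total_size):
    """Fraction of the file consumed so far, clamped to the range 0.0..1.0."""
    if total_size <= 0:
        return 0.0
    return min(pos / total_size, 1.0)


def progress_for_pid(pid, path):
    """Fraction of `path` read by process `pid`, or None if it is not open there.

    Assumption: Linux only; relies on /proc/<pid>/fd and /proc/<pid>/fdinfo.
    """
    target = os.path.realpath(path)
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to whatever the descriptor refers to.
            if os.readlink(os.path.join(fd_dir, fd)) != target:
                continue
            with open(f"/proc/{pid}/fdinfo/{fd}") as fh:
                pos = parse_fdinfo_pos(fh.read())
        except OSError:
            # The descriptor may have been closed between listing and reading.
            continue
        return read_fraction(pos, os.path.getsize(target))
    return None


# Hypothetical command-line use: python gz_progress.py <gzip PID> <cat.gz path>
if __name__ == "__main__" and len(sys.argv) == 3:
    fraction = progress_for_pid(int(sys.argv[1]), sys.argv[2])
    if fraction is None:
        print("that process does not have the file open")
    else:
        print(f"{fraction:.1%} of the compressed file has been read")
```

You would pass the PID of the gzip process shown in top, plus the path to genome.fasta.cat.gz. The result only reflects how much of the compressed input has been read, so it is an approximate guide to the parsing stage and says nothing about the processing cycles that follow.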

@rmhubley do you happen to recall how long ProcessRepeats takes on, say, the human genome, which I see produced a ~8GB gzipped .cat file?

Jan 22 '21 18:01 jebrosen
