RepeatMasker
instability in repeat names/class/family
We're seeing instability in repeat names and class/family for otherwise identical repeats.
Here's a partial diff of two runs on the same sequence (GenBank GL456393.1, from mouse). Both runs used RepeatMasker version open-4.0.8 with RepeatMasker Combined Database: Dfam_Consensus-20181026, RepBase-20181026.
5363322,5363336c5363322,5363336
< 182 23.500 0.000 6.400 GL456393.1 20035 20084 (35627) + AmnSINE1 SINE/5S-Deu-L2 159 205 (370) 4747269
< 12489 9.800 1.000 0.100 GL456393.1 26166 28006 (27705) + L1MdF_V LINE/L1 3043 4901 (1675) 4747270
< 997 14.000 0.000 0.000 GL456393.1 27992 28148 (27563) + L1MdF_V LINE/L1 5429 5585 (994) 4747270 *
< 7239 8.400 0.700 1.100 GL456393.1 28151 29142 (26569) + L1MdF_V LINE/L1 5588 6575 (4) 4747270
< 13 27.200 0.000 4.000 GL456393.1 39513 39564 (16147) + (ACAGT)n Simple_repeat 1 50 (0) 4747271
< 12 22.500 1.900 6.000 GL456393.1 44988 45039 (10672) + (ACAGT)n Simple_repeat 1 50 (0) 4747272
< 2712 5.200 1.200 6.200 GL456393.1 46531 46939 (8772) + RLTR10 LTR/ERVK 1 390 (0) 4747273
< 4590 17.600 1.300 0.900 GL456393.1 46940 47633 (8078) + RLTR10-int LTR/ERVK 302 998 (989) 4747274
< 373 25.800 3.600 3.600 GL456393.1 47691 47827 (7884) + RLTR10-int LTR/ERVK 1458 1594 (5987) 4747274
< 272 26.900 0.000 0.000 GL456393.1 47866 47932 (7779) + RLTR10-int LTR/ERVK 1923 1989 (4167) 4747274 *
< 8521 1.700 0.000 0.400 GL456393.1 47910 48908 (6803) + RLTR10-int LTR/ERVK 990 1984 (3) 4747275
< 2815 2.400 0.200 21.000 GL456393.1 48911 49381 (6330) + RLTR10 LTR/ERVK 1 390 (0) 4747275
< 2072 2.700 1.600 0.000 GL456393.1 49392 49648 (6063) + RLTR10-int LTR/ERVK 41 301 (1686) 4747275
< 38 5.900 5.600 1.300 GL456393.1 53036 53107 (2604) + (ACTGTGACACC)n Simple_repeat 1 75 (0) 4747276
< 13 30.100 1.400 4.200 GL456393.1 54216 54289 (1422) + (ACAGT)n Simple_repeat 1 72 (0) 4747277
---
> 182 23.500 0.000 6.400 GL456393.1 20035 20084 (35627) + GSAT_MM Satellite/5S-Deu-L2 159 205 (370) 4747269
> 12489 9.800 1.000 0.100 GL456393.1 26166 28006 (27705) + GSAT_MM Satellite/L1 3043 4901 (1675) 4747270
> 997 14.000 0.000 0.000 GL456393.1 27992 28148 (27563) + GSAT_MM Satellite/L1 5429 5585 (994) 4747270 *
> 7239 8.400 0.700 1.100 GL456393.1 28151 29142 (26569) + GSAT_MM Satellite/L1 5588 6575 (4) 4747270
> 13 27.200 0.000 4.000 GL456393.1 39513 39564 (16147) + (ACAGT)n Satellite/Simple_repeat 1 50 (0) 4747271
> 12 22.500 1.900 6.000 GL456393.1 44988 45039 (10672) + (ACAGT)n Satellite/Simple_repeat 1 50 (0) 4747272
> 2712 5.200 1.200 6.200 GL456393.1 46531 46939 (8772) + GSAT_MM Satellite/ERVK 1 390 (0) 4747273
> 4590 17.600 1.300 0.900 GL456393.1 46940 47633 (8078) + GSAT_MM Satellite/ERVK 302 998 (989) 4747274
> 373 25.800 3.600 3.600 GL456393.1 47691 47827 (7884) + GSAT_MM Satellite/ERVK 1458 1594 (5987) 4747274
> 272 26.900 0.000 0.000 GL456393.1 47866 47932 (7779) + GSAT_MM Satellite/ERVK 1923 1989 (4167) 4747274 *
> 8521 1.700 0.000 0.400 GL456393.1 47910 48908 (6803) + GSAT_MM Satellite/ERVK 990 1984 (3) 4747275
> 2815 2.400 0.200 21.000 GL456393.1 48911 49381 (6330) + GSAT_MM Satellite/ERVK 1 390 (0) 4747275
> 2072 2.700 1.600 0.000 GL456393.1 49392 49648 (6063) + GSAT_MM Satellite/ERVK 41 301 (1686) 4747275
> 38 5.900 5.600 1.300 GL456393.1 53036 53107 (2604) + (ACTGTGACACC)n Satellite/Simple_repeat 1 75 (0) 4747276
> 13 30.100 1.400 4.200 GL456393.1 54216 54289 (1422) + (ACAGT)n Satellite/Simple_repeat 1 72 (0) 4747277
The only differences in these lines are the repeat names and the class/family. The spans and scores are unchanged.
repeat class/family
5363322,5363336c5363322,5363336
< AmnSINE1 SINE/5S-Deu-L2
< L1MdF_V LINE/L1
< L1MdF_V LINE/L1
< L1MdF_V LINE/L1
< (ACAGT)n Simple_repeat
< (ACAGT)n Simple_repeat
< RLTR10 LTR/ERVK
< RLTR10-int LTR/ERVK
< RLTR10-int LTR/ERVK
< RLTR10-int LTR/ERVK
< RLTR10-int LTR/ERVK
< RLTR10 LTR/ERVK
< RLTR10-int LTR/ERVK
< (ACTGTGACACC)n Simple_repeat
< (ACAGT)n Simple_repeat
---
> GSAT_MM Satellite/5S-Deu-L2
> GSAT_MM Satellite/L1
> GSAT_MM Satellite/L1
> GSAT_MM Satellite/L1
> (ACAGT)n Satellite/Simple_repeat
> (ACAGT)n Satellite/Simple_repeat
> GSAT_MM Satellite/ERVK
> GSAT_MM Satellite/ERVK
> GSAT_MM Satellite/ERVK
> GSAT_MM Satellite/ERVK
> GSAT_MM Satellite/ERVK
> GSAT_MM Satellite/ERVK
> GSAT_MM Satellite/ERVK
> (ACTGTGACACC)n Satellite/Simple_repeat
> (ACAGT)n Satellite/Simple_repeat
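The claim that only the name and class/family columns changed can be checked mechanically rather than by eye. Below is a minimal sketch, assuming the usual whitespace-separated 15-column `.out` layout with an optional trailing `*`; the field names and the helpers `parse_out_line` and `differing_fields` are made up for illustration and are not part of RepeatMasker.

```python
# Hypothetical field names for the 15 whitespace-separated .out columns.
# The three repeat-position columns are named generically because their
# order differs between "+" and "C" strand records.
FIELDS = [
    "score", "perc_div", "perc_del", "perc_ins",
    "query", "q_begin", "q_end", "q_left", "strand",
    "repeat_name", "class_family",
    "r_pos1", "r_pos2", "r_pos3", "id",
]


def parse_out_line(line):
    """Parse one annotation line, tolerating a leading diff marker."""
    tokens = line.split()
    if tokens and tokens[0] in ("<", ">"):
        tokens = tokens[1:]  # drop the "<" or ">" added by diff
    if len(tokens) < len(FIELDS):
        raise ValueError(f"expected at least {len(FIELDS)} columns: {line!r}")
    record = dict(zip(FIELDS, tokens))
    # A trailing "*" marks a match overlapped by a higher-scoring match.
    record["overlap_flag"] = len(tokens) > len(FIELDS) and tokens[len(FIELDS)] == "*"
    return record


def differing_fields(old_line, new_line):
    """Return the set of field names whose values differ between two lines."""
    old, new = parse_out_line(old_line), parse_out_line(new_line)
    return {name for name in old if old[name] != new[name]}


if __name__ == "__main__":
    old = "< 182 23.500 0.000 6.400 GL456393.1 20035 20084 (35627) + AmnSINE1 SINE/5S-Deu-L2 159 205 (370) 4747269"
    new = "> 182 23.500 0.000 6.400 GL456393.1 20035 20084 (35627) + GSAT_MM Satellite/5S-Deu-L2 159 205 (370) 4747269"
    print(sorted(differing_fields(old, new)))  # ['class_family', 'repeat_name']
```

Pairing up the `<` and `>` lines of a diff hunk and running `differing_fields` on each pair would confirm whether coordinates, scores, and IDs really are untouched across the whole hunk.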
A LINE, SINE, or LTR may well contain sequence that matches simple repeats, but we wouldn't expect a block of 15 consecutive repeat spans to all switch to the same class (Satellite) while still being reported with their original family.
RepeatMasker works fine when run on this one sequence in isolation; we only see the instability when running many sequences from a genome assembly. What might be causing this problem?
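Since a second run isn't always available to diff against, one way to spot this pattern in a single output is to look for repeat names that appear with more than one class/family. This is a rough sketch that rests on an assumption we haven't verified: that each repeat name should map to exactly one class/family within a run. The function `inconsistent_names` is hypothetical, not a RepeatMasker utility.

```python
from collections import defaultdict


def inconsistent_names(lines):
    """Map each repeat name seen with more than one class/family to those values.

    Assumes whitespace-separated .out records with the repeat name in
    column 10 and class/family in column 11 (1-based). Header and blank
    lines are skipped because their first token is not an integer score.
    """
    seen = defaultdict(set)
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] in ("<", ">"):
            tokens = tokens[1:]  # tolerate diff-style prefixes
        if len(tokens) < 15 or not tokens[0].isdigit():
            continue
        seen[tokens[9]].add(tokens[10])
    return {
        name: sorted(classes)
        for name, classes in seen.items()
        if len(classes) > 1
    }
```

Run over the lines of an output file (for example, `inconsistent_names(open(path))` with `path` pointing at a `.out` file), an affected block like the one above would surface `GSAT_MM` with several unrelated families. A name that is consistently mislabeled everywhere in the file would not be caught by this check.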
That's a bizarre result, and I suspect some sort of post-processing bug caused it. The scores, divergences, and indel stats are all identical between the two runs. Can you reproduce this with a limited set of input sequences?
We're trying to work up a more controlled test scenario to reproduce this. We've been having other stability issues with RepeatMasker as well: silent failures that occur stochastically on some sequences and show up as non-reproducible output on those same sequences. Have you ever heard of problems like that? We keep increasing the available memory, which hasn't helped, although the behavior is still suggestive of a memory issue.