instability in repeat names/class/family

Open murphyte opened this issue 4 years ago • 2 comments

We're seeing instability in repeat names and class/family for otherwise identical repeats.

Here's a partial diff of two runs on the same sequence (GenBank GL456393.1, from mouse). Both runs used RepeatMasker version open-4.0.8 with the RepeatMasker Combined Database (Dfam_Consensus-20181026, RepBase-20181026):

5363322,5363336c5363322,5363336
<   182   23.500   0.000   6.400  GL456393.1           20035      20084     (35627) + AmnSINE1       SINE/5S-Deu-L2    159    205  (370) 4747269
< 12489    9.800   1.000   0.100  GL456393.1           26166      28006     (27705) + L1MdF_V        LINE/L1          3043   4901 (1675) 4747270
<   997   14.000   0.000   0.000  GL456393.1           27992      28148     (27563) + L1MdF_V        LINE/L1          5429   5585  (994) 4747270 *
<  7239    8.400   0.700   1.100  GL456393.1           28151      29142     (26569) + L1MdF_V        LINE/L1          5588   6575    (4) 4747270
<    13   27.200   0.000   4.000  GL456393.1           39513      39564     (16147) + (ACAGT)n       Simple_repeat       1     50    (0) 4747271
<    12   22.500   1.900   6.000  GL456393.1           44988      45039     (10672) + (ACAGT)n       Simple_repeat       1     50    (0) 4747272
<  2712    5.200   1.200   6.200  GL456393.1           46531      46939      (8772) + RLTR10         LTR/ERVK            1    390    (0) 4747273
<  4590   17.600   1.300   0.900  GL456393.1           46940      47633      (8078) + RLTR10-int     LTR/ERVK          302    998  (989) 4747274
<   373   25.800   3.600   3.600  GL456393.1           47691      47827      (7884) + RLTR10-int     LTR/ERVK         1458   1594 (5987) 4747274
<   272   26.900   0.000   0.000  GL456393.1           47866      47932      (7779) + RLTR10-int     LTR/ERVK         1923   1989 (4167) 4747274 *
<  8521    1.700   0.000   0.400  GL456393.1           47910      48908      (6803) + RLTR10-int     LTR/ERVK          990   1984    (3) 4747275
<  2815    2.400   0.200  21.000  GL456393.1           48911      49381      (6330) + RLTR10         LTR/ERVK            1    390    (0) 4747275
<  2072    2.700   1.600   0.000  GL456393.1           49392      49648      (6063) + RLTR10-int     LTR/ERVK           41    301 (1686) 4747275
<    38    5.900   5.600   1.300  GL456393.1           53036      53107      (2604) + (ACTGTGACACC)n Simple_repeat       1     75    (0) 4747276
<    13   30.100   1.400   4.200  GL456393.1           54216      54289      (1422) + (ACAGT)n       Simple_repeat       1     72    (0) 4747277
---
>   182   23.500   0.000   6.400  GL456393.1           20035      20084     (35627) + GSAT_MM        Satellite/5S-Deu-L2    159    205  (370) 4747269
> 12489    9.800   1.000   0.100  GL456393.1           26166      28006     (27705) + GSAT_MM        Satellite/L1     3043   4901 (1675) 4747270
>   997   14.000   0.000   0.000  GL456393.1           27992      28148     (27563) + GSAT_MM        Satellite/L1     5429   5585  (994) 4747270 *
>  7239    8.400   0.700   1.100  GL456393.1           28151      29142     (26569) + GSAT_MM        Satellite/L1     5588   6575    (4) 4747270
>    13   27.200   0.000   4.000  GL456393.1           39513      39564     (16147) + (ACAGT)n       Satellite/Simple_repeat      1     50    (0) 4747271
>    12   22.500   1.900   6.000  GL456393.1           44988      45039     (10672) + (ACAGT)n       Satellite/Simple_repeat      1     50    (0) 4747272
>  2712    5.200   1.200   6.200  GL456393.1           46531      46939      (8772) + GSAT_MM        Satellite/ERVK      1    390    (0) 4747273
>  4590   17.600   1.300   0.900  GL456393.1           46940      47633      (8078) + GSAT_MM        Satellite/ERVK    302    998  (989) 4747274
>   373   25.800   3.600   3.600  GL456393.1           47691      47827      (7884) + GSAT_MM        Satellite/ERVK   1458   1594 (5987) 4747274
>   272   26.900   0.000   0.000  GL456393.1           47866      47932      (7779) + GSAT_MM        Satellite/ERVK   1923   1989 (4167) 4747274 *
>  8521    1.700   0.000   0.400  GL456393.1           47910      48908      (6803) + GSAT_MM        Satellite/ERVK    990   1984    (3) 4747275
>  2815    2.400   0.200  21.000  GL456393.1           48911      49381      (6330) + GSAT_MM        Satellite/ERVK      1    390    (0) 4747275
>  2072    2.700   1.600   0.000  GL456393.1           49392      49648      (6063) + GSAT_MM        Satellite/ERVK     41    301 (1686) 4747275
>    38    5.900   5.600   1.300  GL456393.1           53036      53107      (2604) + (ACTGTGACACC)n Satellite/Simple_repeat      1     75    (0) 4747276
>    13   30.100   1.400   4.200  GL456393.1           54216      54289      (1422) + (ACAGT)n       Satellite/Simple_repeat      1     72    (0) 4747277

The only differences in these lines are the repeat names and the class/family. The spans and scores are unchanged.
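A minimal sketch of how this check could be automated, assuming the two `.out` files list the same annotations in the same order and that each data row has the usual 15 whitespace-separated fields (plus an optional trailing `*`). The function name `label_only_changes` is made up for illustration and is not part of RepeatMasker.

```python
# 0-based positions of the repeat name and class/family in a .out data row
# (assumption: standard 15-column layout, fields separated by whitespace).
NAME_COL, CLASS_COL = 9, 10


def label_only_changes(lines_a, lines_b):
    """Return (row_index, old_label, new_label) for row pairs whose scores and
    coordinates are identical but whose repeat name or class/family differ.

    Rows are paired by position, so both inputs must list the same
    annotations in the same order.
    """
    changes = []
    for i, (a, b) in enumerate(zip(lines_a, lines_b)):
        fa, fb = a.split(), b.split()
        if len(fa) < 15 or len(fb) < 15:
            continue  # skip blank lines and the short header rows
        # Everything except the two label columns.
        rest_a = fa[:NAME_COL] + fa[CLASS_COL + 1:]
        rest_b = fb[:NAME_COL] + fb[CLASS_COL + 1:]
        label_a = (fa[NAME_COL], fa[CLASS_COL])
        label_b = (fb[NAME_COL], fb[CLASS_COL])
        if rest_a == rest_b and label_a != label_b:
            changes.append((i, label_a, label_b))
    return changes
```

Rows where the coordinates or scores also differ are deliberately not reported, so the result isolates the relabelled rows from any other kind of disagreement between the runs.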

 repeat class/family
5363322,5363336c5363322,5363336
< AmnSINE1        SINE/5S-Deu-L2
< L1MdF_V         LINE/L1
< L1MdF_V         LINE/L1
< L1MdF_V         LINE/L1
< (ACAGT)n        Simple_repeat
< (ACAGT)n        Simple_repeat
< RLTR10          LTR/ERVK
< RLTR10-int      LTR/ERVK
< RLTR10-int      LTR/ERVK
< RLTR10-int      LTR/ERVK
< RLTR10-int      LTR/ERVK
< RLTR10          LTR/ERVK
< RLTR10-int      LTR/ERVK
< (ACTGTGACACC)n  Simple_repeat
< (ACAGT)n        Simple_repeat
---
> GSAT_MM         Satellite/5S-Deu-L2
> GSAT_MM         Satellite/L1
> GSAT_MM         Satellite/L1
> GSAT_MM         Satellite/L1
> (ACAGT)n        Satellite/Simple_repeat
> (ACAGT)n        Satellite/Simple_repeat
> GSAT_MM         Satellite/ERVK
> GSAT_MM         Satellite/ERVK
> GSAT_MM         Satellite/ERVK
> GSAT_MM         Satellite/ERVK
> GSAT_MM         Satellite/ERVK
> GSAT_MM         Satellite/ERVK
> GSAT_MM         Satellite/ERVK
> (ACTGTGACACC)n  Satellite/Simple_repeat
> (ACAGT)n        Satellite/Simple_repeat

A LINE, SINE, or LTR may contain sequence that also matches a simple repeat, but we wouldn't expect a block of 15 consecutive repeat spans to all change to the same class (Satellite) while still being reported with their original family.
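To make that pattern concrete, here is a small sketch that tallies class transitions across a set of (old, new) class/family pairs and counts how often the text after the slash is simply the old family (or the old class, when there was no family). The helper names are hypothetical, and the example pairs are taken from the diff above.

```python
from collections import Counter


def split_class_family(label):
    """Split 'LINE/L1' into ('LINE', 'L1'); a bare class such as
    'Simple_repeat' yields ('Simple_repeat', '')."""
    cls, _, family = label.partition("/")
    return cls, family


def summarize_class_changes(pairs):
    """Count (old_class, new_class) transitions and the number of rows whose
    new family equals the old family (or the old class if it had no family)."""
    transitions = Counter()
    kept_tail = 0
    for old, new in pairs:
        old_cls, old_fam = split_class_family(old)
        new_cls, new_fam = split_class_family(new)
        transitions[(old_cls, new_cls)] += 1
        # The old family reappearing after the slash suggests that only the
        # class prefix was overwritten, not the whole annotation.
        if new_fam == (old_fam or old_cls):
            kept_tail += 1
    return transitions, kept_tail


pairs = [
    ("SINE/5S-Deu-L2", "Satellite/5S-Deu-L2"),
    ("LINE/L1", "Satellite/L1"),
    ("Simple_repeat", "Satellite/Simple_repeat"),
    ("LTR/ERVK", "Satellite/ERVK"),
]
transitions, kept_tail = summarize_class_changes(pairs)
```

If every changed row lands in the same new class and keeps its old trailing token, as it does for these four pairs, that points toward labels being overwritten after alignment rather than toward different alignments being chosen.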

RepeatMasker gives the expected result when run on this one sequence in isolation; we only see the instability when running many sequences from a genome assembly. What might be causing this problem?

murphyte, Jun 12 '20 17:06

That's a bizarre result, and I suspect some sort of post-processing bug caused it. The scores, divergences, and indel stats are all identical between the two runs. Can you reproduce this with a limited set of input sequences?

rmhubley, Jun 12 '20 18:06

We're trying to work up a more controlled test scenario to reproduce this. We've been having other stability issues with RepeatMasker as well: silent failures occur stochastically on some sequences and show up as non-reproducible output for the same input. Have you ever heard of problems like that? We keep increasing the available memory, which hasn't helped, although the behavior is still suggestive of a memory issue.
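One way such a controlled test could be set up is sketched below, under the assumption that each run writes its per-sequence `.out` files into its own directory and that identical input should give byte-identical output. The directory layout and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def unstable_outputs(run_a, run_b, pattern="*.out"):
    """Return the sorted names of output files that are missing from one run
    or whose contents differ between the two run directories."""
    a = {p.name: p for p in Path(run_a).glob(pattern)}
    b = {p.name: p for p in Path(run_b).glob(pattern)}
    unstable = []
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            unstable.append(name)  # possible silent failure in one run
        elif file_digest(a[name]) != file_digest(b[name]):
            unstable.append(name)  # same input, different output
    return unstable
```

Repeating the comparison over several runs and progressively smaller batches of sequences would help narrow down whether the instability depends on which sequences are processed together.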

murphyte avatar Jun 12 '20 20:06 murphyte