Illegal division by zero at ProcessRepeats line 7999.

Open Knnnk opened this issue 1 year ago • 9 comments

Hi, I am using RepeatMasker v4.1.5 to mask repeats in the genome of a shrimp, P. indicus. The *.fa.cat.gz file was produced, but the .fa file was not masked. So I reran the ProcessRepeats step:

ProcessRepeats -lib all.lib -html -gff -maskSource Penaeus_indicus.fa Penaeus_indicus.fa.cat.gz

But the program finished without producing any .masked, .gff, or .html files, which is confusing. The output message is below, with many dots omitted:

processing output:
cycle 1 ........
cycle 2 ........
cycle 3............Illegal division by zero at /gpfshddpool/home/nikuo/miniconda3/envs/repeat/bin/ProcessRepeats line 7999.
..................................................................

Environment (please include as much of the following information as you can find out):

  • How did you install RepeatMasker? bioconda

  • The output of RepeatMasker -v

RepeatMasker version 4.1.5

  • Operating system and version. The output of uname -a and lsb_release -a can be used to find this.

Linux mgt01 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance for your help!

Best Regards, Nik

Knnnk avatar Sep 03 '24 02:09 Knnnk

The part I’m still confused about is: when I ran RepeatMasker the first time, it didn’t report any errors, but it also didn’t output any GFF, HTML, or masked FA files. This was the command I used during the first run of RepeatMasker:

RepeatMasker -parallel 16 -e ncbi -html -gff -dir ./ -lib all.lib Penaeus_indicus.fa

I will attach the output log as a file. Thank you again!

Knnnk avatar Sep 03 '24 03:09 Knnnk

out.60584.log

Knnnk avatar Sep 03 '24 03:09 Knnnk

Hi, dear RepeatMasker developers @rmhubley @jebrosen @diekhans ,

I apologize if my previous description was unclear. I am encountering a similar issue once again. I would greatly appreciate any assistance you could provide. I split the P. indicus genome into 12 pieces with similar disk usage, all of which have a bunch of complete sequences of chromosomes and contigs.

$ grep -c ^\> Splited/*.fa
Splited/part_10.fa:333
Splited/part_11.fa:223
Splited/part_12.fa:1860
Splited/part_1.fa:2222
Splited/part_2.fa:2222
Splited/part_3.fa:1778
Splited/part_4.fa:444
Splited/part_5.fa:334
Splited/part_6.fa:444
Splited/part_7.fa:334
Splited/part_8.fa:444
Splited/part_9.fa:333

I ran RepeatMasker for each part separately with this command:

RepeatMasker -parallel 16 -e ncbi -html -gff -dir ./ -lib all.lib Splited/part_1.fa

This time, I found that only the first part (Splited/part_1.fa) had an issue. The error message in its output was:

"……after many rounds of repeat identifying……

identifying Simple Repeats in batch 2301 of 2304

processing output:

cycle 1 ........ [dots truncated]
cycle 2 ........ [dots truncated]
cycle 3 ........ [dots truncated]"

The program ended, leaving only a "part_1.fa.cat.gz" file. So I reran the ProcessRepeats step:

ProcessRepeats -lib all.lib -maskSource Splited/part_1.fa -gff part_1.fa.cat.gz

and the error is the same one I reported four days ago:

Illegal division by zero at ~/miniconda3/envs/repeat/bin/ProcessRepeats line 7999.

It seems that I have encountered an issue that I am unable to understand or resolve on my own. I would be sincerely grateful for any assistance.

Best regards, Nik

Knnnk avatar Sep 07 '24 05:09 Knnnk

Can you share your part_1.fa.cat.gz file with me?

rmhubley avatar Sep 09 '24 21:09 rmhubley

I have uploaded it to Google Drive. Please check the link: https://drive.google.com/file/d/18r65XMBZFI4ctUEScO3vWNNGgyxlHzgm/view?usp=sharing

Knnnk avatar Sep 10 '24 00:09 Knnnk

Ok...this exercises a bug in ProcessRepeats that is pretty rare. I have fixed the problem and it will make it into the next release, due out in a few days. Unfortunately, in looking through your data I think you have a much bigger problem with your library ("all.lib"). You appear to have some non-unique family identifiers in this file. For example:

linear;#DNA/CMC-Chapaev
linear;#DNA/MULE-MuDR
linear;#LINE/Penelope
...

These sequence identifiers are parsed as "family_id#class/subclass". The family_id in these three cases is "linear;", which is probably causing all sorts of chaos in the adjudication algorithms of ProcessRepeats. Unfortunately, the sanity-checking portion of RepeatMasker is not checking for unique family_ids; rather, it's considering the complete id ("family_id#class/subclass") when looking for duplicates in your -lib file. So it's not warning you up-front about this issue.

You may also want to check that you haven't mixed more than one RepeatModeler library together ('rnd-#_family-#' family ids), as these auto-generated family names will get re-used with each subsequent run.

Finally, I am a bit confused as to why you are getting such strong matches to simple repeats. Take this alignment for example:

459 18.46 1.86 5.81 Pin8879 947 1025 (4052) C linear;#DNA/Crypton-H (5960) 1278 1200 m_b1945s001i316

  Pin8879              947 TAGTGTGTGTATGTATATATATATATATATATATATATATATATATATAT 996
                             i i i i   i                                     
C linear;#DNA/C       1278 TAATATATATATATATATATATATATATATATATATATATATATATATAT 1229

  Pin8879              997 ATATATATATACATATACATATATATATA 1025
                                      i     i           
C linear;#DNA/C       1228 ATATATATATATATATATATATATATATA 1200

It looks as if you used the "-nolow" flag, yet I don't see that in your command line above. Typically RepeatMasker would have identified this region first as a perfect tandem repeat, keeping it from being matched to your "linear;#DNA/Crypton-H" family (in this case). I see in your log that simple repeat searching is being run, yet when I try this on my copy of RepeatMasker I see that TRF finds this and masks it, preventing any matches to anything in my *.lib file. I would like to follow up on that. Could you run a simple test for me?

# Create a small sequence file "test.fa" containing the following record
>seq1-Pin8879
ACGTGCGGTAGGACTGATCTAGTCAGTGGCTAGCTGCTGGATGC
TAGTGTGTGTATGTATATATATATATATATATATATATATATATATATAT
ATATATATATATATATATATATATATATAGACGTGCGACGATGTAGCT
AGACTACGTGCGCGCTATCGTGCTGATCATGCTGCTAGCTGATC

# Now run RepeatMasker with your all.lib library
RepeatMasker -e ncbi -html -gff -dir ./ -lib all.lib test.fa

This is what your test.fa.out file should look like:

   SW   perc perc perc  query         position in query    matching repeat          position in repeat
score   div. del. ins.  sequence      begin end   (left)   repeat   class/family  begin  end    (left)  ID

   78    0.0  0.0  0.0  seq1-Pin8879     58   123   (63) + (TA)n    Simple_repeat      1     66    (0)   1  

A single match to a TRF result and not:

459 18.46 1.86 5.81 Pin8879 947 1025 (4052) C linear;#DNA/Crypton-H (5960) 1278 1200
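Since RepeatMasker's own sanity check compares only the complete id, the non-unique family_ids described above can be looked for separately. The following is a minimal sketch, not part of RepeatMasker: it assumes each header in the library has the form ">family_id#class/subclass" optionally followed by a description, and the function name duplicate_family_ids is made up for illustration.

```python
from collections import defaultdict


def duplicate_family_ids(fasta_lines):
    """Return {family_id: [complete ids]} for family_ids used by more than one record.

    Assumes headers look like '>family_id#class/subclass [description]'.
    """
    seen = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # bare '>' with no identifier
        full_id = tokens[0]
        # Everything before the first '#' is treated as the family_id
        family_id = full_id.split("#", 1)[0]
        seen[family_id].append(full_id)
    return {fam: ids for fam, ids in seen.items() if len(ids) > 1}


# Example with the three headers quoted above plus one unique record
example = [
    ">linear;#DNA/CMC-Chapaev", "ACGT",
    ">linear;#DNA/MULE-MuDR", "ACGT",
    ">linear;#LINE/Penelope", "ACGT",
    ">rnd-1_family-1#LINE/L1", "ACGT",
]
for fam, ids in duplicate_family_ids(example).items():
    print(fam, len(ids), ", ".join(ids))
```

To run it on a real library, pass an open file handle, for example duplicate_family_ids(open("all.lib")). Any family_id it reports would need to be renamed so that each record's family_id is unique.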

rmhubley avatar Sep 10 '24 17:09 rmhubley

Thank you for your attention and follow-up! Strangely, I didn't use the -nolow flag in my run. And this is the test output:

RepeatMasker version 4.1.5
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
Using Custom Repeat Library: all.lib

analyzing file test.fa

Checking for E. coli insertion elements
identifying Simple Repeats in batch 1 of 1
identifying matches to all.lib sequences in batch 1 of 1
identifying Simple Repeats in batch 1 of 1
processing output: 
cycle 1 
cycle 2 
cycle 3 
cycle 4 
cycle 5 
cycle 6 
cycle 7 
cycle 8 
cycle 9 
cycle 10 
Generating output... 
masking
done

The test.fa.out file is exactly the TRF matching result:

$ cat test.fa.out
   SW   perc perc perc  query         position in query    matching repeat          position in repeat
score   div. del. ins.  sequence      begin end   (left)   repeat   class/family  begin  end    (left)  ID

   78    0.0  0.0  0.0  seq1-Pin8879     58   123   (63) + (TA)n    Simple_repeat      1     66    (0)   1  

Knnnk avatar Sep 11 '24 02:09 Knnnk

Well, that looks good. Without access to your sequence and library it's hard to reproduce these alignments. My suggestion is to use the latest version of RepeatMasker (4.1.7-p1), fix the IDs in your library, and run a test sequence (say 10MB of your genome). Check the *.alignments file to see if you are getting a ton of false matches of simple repeats to these families that contain small sections of simple repeats. If it looks good, then rerun the full genome.
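One way to cut a test sequence of roughly that size is sketched below. This is only an illustration under stated assumptions: the function name take_subset and the output filename test_10mb.fa are hypothetical, and the sketch keeps whole FASTA records, so the result can exceed the target when a single record (such as a chromosome) is larger than 10MB.

```python
def take_subset(fasta_lines, max_bases=10_000_000):
    """Yield whole FASTA records from the start of the file until at least
    max_bases bases have been emitted. Records are never cut in the middle."""
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            # Only stop at a record boundary, once the target has been reached
            if total >= max_bases:
                return
        else:
            total += len(line.strip())
        yield line


def write_subset(src_path, dst_path, max_bases=10_000_000):
    """Copy the leading records of src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(take_subset(src, max_bases))
```

For example, write_subset("Penaeus_indicus.fa", "test_10mb.fa") would produce a smaller file that can then be given to RepeatMasker in place of the full genome.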

rmhubley avatar Sep 13 '24 21:09 rmhubley

Thank you very much for your patient guidance and help. I am a bit occupied at the moment, but I will let you know as soon as there are updates.

Knnnk avatar Sep 16 '24 08:09 Knnnk

I am closing this for now. Please let me know if you have any further problems.

rmhubley avatar Dec 04 '24 19:12 rmhubley