Aborted (core dumped) with LTR harvest

Open xiaoxiaonao opened this issue 4 years ago • 5 comments

Problem description

While using LTRharvest this error pops up:

Assertion failed: (refrng.start <= boundaries->leftLTR_5), function gt_removeoverlapswithlowersimilarity, file src/ltr/ltrharvest_stream.c, line 1222.
This is a bug, please report it at
https://github.com/genometools/genometools/issues
Please make sure you are running the latest release which can be found at
http://genometools.org/pub/
You can check your version number with `gt -version`.
Aborted (core dumped)

Exact command line call triggering the problem

gt suffixerator -db Ps_genome.part-05.fasta  -indexname Ps_genome.part-05 -tis -suf -lcp -des -ssp -sds -dna

After creating the index, submit the following command:

gt ltrharvest -index Ps_genome.part-05 -minlenltr 100 -maxlenltr 3000 -similar 80 -gff3 Ps_genome.part-05_inner.fa > Ps_genome.part-05_harvest.scn

Example minimal input triggering the problem

What GenomeTools version are you reporting an issue for (as output by gt -version)?

gt (GenomeTools) 1.6.2
Copyright (c) 2003-2016 G. Gremme, S. Steinbiss, S. Kurtz, and CONTRIBUTORS
Copyright (c) 2003-2016 Center for Bioinformatics, University of Hamburg
See LICENSE file or http://genometools.org/license.html for license details.

Used compiler: cc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Compile flags: -g -Wall -Wunused-parameter -pipe -fPIC -Wpointer-arith -Wno-unknown-pragmas -O3 -Werror

Did you compile GenomeTools from source? If so, please state the make parameters used.

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

CentOS Linux 8 (Core)

xiaoxiaonao avatar Jan 04 '22 01:01 xiaoxiaonao

Hi, thanks for reporting this. To properly reproduce the error and determine the root cause, though, I need the input sequence you used (Ps_genome.part-05.fasta). It would be great if you could provide this file, or, alternatively, a snippet of this sequence that triggers this issue without having to reveal too much of your input.

I also noticed that your value for maxlenltr is very large (3000). Could you try to also adjust mindistltr to account for that and to prevent overlapping LTRs? In this case it should be at least 3000. If that helps, then maybe LTRharvest should check this condition at the start.
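The condition suggested above can be checked before starting a long run. What follows is only a minimal shell sketch and not part of GenomeTools: `check_ltr_params` is a hypothetical helper name, and the rule it enforces (mindistltr at least as large as maxlenltr) is simply the suggestion from this comment.

```shell
#!/bin/sh
# Hypothetical pre-flight check (NOT a GenomeTools feature).
# Refuses to proceed if the minimum LTR distance is smaller than the
# maximum LTR length, which is the condition suggested in this thread.
# Arguments: $1 = maxlenltr, $2 = mindistltr (both integers).
check_ltr_params() {
    maxlenltr=$1
    mindistltr=$2
    if [ "$mindistltr" -lt "$maxlenltr" ]; then
        echo "mindistltr ($mindistltr) should be >= maxlenltr ($maxlenltr)" >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the real call, for example `check_ltr_params 3000 3000 && gt ltrharvest -index Ps_genome.part-05 -minlenltr 100 -maxlenltr 3000 -mindistltr 3000 -similar 80`, where the `-mindistltr` value is the one proposed above.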

satta avatar Jan 04 '22 10:01 satta

Dear Sascha, attached is the input file (Ps_genome.part-05.fasta.gz). The file is over 2 GB when unzipped.

(Sent by email on Tue, Jan 4, 2022 in reply to: [genometools/genometools] Aborted (core dumped) with LTR harvest (Issue #999))

Large attachment sent from Tencent Enterprise Mail

Ps_genome.part-05.fasta.gz (586.6M, expires 2022-02-03 19:25). Go to the download page: http://mail.qq.com/cgi-bin/ftnExs_download?t=exs_ftn_download&k=353934376f586fe23f1621ab45660e0b4a0b555656550b5b561403545153140d550b061a5b52095c480a5654525408005d01565306662339354a6b50070856540017445610121409501752561112581702433409&code=e947bf99&fid=72/2aa432b3-7c35-4022-940e-3bc021988bdd

xiaoxiaonao avatar Jan 04 '22 11:01 xiaoxiaonao

Thanks, I downloaded the file and will try to reproduce the issue. LTRharvest is running quite long... have you masked all short and tandem repeats before running LTRharvest? Otherwise the seed hits will explode, unnecessarily blowing up the run time.

satta avatar Jan 04 '22 22:01 satta

I have not masked any short or tandem repeats before running LTRharvest. The error was reported after the run had been going for two weeks. It is also difficult to annotate tandem repeats due to their length.

xiaoxiaonao avatar Jan 05 '22 01:01 xiaoxiaonao

Ouch, I see. Two weeks -- LTRharvest definitely should never run that long! I would strongly advise at least using RepeatMasker to mask low-complexity repeats in the source. Running LTRharvest on the raw sequence is not recommended if it contains many long instances of such repeats. With the default seed size of 30, these produce a large number of candidate pairs to evaluate, which inflates the run time excessively. You likely need to prepare the input sequence a bit.

My suggestion:

  • Mask low-complexity repeats (ideally by hard-masking them by converting them to N)
  • Annotate and mask other known transposons if you want (any LINE/SINE/... if they exist in your organism, just leave out LTR elements from your reference set) so you will only run LTRharvest on previously unexamined sequence.
  • Adjust the seed length. If you are expecting really long LTRs, say >1kb, at your 80% similarity threshold, then you can use longer seeds, which will also reduce run time a lot.
  • Adjust the mindistltr to be at least your maxlenltr or do multiple runs with different settings.
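The hard-masking step from the first bullet can be sketched as follows. This is an illustration under stated assumptions, not a tested recipe for this data set: it assumes the sequence has already been soft-masked by a separate tool (repeats written in lowercase, for example RepeatMasker run with its `-xsmall` option), and `hardmask_fasta` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: convert a soft-masked FASTA file into a hard-masked one by
# replacing every lowercase base with N. Header lines (starting with '>')
# are printed unchanged.
# ASSUMPTION: lowercase letters mark repeats found by an earlier masking run.
# LC_ALL=C makes the [a-z] range match ASCII lowercase only.
# Argument: $1 = soft-masked FASTA file; result is written to stdout.
hardmask_fasta() {
    LC_ALL=C awk '/^>/ { print; next } { gsub(/[a-z]/, "N"); print }' "$1"
}
```

The masked output would then have to be indexed again with `gt suffixerator` before rerunning `gt ltrharvest`, this time with a longer `-seed` and a `-mindistltr` of at least 3000. A suitable seed length depends on the data and is not something this sketch can determine.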

Regarding the original error: I am afraid I will not be able to run the software for two weeks each time I need to reproduce the error as I don't have a compute farm at my disposal any more. Is there any way you could come up with a smaller sequence stretch that triggers the issue?
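One possible way to narrow the input down, sketched here under the assumption that Ps_genome.part-05.fasta contains more than one sequence record (the thread does not say whether it does): split the file into one FASTA file per record and run the failing commands on each piece separately. `split_fasta` and the `record_NNNN.fasta` naming are hypothetical choices for illustration.

```shell
#!/bin/sh
# Sketch: split a multi-record FASTA file into one file per record so that
# each record can be indexed and searched on its own.
# ASSUMPTION: the input holds several records; with a single record this
# just produces one output file.
# Arguments: $1 = input FASTA, $2 = output directory (created if missing).
split_fasta() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        # A header line starts a new record: close the previous output
        # file and open the next numbered one.
        /^>/ { if (out) close(out); n++; out = sprintf("%s/record_%04d.fasta", dir, n) }
        # Write every line of the current record, header included.
        out  { print > out }
    ' "$1"
}
```

Each resulting file could then go through the same `gt suffixerator` and `gt ltrharvest` calls from the report. A record that reproduces the assertion would be a much smaller test case to attach.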

satta avatar Jan 05 '22 08:01 satta