
subseq => core dumped

[Open] ndaniel opened this issue 9 years ago · 8 comments

It looks like when there is not enough free memory available for "seqtk subseq", it crashes with the error message "Segmentation fault (core dumped)".

For example running this:

seqtk subseq reads-of-size-35-GB.fq list-of-size-20-GB.txt > output.fq

on a computer with 16 GB of RAM will produce the error message "Segmentation fault (core dumped)".

The expected behavior would be to exit gracefully with a message like "not enough memory" or similar!

ndaniel commented on Dec 30, 2014

The stk_reg_read() function, which allocates the hash for the sequence IDs, calls malloc() and realloc() but does not check whether they failed (returned NULL).

Because you loaded a file bigger than your RAM size, it failed. The simplest solution is to break your list.txt file into smaller pieces and do them one at a time. You can use the Unix split command to do that.
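For example, something along these lines (the chunk size here is only illustrative; pick one that fits in RAM):

split -l 10000000 list-of-size-20-GB.txt list_chunk_

and then run seqtk subseq reads-of-size-35-GB.fq list_chunk_aa > output_aa.fq for each resulting chunk, concatenating the outputs at the end.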

tseemann commented on Aug 18, 2016

Unfortunately it is not possible to split the 'list.txt', because SeqTK is run as part of an automatic pipeline (namely https://github.com/ndaniel/fusioncatcher ) on servers which run other tasks/jobs in parallel. Also, splitting the 'list.txt' incurs a time penalty. Because other jobs start and stop, the free memory available on the server changes all the time, which makes it impossible to predict beforehand whether SeqTK will have enough free memory or not.

The right way to fix this bug is to check the return values of malloc() and realloc() in SeqTK and, if allocation fails, print a clear error message and exit with a non-zero error code.
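A minimal sketch of what such checks could look like in C (the wrapper names below are illustrative, not actual seqtk code):

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate or exit with a clear message and a non-zero status,
   instead of crashing later on a NULL pointer dereference. */
static void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        fprintf(stderr, "[seqtk subseq] not enough memory (failed to allocate %zu bytes)\n", n);
        exit(EXIT_FAILURE);
    }
    return p;
}

static void *xrealloc(void *old, size_t n)
{
    void *p = realloc(old, n);
    if (p == NULL) {
        fprintf(stderr, "[seqtk subseq] not enough memory (failed to reallocate %zu bytes)\n", n);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

Replacing the unchecked malloc()/realloc() calls with wrappers like these would turn the segmentation fault into an immediate, diagnosable failure.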

ndaniel commented on Aug 19, 2016

@ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?

Another option is to use grep -F -f list.txt -A 3 file.fastq

tseemann commented on Aug 20, 2016

> @ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?

Usually, when it is not able to allocate memory successfully, SeqTK becomes a zombie process and in many cases refuses to die for hours (or even days) afterwards. The pipeline which is using it therefore also hangs for hours or days because of it.

> Another option is to use grep -F -f list.txt -A 3 file.fastq

At first glance, it looks to me like this might not work on all FASTQ files, for example FASTQ files whose IDs have been compressed using lossy compression (e.g. a read ID of AAA). I already have a replacement in place for when SeqTK fails, namely extract_short_reads.py, but it is slower than SeqTK. Therefore there is no need for workarounds; it would be nice if the bug were just fixed!

ndaniel commented on Aug 22, 2016

@ndaniel Yes, it would be good if it failed correctly so you could fall back onto your Python script.

But given that you know how much RAM you have, can you just fall back when you know the list.txt file is too big?

Also, I don't think making the read IDs shorter will help, because the code uses a fixed-length hashing string to store the IDs.

tseemann commented on Aug 23, 2016

> But given that you know how much RAM you have, can you just fall back when you know the list.txt file is too big?

No. It is impossible to predict how much free RAM will be available at a given time in the future on a server which starts and stops several tasks/programs simultaneously.

> Also, I don't think making the read IDs shorter will help, because the code uses a fixed-length hashing string to store the IDs.

Yes, it does help according to my tests.

ndaniel commented on Aug 23, 2016

> seqtk subseq reads-of-size-35-GB.fq list-of-size-20-GB.txt > output.fq

Since the whole ID list needs to be stored in RAM, a memory-efficient data structure like a _Bloom filter_ could be used to check for the existence of an ID. This probabilistic data structure may have false positives but no false negatives, so we can use the subtract (IDs not in the sequence file) to ensure correctness.

This needs a few more steps:

  1. compute the subtract
  2. construct the Bloom filter from the subtract
  3. extract the sequences not in the subtract

_Update_: step 1 needs to read the whole ID list too, so this may not be a good solution.
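For reference, a minimal sketch of the Bloom filter data structure mentioned above, in C (names and parameters are illustrative, not part of seqtk): a fixed-size bit array takes far less memory than storing the ID strings themselves, at the cost of occasional false positives on membership queries.

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal Bloom filter sketch: a bit array plus k probe positions
   derived from two base hashes (double hashing). */
typedef struct {
    uint64_t *bits;
    uint64_t  nbits;  /* number of bits in the array */
    int       k;      /* number of probes per item */
} bloom_t;

/* FNV-1a hash with a seed, used to derive the probe positions. */
static uint64_t hash_seed(const char *s, uint64_t seed)
{
    uint64_t h = 1469598103934665603ULL ^ seed;
    for (; *s; ++s) { h ^= (uint8_t)*s; h *= 1099511628211ULL; }
    return h;
}

static bloom_t *bloom_new(uint64_t nbits, int k)
{
    bloom_t *b = calloc(1, sizeof(*b));
    if (!b) return NULL;
    b->nbits = nbits;
    b->k = k;
    b->bits = calloc((nbits + 63) / 64, sizeof(uint64_t));
    if (!b->bits) { free(b); return NULL; }
    return b;
}

static void bloom_add(bloom_t *b, const char *id)
{
    uint64_t h1 = hash_seed(id, 0), h2 = hash_seed(id, 0x9e3779b97f4a7c15ULL);
    for (int i = 0; i < b->k; ++i) {
        uint64_t pos = (h1 + (uint64_t)i * h2) % b->nbits;
        b->bits[pos / 64] |= 1ULL << (pos % 64);
    }
}

/* Returns 1 if id is possibly in the set (false positives possible),
   0 if it is definitely not in the set (no false negatives). */
static int bloom_query(const bloom_t *b, const char *id)
{
    uint64_t h1 = hash_seed(id, 0), h2 = hash_seed(id, 0x9e3779b97f4a7c15ULL);
    for (int i = 0; i < b->k; ++i) {
        uint64_t pos = (h1 + (uint64_t)i * h2) % b->nbits;
        if (!(b->bits[pos / 64] & (1ULL << (pos % 64)))) return 0;
    }
    return 1;
}
```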

shenwei356 commented on Aug 23, 2016

@shenwei356

For now I am using extract_short_reads.py for cases when the read IDs do not fit in memory; it works well, but it is slower than SeqTK.

ndaniel commented on Aug 23, 2016