seqtk
subseq => core dumped
It looks like when there is not enough free memory available for "seqtk subseq", it crashes with the error message "Segmentation fault (core dumped)".
For example running this:
seqtk subseq reads-of-size-35-GB.fq list-of-size-20-GB.txt > output.fq
on a computer with 16 GB of RAM will produce the error message "Segmentation fault (core dumped)".
The expected behavior would be to exit gracefully with a message like "not enough memory"!
The stk_reg_read() function, which allocates the hash for the sequence IDs, uses the malloc() and realloc() library calls but does not check whether they failed (returned NULL).
Because you loaded a file bigger than your RAM, it failed. The simplest workaround is to break your list.txt file into smaller pieces and process them one at a time; the Unix split command can do that.
Unfortunately it is not possible to split 'list.txt' because seqtk runs as part of an automatic pipeline (namely https://github.com/ndaniel/fusioncatcher ) on servers that run other tasks/jobs in parallel. Splitting 'list.txt' also incurs a time penalty. The free memory available on the server changes all the time, which makes it impossible to predict beforehand whether seqtk will have enough free memory or not.
The right way to fix this bug is to check the return values of malloc() and realloc() in seqtk, print a clear error message, and exit with a non-zero error code.
@ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?
Another option is to use grep -F -f list.txt -A 3 file.fastq
> @ndaniel Sure - an error code would be good practice - but fusioncatcher will still fail?
Usually, when seqtk fails to allocate memory, it becomes a zombie process and in many cases refuses to die for hours (or even days) afterwards. The pipeline using it therefore also hangs for hours or days because of it.
> Another option is to use grep -F -f list.txt -A 3 file.fastq
At first glance it looks to me like this might not work on all FASTQ files, for example files whose IDs have been shortened using lossy compression (e.g. a read ID of AAA). I already have a replacement in place for when seqtk fails, namely extract_short_reads.py, but it is slower than seqtk. So there is no need for workarounds; it would be nice if the bug were just fixed!
@ndaniel yes, it would be good if it failed correctly so you could fall back onto your python script.
but given you know how much RAM you have, can you just fall back when you know the size of the list.txt file is too big?
also, i don't think making the read IDs shorter will help, because the code uses a fixed length hashing string to store the IDs.
> but given you know how much RAM you have, can you just fall back when you know the size of the list.txt file is too big?
No. It is impossible to predict how much free RAM will be available at a given time in the future on a server that simultaneously runs/starts/stops several tasks/programs.
> also, i don't think making the read IDs shorter will help, because the code uses a fixed length hashing string to store the IDs.
Yes, it does help according to my tests.
> seqtk subseq reads-of-size-35-GB.fq list-of-size-20-GB.txt > output.fq
Since the whole ID list needs to be stored in RAM, a memory-efficient data structure like a _Bloom filter_ could be used to check the existence of an ID. This probabilistic structure may have false positives but never false negatives, so we can use the complement (IDs not in the sequence file) to ensure correctness.
This needs more steps:
- compute the complement
- construct the Bloom filter with the complement
- extract the sequences not in the complement
_update_: step 1 also needs to read the whole ID list!!! so this may not be a good solution.
@shenwei356
For now I am using extract_short_reads.py for the cases when the read IDs do not fit in memory, and it works well BUT it is slower than seqtk.