
Non-deterministic segfault and misc. issues

Open Sebastien-Raguideau opened this issue 3 years ago • 2 comments

Hello,

Thanks for FastK, it is truly useful.

I'm looking at the k-mer coverage of contigs in a small assembly. To get a k-mer coverage/histogram for each single contig, I need to create one fasta file per contig and apply FastK to it, which is quite involved. I do trivial parallelisation over all the files. My pipeline stops randomly as a consequence of a segfault on random contigs. Running the incriminated step by itself, outside the pipeline, does not reproduce the issue. Restarting the pipeline from scratch does make it segfault again, but not on the same files. My intuition is that there might be an issue with multiple FastK instances writing to the same temporary folder. Here is an example of the error message: /bin/bash: line 1: 193787 Segmentation fault (core dumped) FastK -t1 -T2 seqs/folder43/contig_12.fa -P'seqs/folder43'
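For reference, the per-contig setup can be sketched roughly like this (a minimal illustration, not the actual pipeline; `split_fasta`, `fastk_command`, and the per-contig `.tmp` directory naming are invented for the sketch — only the `FastK -t1 -T2 ... -P...` invocation comes from the command above):

```python
import os

def split_fasta(multi_fasta, out_dir):
    """Split a multi-contig FASTA into one file per contig and
    return the list of per-contig FASTA paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths, out = [], None
    with open(multi_fasta) as fh:
        for line in fh:
            if line.startswith(">"):
                if out:
                    out.close()
                name = line[1:].split()[0]
                path = os.path.join(out_dir, name + ".fa")
                paths.append(path)
                out = open(path, "w")
            out.write(line)
    if out:
        out.close()
    return paths

def fastk_command(contig_fa):
    # Give every FastK invocation its own temporary directory (-P)
    # so concurrent instances can never share temp files.
    tmp_dir = contig_fa[:-3] + ".tmp"
    return ["FastK", "-t1", "-T2", contig_fa, "-P" + tmp_dir]
```

With a distinct `-P` path per job, the temp-file-collision hypothesis is ruled out regardless of how the contigs are named.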

I am having another, similar problem with random segfaults, this time with Logex: Logex -H 'result=A&.B' sample.ktab contigs.ktab. In all the examples I've looked at, the segfault happened when contigs.ktab was empty (a well-formed table with 0 k-mers). Yet running the same command line outside the pipeline works without issue and indeed produces a working .hist file (albeit with no k-mers). This second issue is less problematic in the sense that I can just pre-filter out the empty contigs.ktab files.
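The pre-filter mentioned can be sketched as a guard around the Logex call. This is purely illustrative: `table_kmer_count` is a hypothetical helper (the .ktab internals are not described here, so plug in whatever table inspection the pipeline already trusts), and only the Logex command line itself comes from the report above:

```python
import subprocess

def table_kmer_count(ktab_path):
    # Hypothetical helper: return the number of k-mers in a .ktab.
    # Replace with however your pipeline already inspects tables.
    raise NotImplementedError

def run_logex(sample_ktab, contigs_ktab, counter=table_kmer_count):
    """Run Logex only when the contig table is non-empty,
    skipping the empty tables observed to segfault."""
    if counter(contigs_ktab) == 0:
        return None  # nothing to intersect; skip the call
    cmd = ["Logex", "-H", "result=A&.B", sample_ktab, contigs_ktab]
    return subprocess.run(cmd, check=True)
```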

Additionally, here are a few miscellaneous issues I encountered. I'm mostly puzzled by the first one:

  • Looking at k-mers in small individual genes/contigs, for sequences of 100+ bp and k=40, I get the following error message:
    "FastK: Too much of the data is in reads less than the k-mer size". If I append "NN" at the end of the sequence, I obtain the expected results without failure, and some other, smaller sequences do not show this issue. I attached an example: https://github.com/thegenemyers/FASTK/files/7913702/example.txt
  • Fastmerge doesn't handle empty ktabs; it segfaults.
  • Fastq.gz files are unzipped in the working directory but not removed afterwards.
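As a stopgap for the first bullet, the pipeline could screen out sequences that cannot yield a single k-mer and make the empirical "NN" padding explicit. A sketch only (the helper names are mine, and it does not claim to explain the error):

```python
def usable_for_fastk(seq, k=40):
    """True if the sequence can contribute at least one k-mer."""
    return len(seq) >= k

def nn_workaround(seq):
    # Empirical work-around from above: appending "NN" let some
    # failing sequences run; Ns never form valid k-mers, so the
    # counts themselves should be unaffected.
    return seq + "NN"
```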

If it's useful, I'm working on Ubuntu 16.04.7 LTS with gcc version 9.4.0.

Best, Seb

Sebastien-Raguideau avatar Jan 21 '22 14:01 Sebastien-Raguideau

Hi Seb,

Thanks for your input. As best I recall, all the temporary file names have the form <temp_path>/.... where <temp_path> is the temporary directory (seqs/folder43 in your example) and root is the root of the first file after stripping off any suffixes (contig_12 in your example). So if these two strings are the same for any two calls, then yes, you are going to have a problem.
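Under that naming scheme, two jobs only collide when both the temp directory and the stripped root coincide. A small sanity check one could run over a job list — the prefix layout is paraphrased from the description above, and `temp_prefix`/`has_collisions` are illustrative names, not FastK code:

```python
import os

def temp_prefix(input_path, temp_dir):
    """Temp-file prefix as described: the temp directory plus the
    input's root name with all suffixes stripped."""
    root = os.path.basename(input_path)
    while "." in root:                 # strip .fa, .fa.gz, ...
        root = root[: root.rfind(".")]
    return os.path.join(temp_dir, root)

def has_collisions(jobs):
    """jobs: list of (input_path, temp_dir) pairs. True if any two
    jobs would share a temp prefix and hence clobber each other."""
    prefixes = [temp_prefix(p, d) for p, d in jobs]
    return len(prefixes) != len(set(prefixes))
```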

Beyond that it is hard to say anything, albeit I'll mention that if, on a distributed file system, jobs are too small, then I have observed such jobs can crash due to IO synchronization failures (of the distributed system). What you describe may be in that regime; one indicator would be that the jobs that crash on any given attempt vary.

I did check and fix problems involving empty tables, so now both Logex and Fastmerge should be fine on empty tables.

I also fixed it so any files FastK unzipped are cleaned up (the code should have done so for normal exits, but I arranged it so clean-up occurs in the event of an abnormal exit also).
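A wrapper pipeline can mimic the same clean-up-on-abnormal-exit behaviour around its own scratch space. A minimal sketch, assuming the scratch files live in one directory created per job (this is generic Python, not FastK's actual mechanism):

```python
import atexit
import shutil
import signal
import sys
import tempfile

def make_scratch_dir():
    """Create a scratch directory that is removed on normal exit
    and on SIGINT/SIGTERM (an interrupted pipeline would otherwise
    leave the unzipped files behind)."""
    path = tempfile.mkdtemp(prefix="fastk_scratch_")

    def cleanup(*_):
        shutil.rmtree(path, ignore_errors=True)

    def on_signal(signum, frame):
        cleanup()
        sys.exit(1)

    atexit.register(cleanup)                  # normal exit
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, on_signal)         # abnormal exit
    return path
```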

FastK was not really designed for arbitrarily short reads, especially ones that are not much bigger than the k-mer size (40 by default). It expects a large corpus of data and actually "trains" on an initial 10Mbp or so (if it's available) to determine how to distribute k-mers for the sort proper. With a 100bp sequence it just doesn't have enough data to train on, and hence the error. But I thought about this, and really it should just be a warning: FastK will work even if the training "fails", so I have changed the code appropriately. Your short example should now run to completion, albeit you will see a warning statement.
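The described behaviour can be caricatured as a simple threshold check. The 10 Mbp figure and the warning-vs-error distinction are taken from the explanation above; the function, its names, and the wording are mine, and the real logic inside FastK is of course more involved:

```python
TRAIN_TARGET_BP = 10_000_000  # ~10 Mbp training corpus, per the description

def training_status(seq_lengths, k=40):
    """Classify an input the way the explanation suggests:
    enough data to train, too little (warn), or no usable k-mers."""
    usable = sum(n - k + 1 for n in seq_lengths if n >= k)
    if usable == 0:
        return "error: no read reaches the k-mer size"
    total = sum(seq_lengths)
    if total < TRAIN_TARGET_BP:
        return "warning: too little data to train the k-mer sort"
    return "ok"
```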

The changes above have been pushed to the github master. Please let me know if there is anything else I can do.

Best, Gene


thegenemyers avatar Jan 30 '22 12:01 thegenemyers

Hi Gene,

Thanks a lot for all of this.

Yes, in my pipeline there is no possibility for two roots to be identical, so it is likely not related to temporary files. I am not on a distributed system, but I can see how overly frequent writing/reading could be an issue. I'll try to reduce the number of concurrent jobs and see if I can make the problem disappear.

I didn't realise that the remaining unzipped files were from me interrupting my pipeline... That makes sense, as it would not occur all the time.

I am myself also unclear on the interest of looking at such small reads/contigs. It is quite possible that in the future I won't trust the coverage of contigs under a certain size, though I am pretty sure it will always be under 10Mbp :)

Thanks again, Seb

Sebastien-Raguideau avatar Jan 31 '22 11:01 Sebastien-Raguideau