fastp step makes machine unresponsive
I am running the pipeline on multiple files, and the fastp step, which appears to run concurrently on all files, slows the machine down and eventually makes it unresponsive.
As far as I can see, this step cannot be skipped. Is there a way to queue this step so it runs sequentially, or perhaps a way to skip it altogether? Any other suggestions are very much appreciated.
You can skip it by setting --split_fastq 0 and doing no trimming. If you need trimming but don't want splitting, you can also enable just trimming. I'm not sure why it becomes unresponsive, though; we can look into it. Can you provide more info: command, log files, versions?
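For reference, a minimal sketch of such commands; the profile, input sheet, and output directory are placeholders, and the --trim_fastq parameter name is taken from the sarek parameter docs rather than from this thread:

# skip fastp entirely: no splitting, no trimming
nextflow run nf-core/sarek -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --split_fastq 0

# trimming without splitting (--trim_fastq is assumed from the sarek docs)
nextflow run nf-core/sarek -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --split_fastq 0 \
    --trim_fastq true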
I think I ran into this same problem with my current run.
I started a 160-sample joint-germline variant calling run, and both the network traffic and the processor load on the cluster's Ceph gateway immediately went through the roof as the fastp jobs spawned, to the point where the system started taking nodes offline because they were unresponsive. Jobs did eventually get through, and nodes came back online, but performance was severely impacted for other users.
As a workaround, I killed the run and added the following to a custom config:
executor {
    name = 'pbspro'
    submitRateLimit = '1min'
}
Resuming the run then allowed it to continue, as it was only spawning one job per minute (which is probably overly conservative). Once the fastp step was done, I increased the rate to 10/min, and it's still running smoothly so far.
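For completeness, the increased rate in the same custom config would look something like this, assuming Nextflow's 'count/period' syntax for submitRateLimit:

executor {
    name = 'pbspro'
    submitRateLimit = '10/1min'  // roughly ten job submissions per minute
}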
My guess is that fastp does some particularly intensive read/write, and trying to run 100 processes at once simply overloads the system, but I'm not really au fait with the details of either fastp or cluster architecture!
Hi, just to maybe "close" this: I would second your interpretation that this is a limitation of your Ceph storage system. Ceph is not strictly a high-performance, highly parallel file system, so it's quite reasonable to assume that fastp is overloading it in your particular setting. There is unfortunately nothing that can easily be done on the pipeline side, other than the things suggested by Frederike. Your workaround seems like the best solution, since this is more or less a limitation of your local setting.
@marchoeppner, thank you for chiming in here. This confirms what we suspected before: it is a setup issue and not a sarek issue. Given that these steps can be skipped, and the execution thus tailored to the particular local setup, I will close this issue for now. If you have any more issues with respect to this, please re-open :)