LSF Cluster worker killed without reaching maximum memory limit when used with TPOT and featuretools

Open hoangthienan95 opened this issue 5 years ago • 2 comments

Thank you for the amazing work on dask and dask-jobqueue. I was very excited to test it on my institute's HPC on top of LSF.

However, when I use the dask backend with either TPOT or Featuretools, there are a lot of worker-restart warnings, and then the computation terminates with a KilledWorker error (I suspect Dask hit the default limit of 3 retries). The strange thing is that I was monitoring the workers on the dashboard when they died, and at no point did they go over the memory limit I specified at initialization (they were always at about 1/10 of the limit).

The code and screen recording of worker monitoring can be found here:

  • Notebook (the first section is data processing; I used dask in the "attempt 2" section): https://github.com/hoangthienan95/sharing/blob/master/TPOT_test_error_report.ipynb
  • Video (program terminated at 1:33 - 1:40): https://streamable.com/zenu6

I posted more information about the error on the TPOT issue tracker. Originally posted by @hoangthienan95 in https://github.com/EpistasisLab/tpot/issues/847#issuecomment-473577730


For featuretools, I'm trying to create a reproducible error (it's very finicky) and will update later, but essentially I was trying to create 2 million unique features on the same dataset (22k rows, ~70 columns). I put them in a dask bag and mapped featuretools' calculate_feature_matrix function over it. I described it in more detail in this SO question. The problem is the same: KilledWorker without reaching the memory limit (of up to 256 GB!) and very slow or terminated computation.
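For context, the pattern described above (chunking feature definitions into a dask bag and mapping a computation over them) can be sketched roughly as follows. This is a minimal, hypothetical stand-in: `compute_chunk` replaces the real `calculate_feature_matrix` call (which would also need the EntitySet), and the sizes are tiny placeholders for the ~2M features.

```python
import dask.bag as db

# Hypothetical stand-in for featuretools' calculate_feature_matrix applied
# to one chunk of feature definitions (the real call also takes the data).
def compute_chunk(feature_chunk):
    return [f * 2 for f in feature_chunk]  # placeholder computation

features = list(range(100))  # stands in for ~2M feature definitions
chunks = [features[i:i + 10] for i in range(0, len(features), 10)]

# One partition per chunk; each partition becomes a task on a worker.
bag = db.from_sequence(chunks, npartitions=10)
results = bag.map(compute_chunk).compute(scheduler="threads")
```

With a distributed cluster attached, `.compute()` would run these chunks on the LSF workers instead of local threads.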

I have read a lot of tutorials/documentation/GitHub issues on Dask, TPOT and Featuretools and still can't find an answer. I also don't know whether the problem is with dask or with the other libraries.

FYI, the cluster I'm using looks like this: (two screenshots of the cluster configuration attached)

Any help in diagnosing the problem is greatly appreciated!

hoangthienan95 avatar Mar 18 '19 22:03 hoangthienan95

Hi @hoangthienan95, and thanks for the kind words; this is really appreciated, and I hope you will enjoy using dask-jobqueue!

First of all, this is not the first time a problem has been reported with TPOT (see https://github.com/dask/distributed/issues/2540 or #169), with the same kind of error you're experiencing. cc @TomAugspurger.

In your video, the charts of worker memory use look strange, with vertical bars instead of horizontal ones, so this is not clear. In your notebook, in attempt 2, you apparently didn't change the memory parameter of your cluster, so you still have 15GB per worker.

A few thoughts on your config, even if this is probably not linked to your problem: you should use bigger workers, e.g. a full compute node or at least half of one (cores=24, memory="256GB"). You should also set the local_directory kwarg; I see you've got plenty of dask-worker folders on your shared file system, which is never a good thing.
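The suggestions above could look roughly like the configuration sketch below. This is an assumption-laden example, not your exact setup: the `processes` split, queue name and scale are hypothetical, and `local_directory` should point at node-local scratch space on your cluster, not the shared file system.

```python
from dask_jobqueue import LSFCluster
from dask.distributed import Client

cluster = LSFCluster(
    cores=24,            # a full (or half) compute node per LSF job
    memory="256GB",      # total memory requested per LSF job
    processes=4,         # hypothetical split: 4 workers x 6 threads, ~64GB each
    local_directory="/tmp/dask-worker-space",  # node-local scratch, not shared FS
)
cluster.scale(jobs=4)    # hypothetical: launch 4 LSF jobs
client = Client(cluster)
```

Note that `memory` is per LSF job, so with `processes=4` each worker's memory limit is a quarter of it.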

To be sure about the problem, please inspect the LSF jobs' stderr and stdout to get the dask worker logs, and see whether there is anything about a memory issue. I think this is the main step from http://jobqueue.dask.org/en/latest/debug.html that is missing from your otherwise detailed analysis.
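On LSF this log inspection could look something like the commands below. This is a sketch: `<jobid>` is a placeholder, and the log file names depend on the `-o`/`-e` options dask-jobqueue submitted with, so adjust the paths to your site.

```shell
# Peek at a running LSF job's output (stdout/stderr so far)
bpeek <jobid>

# Show full details of a job, including where its output goes
bjobs -l <jobid>

# Once jobs have finished, search the written logs for memory trouble
grep -iE "memory|killed|restart" dask-worker-*.out dask-worker-*.err
```

Lines mentioning the worker being killed by the nanny, or an out-of-memory kill from LSF itself, would point to where the limit is actually being hit.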

guillaumeeb avatar Mar 22 '19 12:03 guillaumeeb

Hi @hoangthienan95, did you have time to investigate this further?

guillaumeeb avatar May 09 '19 14:05 guillaumeeb

Closing this issue as stale. @hoangthienan95 feel free to reopen if you're still working on this.

guillaumeeb avatar Aug 30 '22 06:08 guillaumeeb