LSF Cluster worker killed without reaching maximum memory limit when used with TPOT and featuretools

Open hoangthienan95 opened this issue 5 years ago • 2 comments

Thank you for the amazing work on dask and dask-jobqueue. I was very excited to test it on my institute's HPC on top of LSF.

However, when I use the dask backend with either TPOT or Featuretools, there are a lot of worker-restart warnings, and then the computation terminates with a KilledWorker error (I suspect Dask hit the default limit of 3 retries). The strange thing is that I was monitoring the workers on the dashboard when they died, and at no point did they go over the memory limit I specified at initialization (they were always at about 1/10 of the limit).

The code and screen recording of worker monitoring can be found here:

  • Notebook (the first section is data processing; I used dask in the "attempt 2" section): https://github.com/hoangthienan95/sharing/blob/master/TPOT_test_error_report.ipynb
  • Video (program terminated at 1:33 - 1:40): https://streamable.com/zenu6

I posted more information about the error on the TPOT issue tracker. Originally posted by @hoangthienan95 in https://github.com/EpistasisLab/tpot/issues/847#issuecomment-473577730


For featuretools, I'm trying to create a reproducible error (it's very finicky) and will update later, but essentially I was trying to create 2 million unique features on the same dataset (22k rows, ~70 columns). I put them in a dask bag and mapped featuretools' calculate_feature_matrix function over it. I described it in more detail in this SO question. The problem is the same: KilledWorker without reaching the memory limit (of up to 256 GB!) and very slow or terminated computation.
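For context, the pattern described above (chunking feature definitions into a dask bag and mapping a computation over them) can be sketched roughly as follows. This is a minimal, hypothetical stand-in: `compute_chunk` replaces the real `calculate_feature_matrix` call (which would also need the EntitySet), and the sizes are tiny placeholders for the ~2M features.

```python
import dask.bag as db

# Hypothetical stand-in for featuretools' calculate_feature_matrix applied
# to one chunk of feature definitions (the real call also takes the data).
def compute_chunk(feature_chunk):
    return [f * 2 for f in feature_chunk]  # placeholder computation

features = list(range(100))  # stands in for ~2M feature definitions
chunks = [features[i:i + 10] for i in range(0, len(features), 10)]

# One partition per chunk; each partition becomes a task on a worker.
bag = db.from_sequence(chunks, npartitions=10)
results = bag.map(compute_chunk).compute(scheduler="threads")
```

With a distributed cluster attached, `.compute()` would run these chunks on the LSF workers instead of local threads.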

I have read a lot of tutorials/documentation/GitHub issues on Dask, TPOT and Featuretools and still can't find an answer. I also don't know whether the problem is with dask or with the other libraries.

FYI, the cluster I'm using looks like this: (two screenshots of the cluster configuration attached)

Any help in diagnosing the problem is greatly appreciated!

hoangthienan95 avatar Mar 18 '19 22:03 hoangthienan95

Hi @hoangthienan95, and thanks for the kind words; this is really appreciated, and I hope you will enjoy using dask-jobqueue!

First of all, this is not the first time a problem has been reported with TPOT (see https://github.com/dask/distributed/issues/2540 or #169), with the same kind of error you're experiencing. cc @TomAugspurger.

In your video, the charts of worker memory use look strange, with vertical bars instead of horizontal ones, so this is not clear. In your notebook, in attempt 2, you apparently didn't change the memory parameter of your cluster, so you still have 15GB per worker.

A few thoughts on your config, even if this is probably not linked to your problem: you should use bigger workers, e.g. a full compute node or at least half of one (cores=24, memory="256GB"). You should also set the local_directory kwarg; I see you've got plenty of dask-worker folders on your shared file system, which is never a good thing.
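The suggestions above could look roughly like the configuration sketch below. This is an assumption-laden example, not your exact setup: the `processes` split, queue name and scale are hypothetical, and `local_directory` should point at node-local scratch space on your cluster, not the shared file system.

```python
from dask_jobqueue import LSFCluster
from dask.distributed import Client

cluster = LSFCluster(
    cores=24,            # a full (or half) compute node per LSF job
    memory="256GB",      # total memory requested per LSF job
    processes=4,         # hypothetical split: 4 workers x 6 threads, ~64GB each
    local_directory="/tmp/dask-worker-space",  # node-local scratch, not shared FS
)
cluster.scale(jobs=4)    # hypothetical: launch 4 LSF jobs
client = Client(cluster)
```

Note that `memory` is per LSF job, so with `processes=4` each worker's memory limit is a quarter of it.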

To be sure about the problem, please inspect the LSF jobs' stderr and stdout to get the dask worker logs, and see whether there is anything about a memory issue. I think this is the main step from http://jobqueue.dask.org/en/latest/debug.html that is missing from your otherwise detailed analysis.
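On LSF this log inspection could look something like the commands below. This is a sketch: `<jobid>` is a placeholder, and the log file names depend on the `-o`/`-e` options dask-jobqueue submitted with, so adjust the paths to your site.

```shell
# Peek at a running LSF job's output (stdout/stderr so far)
bpeek <jobid>

# Show full details of a job, including where its output goes
bjobs -l <jobid>

# Once jobs have finished, search the written logs for memory trouble
grep -iE "memory|killed|restart" dask-worker-*.out dask-worker-*.err
```

Lines mentioning the worker being killed by the nanny, or an out-of-memory kill from LSF itself, would point to where the limit is actually being hit.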

guillaumeeb avatar Mar 22 '19 12:03 guillaumeeb

Hi @hoangthienan95, did you have time to investigate this further?

guillaumeeb avatar May 09 '19 14:05 guillaumeeb

Closing this issue as stale. @hoangthienan95 feel free to reopen if you're still working on this.

guillaumeeb avatar Aug 30 '22 06:08 guillaumeeb