dask-jobqueue
LSF Cluster worker killed without reaching maximum memory limit when used with TPOT and featuretools
Thank you for the amazing work on dask and dask-jobqueue. I was very excited to test it on my institute's HPC on top of LSF.
However, when I use the dask backend with either TPOT or Featuretools, I get a lot of Worker restart
warnings and then the computation terminates with a Killed Worker
error (I suspect Dask reached the default limit of 3 retries). The strange thing is that I was monitoring the workers on the dashboard when they died, and at no point did they go over the memory limit I specified at initialization (they always stayed at about 1/10 of the limit).
The code and screen recording of worker monitoring can be found here:
- Notebook (the first section is data processing; I used dask in the attempt 2 section): https://github.com/hoangthienan95/sharing/blob/master/TPOT_test_error_report.ipynb
- Video (program terminated at 1:33 - 1:40): https://streamable.com/zenu6
I posted more information about the error on the TPOT issue page: https://github.com/EpistasisLab/tpot/issues/847#issuecomment-473577730 (originally posted by @hoangthienan95).
For featuretools, I'm trying to create a reproducible error (it's very finicky) and will update later, but essentially I was trying to create 2 million unique features on the same dataset (22k rows, ~70 columns). I put them in a dask bag
and mapped featuretools' calculate_feature_matrix function over it. I described it in more detail in this SO question. The problem is the same: Killed worker without reaching the memory limit (of up to 256 GB!) and very slow/terminated computation.
I have read a lot of tutorials/documentation/GitHub issues on Dask, TPOT and Featuretools and still can't find an answer. I also don't know if it's a problem with dask or with the other libraries.
FYI, the cluster I'm using looks like this:
Any help in diagnosing the problem is greatly appreciated!
Hi @hoangthienan95, and thanks for the kind introduction; it is really appreciated, and I hope you will enjoy using dask-jobqueue!
First of all, this is not the first time I've seen a problem reported with TPOT, see https://github.com/dask/distributed/issues/2540 or #169, with the same kind of error you're experiencing. cc @TomAugspurger.
In your video, the worker-memory charts look strange, with vertical bars instead of horizontal ones, so they are hard to read. In your notebook, attempt 2, you apparently didn't change the memory parameter of your cluster, so you still have 15 GB per worker.
A few thoughts on your config, even if this is probably not linked to your problem: you should use bigger workers, e.g. a full compute node, or at least half of one (cores=24, memory=256GB). You should also define the local_directory
kwarg; I see you've got plenty of dask-worker folders on your shared file system, and that is never a good thing.
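The two suggestions above can be sketched roughly as follows. This is a hedged example, not the reporter's actual configuration: the queue name, walltime, and node sizes are assumptions to be adapted to the cluster at hand.

```python
# Sketch of the suggested LSFCluster config: bigger workers (roughly a half
# or full compute node) and a node-local scratch directory instead of the
# shared file system. Queue name and walltime below are placeholders.
from dask_jobqueue import LSFCluster
from dask.distributed import Client

cluster = LSFCluster(
    cores=24,                      # half a node, as suggested above
    memory="256GB",                # matches the suggested per-worker memory
    local_directory="/tmp/dask",   # node-local scratch, NOT the shared FS
    walltime="02:00",              # placeholder; adjust to your queue limits
)
cluster.scale(4)                   # request 4 such workers
client = Client(cluster)
```

With `local_directory` pointing at node-local storage, workers spill to fast local disk instead of littering (and hammering) the shared file system.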
In order to be sure of the problem, please inspect the LSF jobs' stderr and stdout to get the dask logs, and see if there is anything about a memory issue. I think this is the main step from http://jobqueue.dask.org/en/latest/debug.html that is missing from your otherwise detailed analysis.
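A minimal sketch of that log inspection, assuming the LSF output files land in the submission directory; the file-name pattern is an assumption and should be adjusted to whatever `-o`/`-e` names your job script uses.

```shell
# Scan dask worker job logs for memory-related messages.
# "dask-worker-*.err" is an assumed naming pattern; adjust as needed.
for f in dask-worker-*.err; do
  [ -e "$f" ] || continue          # skip cleanly if no logs match
  echo "== $f =="
  grep -iE "memory|killed|restart" "$f" || true
done
```

Lines from `distributed.nanny` warning that a worker exceeded its memory budget would confirm (or rule out) a memory problem.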
Hi @hoangthienan95, did you have time to investigate this further?
Closing this issue as stale. @hoangthienan95 feel free to reopen if you're still working on this.