TPOT stuck at 75th generation with no errors
I am running the GPU-accelerated (Dask) configuration of TPOT (version 0.11.7) on a few different datasets, using the TPOT cuML configuration. I am using Python 3 with Anaconda.
For every dataset, TPOT gets stuck at generation 74 or 75, regardless of size (they range from 480 rows × 10 columns up to 9000 rows × 83 columns). No error is output; the periodic checkpoint folder just stops updating and no new messages appear. I left it running for 8 hours and nothing new came up. I changed the random seed of the TPOT regressor to rule out an issue with a specific model architecture, but with a different seed it still gets stuck at generation 75.
My TPOT regressor looks as follows:

```python
tpot = TPOTRegressor(
    verbosity=2,
    use_dask=True,
    n_jobs=-1,
    cv=5,
    random_state=42,  # this was changed, as mentioned above
    template='Regressor',
    config_dict='TPOT cuML',
    periodic_checkpoint_folder='../checkpoints/{}/'.format(target),
    max_time_mins=None,
)
```
Any idea how to solve this issue, or why it happens every time across different datasets? Thank you!
You should use n_jobs=1 (the default). cuML is currently designed for the "one process per GPU" paradigm. Additionally, how are you setting up your Dask cluster?
It might be valuable to test your system and environment with this example gist, or to confirm your configuration is similar.
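For anyone else hitting this, here is a minimal sketch of the setup suggested above, assuming `dask_cuda` is installed; the `LocalCUDACluster` usage and the `target` placeholder are my assumptions, not something confirmed in this thread:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from tpot import TPOTRegressor

# One Dask worker per visible GPU ("one process per GPU").
cluster = LocalCUDACluster()
client = Client(cluster)

target = 'my_target'  # placeholder for the target name used in the original post

tpot = TPOTRegressor(
    verbosity=2,
    use_dask=True,
    n_jobs=1,  # let Dask/cuML handle the GPU parallelism instead of -1
    cv=5,
    random_state=42,
    template='Regressor',
    config_dict='TPOT cuML',
    periodic_checkpoint_folder='../checkpoints/{}/'.format(target),
    max_time_mins=None,
)
```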
I just got caught by this; there needs to be a better error message when using cuML and leaving n_jobs set to -1.
If the maintainers are open to it, perhaps we could open a PR that validates the n_jobs parameter when the cuML configuration is used.
Yes, probably just checking that n_jobs > 0, as I think you can use multiple GPUs.
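As a rough, hypothetical sketch of what that validation could look like (not actual TPOT internals; the function name and message are made up):

```python
def _check_cuml_n_jobs(config_dict, n_jobs):
    """Raise a clear error instead of silently hanging when the cuML config is used with n_jobs <= 0."""
    if config_dict == 'TPOT cuML' and (n_jobs is None or n_jobs <= 0):
        raise ValueError(
            "The 'TPOT cuML' configuration requires n_jobs > 0 "
            "(cuML follows a one-process-per-GPU model); got n_jobs={}.".format(n_jobs)
        )
```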