parfit
error: 'i' format requires -2147483648 <= number <= 2147483647
Hi,
I get the error: 'i' format requires -2147483648 <= number <= 2147483647
I am doing exactly the same as in README.md, except I am using RandomForestRegressor().
Full error:
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 346, in _sendback_result
exception=exception))
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/externals/loky/backend/queues.py", line 241, in put
self._writer.send_bytes(obj)
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/Users/avinash/anaconda3/envs/venv_py3.6/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""
The above exception was the direct cause of the following exception:
error Traceback (most recent call last)
<ipython-input-10-f10ba30832f6> in <module>
11 X_train_5, y_train_5, X_test_5, y_test_5, # nfolds=5 [optional, instead of validation set]
12 metric=roc_auc_score, greater_is_better=True,
---> 13 scoreLabel='AUC')
14
15 print(best_model, best_score)
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/parfit.py in bestFit(model, paramGrid, X_train, y_train, X_val, y_val, nfolds, metric, greater_is_better, predict_proba, showPlot, scoreLabel, vrange, cmap, n_jobs, verbose)
63 else:
64 print("-------------FITTING MODELS-------------")
---> 65 models = fitModels(model, paramGrid, X_train, y_train, n_jobs, verbose)
66 print("-------------SCORING MODELS-------------")
67 scores = scoreModels(models, X_val, y_val, metric, predict_proba, n_jobs, verbose)
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/parfit/fit.py in fitModels(model, paramGrid, X, y, n_jobs, verbose)
49 myModels = fitModels(model, paramGrid, X_train, y_train)
50 """
---> 51 return Parallel(n_jobs=n_jobs, verbose=verbose)(delayed(fitOne)(model, X, y, params) for params in paramGrid)
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
994
995 with self._backend.retrieval_context():
--> 996 self.retrieve()
997 # Make sure that we get a last message telling us we are done
998 elapsed_time = time.time() - self._start_time
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
897 try:
898 if getattr(self._backend, 'supports_timeout', False):
--> 899 self._output.extend(job.get(timeout=self.timeout))
900 else:
901 self._output.extend(job.get())
~/anaconda3/envs/venv_py3.6/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
515 AsyncResults.get from multiprocessing."""
516 try:
--> 517 return future.result(timeout=timeout)
518 except LokyTimeoutError:
519 raise TimeoutError()
~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
~/anaconda3/envs/venv_py3.6/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
error: 'i' format requires -2147483648 <= number <= 2147483647
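The innermost frame of the remote traceback points at the actual limit: multiprocessing.connection frames every message with a signed 32-bit length header, so a pickled payload larger than 2**31 - 1 bytes (about 2 GiB) cannot be sent back from a loky worker at all. A minimal stdlib sketch of the same failure:

```python
import struct

# multiprocessing.connection writes the message length with struct.pack("!i", n),
# i.e. a signed 32-bit header, so any pickled payload larger than
# 2**31 - 1 bytes (~2 GiB) cannot be framed -- this is the failing call.
LIMIT = 2**31 - 1

struct.pack("!i", LIMIT)  # the largest payload length that still fits

try:
    struct.pack("!i", LIMIT + 1)  # one byte past the limit
except struct.error as exc:
    print(exc)  # 'i' format requires -2147483648 <= number <= 2147483647
```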
Please help.
Hi @avinash-mishra,
Thanks for raising this issue. Could you please share with me the ParameterGrid object you are searching over?
Hi @jmcarpenter2 Thanks for the quick reply.
grid = {
'min_samples_leaf': [1, 5, 10],
'max_features': ['sqrt'],
'n_estimators': [60],
'n_jobs': [-1],
'random_state': [42]
}
paramGrid = ParameterGrid(grid)
best_model, best_score, all_models, all_scores = bestFit(RandomForestRegressor(), paramGrid,
X_train_5, y_train_5, X_test_5, y_test_5, # nfolds=5 [optional, instead of validation set]
metric=roc_auc_score, greater_is_better=True,
scoreLabel='AUC')
print(best_model, best_score)
The ParameterGrid is exactly the same as the one given in the README file. I searched around and found an SO link.
Some people there say that pickling the model object is way too heavy. My dataframes look like this:
display(X_train_5.shape)
display(y_train_5.shape)
display(X_test_5.shape)
display(y_test_5.shape)
(16861, 119)
(16861, 329)
(1240, 119)
(1240, 329)
I hope this helps you look into the issue and suggest a fix.
Hi @avinash-mishra,
This is an interesting issue. It appears to have something to do with the combination of training models on very large dataframes and the fact that parfit uses multiprocessing rather than multithreading under the hood. I will look into solutions, but it may take a while to implement a fix that resolves your use case.
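One way to sanity-check this diagnosis is to measure how large an object's pickle is before a worker has to send it back through the pipe. This is only a sketch, not part of parfit's API, and `check_payload` is a hypothetical helper standing in for whatever a fitted model would be:

```python
import pickle

# Signed 32-bit length header used by multiprocessing.connection
# (on Python 3.7 and earlier); payloads above this size cannot be
# returned from a worker process in one framed message.
LIMIT = 2**31 - 1

def check_payload(obj):
    """Hypothetical helper: report the pickled size of `obj` (a stand-in
    for a fitted model) and whether it fits in a single framed message."""
    n = len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return n, n <= LIMIT

size, fits = check_payload(list(range(1000)))
print(size, fits)  # a small list easily fits; a large fitted forest may not
```

If I remember correctly, Python 3.8 lifted this particular 2 GiB framing limit in multiprocessing, so upgrading the interpreter may also sidestep the error, though the fitted models would still be very heavy to pickle.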
As a side note, I am wondering why your y_train_5 and y_test_5 dataframes have 210 more columns than X_train_5 and X_test_5? Shouldn't y be a pandas Series (i.e. a 1-column dataframe)?
Thanks
Hi @jmcarpenter2 It is a multi-output regression problem, a specific use case: I have to predict multiple columns, not just one.