joblib "Broken Pipe" using scikit-learn grid-search crossfold validation after importing Comet ML
Describe the Bug
After importing comet_ml, a scikit-learn-based training script fails during GridSearchCV cross-validation with a "Broken pipe" exception in joblib. The same script runs fine without the comet_ml import.
Expected behavior
The training script should execute to completion with comet_ml imported.
Where is the issue?
Third Party Integrations (scikit-learn). Stack trace indicates calls into comet_ml monkey-patching.
To Reproduce
Steps to reproduce the behavior:
- import comet_ml
- instantiate a Comet ML experiment
- call a few exp.log... statements
- instantiate a scikit-learn GridSearchCV and call fit to start the search (a minimal sketch follows below)
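A minimal, hypothetical reproduction sketch (the estimator, data, and parameter grid below are placeholders; the actual script uses an XGBoost regressor inside a pipeline):

# Hypothetical reproduction; estimator, data, and grid are placeholders
import comet_ml  # imported first so auto-logging can patch scikit-learn

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

exp = comet_ml.Experiment(project_name="broken-pipe-repro")  # Comet ML experiment
exp.log_other("note", "grid search reproduction")            # a few exp.log... calls

X_train, y_train = make_regression(n_samples=200, n_features=5, random_state=0)

params = {"n_estimators": [50, 100], "max_depth": [3, 5]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), params,
                    cv=5, n_jobs=-1, scoring="neg_mean_squared_error", verbose=5)
grid = grid.fit(X_train, y_train)  # BrokenPipeError is raised here with comet_ml imported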
Stack Trace
Fitting 5 folds for each of 36 candidates, totalling 180 fits
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
r = call_item()
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 598, in __call__
return [func(*args, **kwargs)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 598, in <listcomp>
return [func(*args, **kwargs)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 129, in __call__
return self.function(*args, **kwargs)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 949, in _fit_and_score
print(end_msg)
BrokenPipeError: [Errno 32] Broken pipe
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 994, in <module>
main()
File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 972, in main
run_experiment(
File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 771, in run_experiment
pred_df, y_test, best_grid_rgr, X_train, X, y = run_xgb(df, pre_pipe, post_pipe, params)
File "/home/redacted/projects/redacted/redacted/redacted/redacted-final_model.py", line 740, in run_xgb
grid = grid.fit(X_train, y_train)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/comet_ml/monkey_patching.py", line 316, in wrapper
raise exception_raised
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/comet_ml/monkey_patching.py", line 287, in wrapper
return_value = original(*args, **kwargs)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/base.py", line 1474, in wrapper
return fit_method(estimator, *args, **kwargs)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 970, in fit
self._run_search(evaluate_candidates)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 1527, in _run_search
evaluate_candidates(ParameterGrid(self.param_grid))
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/model_selection/_search.py", line 916, in evaluate_candidates
out = parallel(
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 67, in __call__
return super().__call__(iterable_with_config)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 2007, in __call__
return output if self.return_generator else list(output)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1650, in _get_outputs
yield from self._retrieve()
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1754, in _retrieve
self._raise_error_fast()
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
error_job.get_result(self.timeout)
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 745, in get_result
return self._return_or_raise()
File "/home/redacted/projects/redacted/redacted/redacted/venv/lib/python3.10/site-packages/joblib/parallel.py", line 763, in _return_or_raise
raise self._result
BrokenPipeError: [Errno 32] Broken pipe
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : outside_cheese_2516
Comet Debug Log
Screenshots or GIFs
N/A
Additional context (code fragment - fails on grid.fit)
# Instantiate & Fit Grid Search Object
grid = GridSearchCV(rgr, params, cv=5, n_jobs=-1, scoring=scoring, verbose=5)
grid = grid.fit(X_train, y_train)
Looking through your log (search for "Traceback") I see this issue:
comet_ml.vendor.nvidia_ml.pynvml.NVMLError_NotSupported: Not Supported
but that shouldn't cause any issues. I also see:
[[13.3],
[33.9],
[54.5],
[75.1],
[95.7]]
ValueError: can only convert an array of size 1 to a Python scalar
which could be a Comet bug.
Also:
ModuleNotFoundError: No module named 'graphviz'
Try pip install graphviz (or another dot package) to see if that helps.
I addressed each of these issues except the NVML error (likely related to the GPU drivers needed to log GPU metrics). The ValueError cleared up when I set COMET_DISABLE_AUTO_LOGGING=1, and I installed graphviz to clear up that issue.
I would agree that the ValueError seems like a CometML bug.
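For reference, a small sketch of the workaround (the environment variable needs to be set before comet_ml is imported; the experiment name is illustrative):

# Disable Comet auto-logging before importing comet_ml
import os
os.environ["COMET_DISABLE_AUTO_LOGGING"] = "1"

import comet_ml  # imported after the flag is set

exp = comet_ml.Experiment(project_name="broken-pipe-repro")
# ... run GridSearchCV as before; results can still be logged manually, e.g.:
# exp.log_metric("best_score", grid.best_score_)
# exp.log_parameters(grid.best_params_)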
@DFuller134 thanks for your update! I'll pass the details of the NVMLError_NotSupported error on to our engineering team.
Do you know what [[33.9], [54.5], [75.1], [95.7]] is? If you are trying to log a parameter (or step or epoch) value, it can't be a list of values. I believe these are the only places this error could come from.
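A hedged sketch of the distinction (the names below are illustrative): parameter, metric, and step/epoch values should be scalars, not arrays:

import comet_ml

exp = comet_ml.Experiment(project_name="scalar-logging-example")

exp.log_parameter("learning_rate", 0.1)   # OK: a single scalar value
exp.log_metric("rmse", 13.3, step=1)      # OK: scalar metric at an integer step

# Problematic pattern: passing an array-like where a scalar is expected,
# e.g. a column vector of values as a metric or step -- this is the kind of
# input that can produce "can only convert an array of size 1 to a Python scalar"
# exp.log_metric("rmse", [[13.3], [33.9], [54.5]])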