Investigate why Spaceflights project failing with `ParallelRunner`
Description
Flagged by failing CI on kedro-docker
https://github.com/kedro-org/kedro-plugins/issues/558
Basically, scikit-learn (which is a dependency of the spaceflights-* starters) had a new release on 16th Feb - https://pypi.org/project/scikit-learn/1.4.1.post1/ - which doesn't play well with the ParallelRunner.
Context
Stacktrace
INFO Running node: train_model_node: train_model([X_train;y_train]) -> [regressor] (node.py:340)
ERROR Node train_model_node: train_model([X_train;y_train]) -> failed with error: cannot set WRITEABLE flag to True of this array (node.py:365)
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/parallel_runner.py", line 91, in _run_node_synchronization
return run_node(node, catalog, hook_manager, is_async, session_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 331, in run_node
node = _run_node_sequential(node, catalog, hook_manager, session_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 424, in _run_node_sequential
outputs = _call_node_run(
^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 390, in _call_node_run
raise exc
File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 380, in _call_node_run
outputs = node.run(inputs)
^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/pipeline/node.py", line 371, in run
raise exc
File "/Users/ankita_katiyar/kedro/kedro/kedro/pipeline/node.py", line 357, in run
outputs = self._run_with_list(inputs, self._inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/pipeline/node.py", line 402, in _run_with_list
return self._func(*(inputs[item] for item in node_inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/.Trash/demo-project/src/demo_project/pipelines/data_science/nodes.py", line 38, in train_model
regressor.fit(X_train, y_train)
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
return fit_method(estimator, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 578, in fit
X, y = self._validate_data(
^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
X, y = check_X_y(X, y, **check_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1279, in check_X_y
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1289, in _check_y
y = check_array(
^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1097, in check_array
array.flags.writeable = True
^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/bin/kedro", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/cli/cli.py", line 198, in main
cli_collection()
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/cli/cli.py", line 127, in main
super().main(
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/cli/project.py", line 225, in run
session.run(
File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/session/session.py", line 392, in run
run_result = runner.run(
^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 117, in run
self._run(pipeline, catalog, hook_or_null_manager, session_id) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/parallel_runner.py", line 314, in _run
node = future.result()
^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
ValueError: cannot set WRITEABLE flag to True of this array
Related https://github.com/scikit-learn/scikit-learn/pull/28348
Steps to Reproduce
kedro run --runner=ParallelRunner
Also a reminder to revert the changes in https://github.com/kedro-org/kedro-plugins/pull/591 after this is resolved.
The outcome for this ticket is to investigate the root cause and propose a solution to fix it.
Potential causes:
- Version of scikit-learn
- ParallelRunner
- Starter
Tested with:
- scikit-learn==1.4.1.post1
- numpy==1.26.4
Things explored so far:
- The error happens when scikit-learn validates the input data: https://github.com/scikit-learn/scikit-learn/blob/941acc419b8e7bec86fdc6b27ab3c4703022f140/sklearn/utils/validation.py#L1099
- The validation includes converting the input data to a numpy array and then setting array.flags.writeable = True: https://github.com/scikit-learn/scikit-learn/blob/941acc419b8e7bec86fdc6b27ab3c4703022f140/sklearn/utils/_array_api.py#L712
- Setting the above flag causes the error: ValueError: cannot set WRITEABLE flag to True of this array
- If you check the flags of the converted array, you will see that OWNDATA is False, i.e. the array does not own the memory it uses but borrows it from another object (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html):
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
- From here (and local experiments) we see that we cannot change the WRITEABLE attribute if the original array has WRITEABLE=False - see the snippet after this list
- So it looks like the created numpy array shares memory with another object which is not WRITEABLE
- The numpy array is created straight from the provided input, and the error happens only for pandas.core.series.Series; the same conversion works fine for pandas.core.frame.DataFrame
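The following numpy-only snippet (not from the project; just to illustrate the flag behaviour) shows that an array backed by a read-only buffer borrows its memory (OWNDATA is False) and cannot be made writeable:
import numpy

# bytes objects are immutable, so the resulting array is a read-only view
buffer = bytes(8 * 8)
arr = numpy.frombuffer(buffer, dtype=numpy.float64)
print(arr.flags)  # OWNDATA : False, WRITEABLE : False
try:
    arr.flags.writeable = True
except ValueError as exc:
    print(exc)  # cannot set WRITEABLE flag to True of this array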
After further investigation, it was found that the problem appears after the object is retrieved from SharedMemoryDataset. In the example below we convert a pandas.core.series.Series to a numpy array and then set WRITEABLE=True, which works fine. After the object is saved to SharedMemoryDataset and retrieved again, the OWNDATA flag becomes False and changing WRITEABLE gives an error.
from pathlib import Path

import numpy
import pandas as pd

# ParallelRunnerManager and SharedMemoryDataset are kedro internals used by the
# ParallelRunner; the exact import paths may differ between kedro versions.
from kedro.io import SharedMemoryDataset
from kedro.runner.parallel_runner import ParallelRunnerManager

input_path = Path.cwd() / "data"
y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
# converting to series
y_train = y_train.stack()
print(type(y_train))
test_y = numpy.asarray(y_train, order=None, dtype=None)
print(test_y.flags)
test_y.flags.writeable = True

manager = ParallelRunnerManager()
manager.start()
dataset = SharedMemoryDataset(manager=manager)
dataset._save(y_train)
out = dataset._load()
print(type(out))
test_y = numpy.asarray(out, order=None, dtype=None)
print(test_y.flags)
test_y.flags.writeable = True
Output: the second print(test_y.flags) shows OWNDATA : False, and the final assignment raises ValueError: cannot set WRITEABLE flag to True of this array.
A further plan is to investigate what's happening in the SharedMemoryDataset, whether it's expected, and why it only affects pandas.core.series.Series.
In earlier scikit-learn versions (<= 1.4.0) the following step is absent, so the error does not happen:
# With an input pandas dataframe or series, we know we can always make the
# resulting array writeable:
# - if copy=True, we have already made a copy so it is fine to make the
# array writeable
# - if copy=False, the caller is telling us explicitly that we can do
# in-place modifications
# See https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays
# for more details about pandas copy-on-write mechanism, that is enabled by
# default in pandas 3.0.0.dev.
if _is_pandas_df_or_series(array_orig) and hasattr(array, "flags"):
array.flags.writeable = True
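Given that, pinning scikit-learn in the project requirements to a version without this step (for example scikit-learn<=1.4.0) could be a temporary workaround while the root cause is addressed, though this hasn't been verified here.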
With the test below it was confirmed that the problem is in SharedMemoryDataset, as exactly the same example as above but with MemoryDataset works fine.
from pathlib import Path

import numpy
import pandas as pd

from kedro.io import MemoryDataset

input_path = Path.cwd() / "data"
y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
# converting to series
y_train = y_train.stack()
print(type(y_train))
test_y = numpy.asarray(y_train, order=None, dtype=None)
print(test_y.flags)
test_y.flags.writeable = True

dataset = MemoryDataset()
dataset._save(y_train)
out = dataset._load()
print(type(out))
test_y = numpy.asarray(out, order=None, dtype=None)
print(test_y.flags)
test_y.flags.writeable = True
Further tests excluded the kedro code base. The actual problem happens when using multiprocessing.managers.BaseManager inside the ParallelRunner. We register MemoryDataset to be used with multiprocessing.managers.BaseManager as follows:
class ParallelRunnerManager(SyncManager):
    """``ParallelRunnerManager`` is used to create shared ``MemoryDataset``
    objects as default data sets in a pipeline.
    """


ParallelRunnerManager.register("MemoryDataset", MemoryDataset)
When running, the ParallelRunner places the MemoryDataset into shared memory and returns a proxy of the MemoryDataset object.
https://docs.python.org/3/library/multiprocessing.shared_memory.html
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.managers.BaseManager
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.managers.BaseProxy
After we retrieve the data from the MemoryDataset proxy object, we get this error when setting WRITEABLE=True:
from multiprocessing.managers import BaseManager
from pathlib import Path

import numpy
import pandas as pd

from kedro.io import MemoryDataset


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=('_save', '_load'))


def main():
    input_path = Path.cwd() / "data"
    y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
    y_train = y_train.stack()
    print(type(y_train))
    test_y = numpy.asarray(y_train, order=None, dtype=None)
    print(test_y.flags)
    test_y.flags.writeable = True

    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset._save(y_train)
    out = dataset._load()
    print(type(out))
    test_y_out = numpy.asarray(out, order=None, dtype=None)
    print(test_y_out.flags)
    test_y_out.flags.writeable = True


if __name__ == "__main__":
    main()
The reason for the above is that numpy doesn't allow arrays based on a read-only buffer to be set as writeable. A possible reason why the behaviour differs for pd.DataFrame and pd.Series is that the numpy.asarray() conversion happens in a different way, so that in the pd.DataFrame case we get a copy of the object.
Thus making a copy of the pd.Series object loaded from the MemoryDataset solves the problem:
import copy
from multiprocessing.managers import BaseManager
from pathlib import Path

import numpy
import pandas as pd

from kedro.io import MemoryDataset


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=('_save', '_load'))

input_path = Path.cwd() / "data"
y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
y_train = y_train.stack()
print(type(y_train))
test_y = numpy.asarray(y_train, order=None, dtype=None)
print(test_y.flags)
test_y.flags.writeable = True

manager = MyManager()
manager.start()
dataset = manager.MemoryDataset()
dataset._save(y_train)
out = copy.deepcopy(dataset._load())
print(type(out))
test_y_out = numpy.asarray(out, order=None, dtype=None)
print(test_y_out.flags)
test_y_out.flags.writeable = True
So the solution that might work for us is to modify the part where we retrieve data from the catalog before calling the node function here:
# in kedro/runner/runner.py (the snippet below assumes `import copy` and
# `import pandas as pd` at the top of the module)
def _run_node_sequential(
    node: Node,
    catalog: DataCatalog,
    hook_manager: PluginManager,
    session_id: str | None = None,
) -> Node:
    inputs = {}

    for name in node.inputs:
        hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)
        data = catalog.load(name)
        # proposed change: deep-copy pd.Series inputs after loading so they
        # no longer share a read-only buffer with the manager process
        if isinstance(data, pd.Series):
            inputs[name] = copy.deepcopy(data)
        else:
            inputs[name] = data
        hook_manager.hook.after_dataset_loaded(
            dataset_name=name, data=inputs[name], node=node
        )
Tested this locally and it works.
Summary:
- the problem relates to shared memory usage
- the problem is not on our side; at least it's not a bug made by us
- if we don't address it, it will most probably remain an issue with all new scikit-learn versions, as the behaviour seems valid on their side as well
- there's a solution described above, which doesn't look too good either
@noklam, @ankatiyar, @merelcht, @astrojuanlu need your thoughts here on whether we want to apply the suggested fix, though it might take time to follow through all my comments above 🙂
@ElenaKhaustova Can you point to the changes that you have made?
I wonder if there is anything we can report upstream, and whether we can create an example that strips away the kedro-related context. From what I've read the problem is not a bug in pandas or numpy, but rather scikit-learn doing a validation and updating the flag. So maybe we should report this upstream to scikit-learn.
cannot set WRITEABLE flag to True of this array
Google: https://www.google.com/search?q=cannot+set+writeable+flag+to+true+of+this+array&oq=cannot+set+WRITEABLE+flag+to+True+of+this+array&gs_lcrp=EgZjaHJvbWUqDAgAECMYJxiABBiKBTIMCAAQIxgnGIAEGIoFMggIARAAGBYYHjINCAIQABiGAxiABBiKBTINCAMQABiGAxiABBiKBdIBBzI4MmowajeoAgCwAgA&sourceid=chrome&ie=UTF-8
Searching for this bug there are tons of reports everywhere; some are library compatibility issues.
Is this a scikit-learn problem? It seems from your latest comment that you can reproduce the same issue even with just SharedMemoryDataset and numpy.
Can you also point me to the change that works?
@ElenaKhaustova Can you point to the changes that you have made?
I wonder if there is anything we can report upstream, and whether we can create an example that strips away the kedro-related context. From what I've read the problem is not a bug in pandas or numpy, but rather scikit-learn doing a validation and updating the flag. So maybe we should report this upstream to scikit-learn.
These are the changes: https://github.com/kedro-org/kedro/issues/3674#issuecomment-2045291676 I'll open a draft PR as well for better visibility.
Yes, we can strip away the kedro-related context by creating a fake MemoryDataset with save and load methods. We might try to report this, though it doesn't seem like a bug on their side either.
I can create a fake dataset and add the scikit-learn logic to showcase the error if we want to report it to them.
Oh sorry, I didn't notice that was the change. This reminds me of something. If you check MemoryDataset:
if copy_mode == "deepcopy":
    copied_data = copy.deepcopy(data)
elif copy_mode == "copy":
    copied_data = data.copy()
elif copy_mode == "assign":
    copied_data = data
We already have something like this, maybe we just need to update _infer_copy_mode?
def _infer_copy_mode(data: Any) -> str:
    """Infers the copy mode to use given the data type.

    Args:
        data: The data whose type will be used to infer the copy mode.

    Returns:
        One of "copy", "assign" or "deepcopy" as the copy mode to use.
    """
    try:
        import pandas as pd
    except ImportError:  # pragma: no cover
        pd = None  # type: ignore[assignment] # pragma: no cover
    try:
        import numpy as np
    except ImportError:  # pragma: no cover
        np = None  # type: ignore[assignment] # pragma: no cover

    if pd and isinstance(data, pd.DataFrame) or np and isinstance(data, np.ndarray):
        copy_mode = "copy"
    elif type(data).__name__ == "DataFrame":
        copy_mode = "assign"
    else:
        copy_mode = "deepcopy"
    return copy_mode
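Concretely, that suggestion might look something like the following (a hypothetical, untested sketch, not an actual kedro change), where pd.Series is treated the same way as pd.DataFrame:
from typing import Any


def _infer_copy_mode_with_series(data: Any) -> str:
    """Hypothetical variant of ``_infer_copy_mode`` that also copies Series."""
    try:
        import pandas as pd
    except ImportError:  # pragma: no cover
        pd = None
    try:
        import numpy as np
    except ImportError:  # pragma: no cover
        np = None

    # The only difference from the current implementation: pd.Series is
    # included here, so data of that type would be copied rather than assigned.
    if (
        pd
        and isinstance(data, (pd.DataFrame, pd.Series))
        or np
        and isinstance(data, np.ndarray)
    ):
        return "copy"
    if type(data).__name__ == "DataFrame":
        return "assign"
    return "deepcopy"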
Here is the draft pr: https://github.com/kedro-org/kedro/pull/3795/files
The problem is that we have to make the copy after the data has been retrieved from the catalog (after load()), but the copy mode from _infer_copy_mode is applied inside the dataset before that, so it doesn't change anything. So we cannot make the fix there 🙁
Example with stripped kedro logic to reproduce the error.
from concurrent.futures import ProcessPoolExecutor
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


def train_model(dataset: MemoryDataset) -> LinearRegression:
    regressor = LinearRegression()
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception as _:
        print(traceback.format_exc())
    return regressor


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))

    futures = set()

    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))

    with ProcessPoolExecutor(max_workers=1) as pool:
        futures.add(pool.submit(train_model, dataset))


if __name__ == "__main__":
    main()
Looks like this is mostly an upstream bug and there's little we can do about it. Unfortunately this means that ParallelRunner is mostly broken for a good chunk of basic use cases.
Removing this from our sprints, for now.
I can reproduce this issue with the SequentialRunner.
(Rough) steps:
- Create a project using spaceflights-pandas-viz
- Serialise X_test and y_test:
X_test:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_test.pkl

y_test:
  type: pickle.PickleDataset
  filepath: data/05_model_input/y_test.pkl
- Execute the pipeline until that point:
$ kedro run --to-outputs=X_test,y_test
- Run the pipeline from the inference node:
$ kedro run --from-nodes=evaluate_model_node
Full traceback:
Traceback (most recent call last):
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/bin/kedro", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/cli/cli.py", line 233, in main
cli_collection()
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/cli/cli.py", line 130, in main
super().main(
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/cli/project.py", line 225, in run
session.run(
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/session/session.py", line 395, in run
run_result = runner.run(
^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 117, in run
self._run(pipeline, catalog, hook_or_null_manager, session_id) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/sequential_runner.py", line 75, in _run
run_node(node, catalog, hook_manager, self._is_async, session_id)
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 413, in run_node
node = _run_node_sequential(node, catalog, hook_manager, session_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 506, in _run_node_sequential
outputs = _call_node_run(
^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 472, in _call_node_run
raise exc
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 462, in _call_node_run
outputs = node.run(inputs)
^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/pipeline/node.py", line 392, in run
raise exc
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/pipeline/node.py", line 378, in run
outputs = self._run_with_list(inputs, self._inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/pipeline/node.py", line 423, in _run_with_list
return self._func(*(inputs[item] for item in node_inputs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/src/spaceflights_mlflow/pipelines/data_science/nodes.py", line 54, in evaluate_model
mae = mean_absolute_error(y_test, y_pred)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/metrics/_regression.py", line 216, in mean_absolute_error
y_type, y_true, y_pred, multioutput = _check_reg_targets(
^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/metrics/_regression.py", line 112, in _check_reg_targets
y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array
array.flags.writeable = True
^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
It's funny because the array is WRITEABLE already anyway.
❯ python -m pdb -m kedro run --from-nodes=evaluate_model_node --params mlflow_run_id=4cba849c8f2d403887e95dbef1091142 --runner=SequentialRunner
> /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/__main__.py(1)<module>()
-> """Entry point when invoked with python -m kedro.""" # pragma: no cover
(Pdb) b /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:1107
Breakpoint 1 at /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:1107
(Pdb) c
[06/04/24 11:31:57] INFO Using `conf/logging.yml` as logging configuration. You can change __init__.py:249
this by setting the KEDRO_LOGGING_CONFIG environment variable
accordingly.
[06/04/24 11:32:01] INFO Kedro project spaceflights-mlflow session.py:324
INFO Registering new custom resolver: 'km.random_name' mlflow_hook.py:65
INFO The 'tracking_uri' key in mlflow.yml is relative kedro_mlflow_config.py:260
('server.mlflow_(tracking|registry)_uri = mlflow_runs').
It is converted to a valid uri:
'file:///Users/juan_cano/Projects/QuantumBlackLabs/kedro-
mlflow-playground/spaceflights-mlflow/mlflow_runs'
[06/04/24 11:32:08] INFO Logging extra metadata to MLflow hooks.py:13
INFO Using synchronous mode for loading and saving data. Use the sequential_runner.py:64
--async flag for potential performance gains.
https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_p
ipeline.html#load-and-save-asynchronously
INFO Loading data from regressor (MlflowModelTrackingDataset)... data_catalog.py:508
INFO Loading data from X_test (PickleDataset)... data_catalog.py:508
INFO Loading data from y_test (PickleDataset)... data_catalog.py:508
INFO Running node: evaluate_model_node: node.py:361
evaluate_model([regressor;X_test;y_test]) -> [metrics]
> /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py(1107)check_array()
-> array.flags.writeable = True
(Pdb) p array.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
Seems to have nothing to do with Kedro:
import pickle

from sklearn.metrics import mean_absolute_error

with open("_data/X_test.pkl", "rb") as fh:
    X_test = pickle.load(fh)

with open("_data/y_test.pkl", "rb") as fh:
    y_test = pickle.load(fh)

with open("_data/regressor.pickle", "rb") as fh:
    regressor = pickle.load(fh)

y_pred = regressor.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
Attaching the contents of _data.
And my uv pip freeze:
joblib==1.4.2
numpy==1.26.4
pandas==2.2.2
python-dateutil==2.9.0.post0
pytz==2024.1
scikit-learn==1.5.0
scipy==1.13.1
six==1.16.0
threadpoolctl==3.5.0
tzdata==2024.1
And Python version:
$ python -VV
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Seems to have nothing to do with Kedro:
That's sad it's still there in version 1.5.0. Maybe we can open one more issue on their side since it is a completely different example causing the same behaviour?
There's a PR which can mitigate the problem but not solve it completely, since there's an ongoing conversation about whether setting writeable=True is correct in general: https://github.com/scikit-learn/scikit-learn/issues/28824
Maybe we can open one more issue on their side since it is a completely different example causing the same behaviour?
I'd love to do it myself but I prefer to focus on other things, if you have a moment feel free!
Maybe we can open one more issue on their side since it is a completely different example causing the same behaviour?
I'd love to do it myself but I prefer to focus on other things, if you have a moment feel free!
Done: https://github.com/scikit-learn/scikit-learn/issues/29182
I confirm https://github.com/scikit-learn/scikit-learn/pull/29018 fixes this issue 🚀