
Investigate why the Spaceflights project is failing with `ParallelRunner`

Open ankatiyar opened this issue 11 months ago • 25 comments

Description

Flagged by the failing CI on kedro-docker: https://github.com/kedro-org/kedro-plugins/issues/558. Basically, scikit-learn (which is a dependency of the spaceflights-* starters) had a new release on 16 February (https://pypi.org/project/scikit-learn/1.4.1.post1/) which doesn't play well with the ParallelRunner.

Context

Stacktrace

                    INFO     Running node: train_model_node: train_model([X_train;y_train]) ->           node.py:340
                             [regressor]                                                                            
                    ERROR    Node train_model_node: train_model([X_train;y_train]) ->  failed with       node.py:365
                             error:                                                                                 
                             cannot set WRITEABLE flag to True of this array                                        
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/concurrent/futures/process.py", line 261, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/parallel_runner.py", line 91, in _run_node_synchronization
    return run_node(node, catalog, hook_manager, is_async, session_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 331, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 424, in _run_node_sequential
    outputs = _call_node_run(
              ^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 390, in _call_node_run
    raise exc
  File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 380, in _call_node_run
    outputs = node.run(inputs)
              ^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/pipeline/node.py", line 371, in run
    raise exc
  File "/Users/ankita_katiyar/kedro/kedro/kedro/pipeline/node.py", line 357, in run
    outputs = self._run_with_list(inputs, self._inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/pipeline/node.py", line 402, in _run_with_list
    return self._func(*(inputs[item] for item in node_inputs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/.Trash/demo-project/src/demo_project/pipelines/data_science/nodes.py", line 38, in train_model
    regressor.fit(X_train, y_train)
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 578, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1279, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1289, in _check_y
    y = check_array(
        ^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1097, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/bin/kedro", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/cli/cli.py", line 198, in main
    cli_collection()
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/cli/cli.py", line 127, in main
    super().main(
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/cli/project.py", line 225, in run
    session.run(
  File "/Users/ankita_katiyar/kedro/kedro/kedro/framework/session/session.py", line 392, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/runner.py", line 117, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/kedro/kedro/kedro/runner/parallel_runner.py", line 314, in _run
    node = future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/ankita_katiyar/anaconda3/envs/kedro_dev/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
ValueError: cannot set WRITEABLE flag to True of this array

Related https://github.com/scikit-learn/scikit-learn/pull/28348

Steps to Reproduce

kedro run --runner=ParallelRunner

ankatiyar avatar Mar 01 '24 15:03 ankatiyar

Also, a reminder to revert the changes in https://github.com/kedro-org/kedro-plugins/pull/591 after this is resolved.

ankatiyar avatar Mar 01 '24 15:03 ankatiyar

The outcome for this ticket is to investigate the root cause and propose a solution to fix it.

Potential causes:

  • Version of scikit-learn
  • ParallelRunner
  • Starter

noklam avatar Mar 25 '24 14:03 noklam

Tested with:

  • scikit-learn : 1.4.1.post1
  • numpy==1.26.4

Things explored so far:

  1. The error happens when scikit-learn validates the input data: https://github.com/scikit-learn/scikit-learn/blob/941acc419b8e7bec86fdc6b27ab3c4703022f140/sklearn/utils/validation.py#L1099
  2. The validation includes converting the input data to a numpy array and then setting array.flags.writeable = True: https://github.com/scikit-learn/scikit-learn/blob/941acc419b8e7bec86fdc6b27ab3c4703022f140/sklearn/utils/_array_api.py#L712
  3. Setting that flag causes the final error: ValueError: cannot set WRITEABLE flag to True of this array
  4. If you check the flags of the converted array, you can see that OWNDATA is False, i.e. the array does not own the memory it uses but borrows it from another object (see https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flags.html):
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  5. From here (and from local experiments) we see that we cannot change the WRITEABLE attribute if the original array has WRITEABLE=False; a minimal sketch of this numpy behaviour is shown right after this list.
  6. So it looks like the created numpy array shares memory with another object which is not WRITEABLE.
  7. The numpy array is created straight from the provided input, and the error happens only for pandas.core.series.Series; the same conversion works fine for pandas.core.frame.DataFrame.
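
As a minimal illustration of points 5 and 6 (independent of Kedro and pandas; the bytes buffer below is just an arbitrary example of a read-only buffer), numpy refuses to turn the WRITEABLE flag back on for an array that merely borrows a read-only buffer:

  import numpy as np

  buf = bytes(8 * 4)                        # an immutable (read-only) buffer of 32 bytes
  arr = np.frombuffer(buf, dtype=np.int64)  # the array borrows the buffer, it does not own it
  print(arr.flags.owndata)                  # False
  print(arr.flags.writeable)                # False
  arr.flags.writeable = True                # ValueError: cannot set WRITEABLE flag to True of this array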

ElenaKhaustova avatar Apr 05 '24 16:04 ElenaKhaustova

After further investigation, it was found that the problem appears after the object is retrieved from SharedMemoryDataset. In the example below we convert a pandas.core.series.Series to a numpy array and then set WRITEABLE=True, which works fine. After the object is saved to SharedMemoryDataset and then retrieved, the OWNDATA flag becomes False and changing WRITEABLE raises an error.

  from pathlib import Path

  import numpy
  import pandas as pd

  # ParallelRunnerManager and SharedMemoryDataset are Kedro internals; import them
  # from the Kedro code base (the exact import path depends on the Kedro version)

  input_path = Path.cwd() / "data"
  y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
  # converting to series
  y_train = y_train.stack()

  print(type(y_train))
  test_y = numpy.asarray(y_train, order=None, dtype=None)
  print(test_y.flags)
  test_y.flags.writeable = True

  manager = ParallelRunnerManager()
  manager.start()
  dataset = SharedMemoryDataset(manager=manager)
  dataset._save(y_train)
  out = dataset._load()

  print(type(out))
  test_y = numpy.asarray(out, order=None, dtype=None)
  print(test_y.flags)
  test_y.flags.writeable = True

Output: (screenshot, not reproduced here: after loading from SharedMemoryDataset, the final `writeable = True` assignment fails with the ValueError)

ElenaKhaustova avatar Apr 05 '24 16:04 ElenaKhaustova

The next step is to investigate what's happening inside SharedMemoryDataset, whether this behaviour is expected, and why it only affects pandas.core.series.Series.

ElenaKhaustova avatar Apr 05 '24 16:04 ElenaKhaustova

In earlier scikit-learn versions (<= 1.4.0) the following step is absent, so the error does not occur:

  # With an input pandas dataframe or series, we know we can always make the
  # resulting array writeable:
  # - if copy=True, we have already made a copy so it is fine to make the
  #   array writeable
  # - if copy=False, the caller is telling us explicitly that we can do
  #   in-place modifications
  # See https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays
  # for more details about pandas copy-on-write mechanism, that is enabled by
  # default in pandas 3.0.0.dev.
  if _is_pandas_df_or_series(array_orig) and hasattr(array, "flags"):
      array.flags.writeable = True

ElenaKhaustova avatar Apr 08 '24 14:04 ElenaKhaustova

The test below confirms that the problem is specific to SharedMemoryDataset: exactly the same example as above, but using MemoryDataset, works fine.

  from pathlib import Path

  import numpy
  import pandas as pd

  from kedro.io import MemoryDataset

  input_path = Path.cwd() / "data"
  y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
  # converting to series
  y_train = y_train.stack()

  print(type(y_train))
  test_y = numpy.asarray(y_train, order=None, dtype=None)
  print(test_y.flags)
  test_y.flags.writeable = True

  dataset = MemoryDataset()
  dataset._save(y_train)
  out = dataset._load()

  print(type(out))
  test_y = numpy.asarray(out, order=None, dtype=None)
  print(test_y.flags)
  test_y.flags.writeable = True
(Screenshot, not reproduced here: with MemoryDataset, both `writeable = True` assignments succeed.)

ElenaKhaustova avatar Apr 09 '24 12:04 ElenaKhaustova

Further tests excluded the kedro code base. The actual problem happens when using multiprocessing.managers.BaseManager inside the ParallelRunner. We register MemoryDataset to be used with multiprocessing.managers.BaseManager as follows:

class ParallelRunnerManager(SyncManager):
    """``ParallelRunnerManager`` is used to create shared ``MemoryDataset``
    objects as default data sets in a pipeline.
    """


ParallelRunnerManager.register("MemoryDataset", MemoryDataset)

When running, the ParallelRunner places the MemoryDataset into shared memory and returns a proxy of the MemoryDataset object. See:
https://docs.python.org/3/library/multiprocessing.shared_memory.html
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.managers.BaseManager
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.managers.BaseProxy

After we retrieve data from the MemoryDataset proxy object, we get this error when setting WRITEABLE=True:

from pathlib import Path
from multiprocessing.managers import BaseManager

import numpy
import pandas as pd

from kedro.io import MemoryDataset


class MyManager(BaseManager): pass
MyManager.register("MemoryDataset", MemoryDataset, exposed=('_save', '_load'))


def main():
    input_path = Path.cwd() / "data"
    y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
    y_train = y_train.stack()

    print(type(y_train))
    test_y = numpy.asarray(y_train, order=None, dtype=None)
    print(test_y.flags)
    test_y.flags.writeable = True

    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset._save(y_train)
    out = dataset._load()

    print(type(out))
    test_y_out = numpy.asarray(out, order=None, dtype=None)
    print(test_y_out.flags)
    test_y_out.flags.writeable = True


if __name__ == "__main__":
    main()
(Screenshot, not reproduced here: the final `writeable = True` on the data loaded through the proxy fails with the ValueError.)

ElenaKhaustova avatar Apr 09 '24 13:04 ElenaKhaustova

The reason for the above is that numpy doesn't allow an array based on a read-only buffer to be set as writeable. A possible reason why the behaviour differs between pd.DataFrame and pd.Series is that the numpy.asarray() conversion happens differently, so in the pd.DataFrame case we get a copy of the object.

Thus, making a copy of the pd.Series object loaded from the MemoryDataset solves the problem.

import copy
from pathlib import Path
from multiprocessing.managers import BaseManager

import numpy
import pandas as pd

from kedro.io import MemoryDataset


class MyManager(BaseManager): pass
MyManager.register("MemoryDataset", MemoryDataset, exposed=('_save', '_load'))

input_path = Path.cwd() / "data"
y_train = pd.read_csv(input_path / "02_intermediate" / "y_train.csv")
y_train = y_train.stack()

print(type(y_train))
test_y = numpy.asarray(y_train, order=None, dtype=None)
print(test_y.flags)
test_y.flags.writeable = True

manager = MyManager()
manager.start()
dataset = manager.MemoryDataset()
dataset._save(y_train)
out = copy.deepcopy(dataset._load())

print(type(out))
test_y_out = numpy.asarray(out, order=None, dtype=None)
print(test_y_out.flags)
test_y_out.flags.writeable = True
(Screenshot, not reproduced here: with the deepcopy, setting `writeable = True` succeeds.)

ElenaKhaustova avatar Apr 09 '24 14:04 ElenaKhaustova

So a solution that might work for us is to modify the part where we retrieve data from the catalog, before calling the node function, here:

def _run_node_sequential(
    node: Node,
    catalog: DataCatalog,
    hook_manager: PluginManager,
    session_id: str | None = None,
) -> Node:
    inputs = {}

    for name in node.inputs:
        hook_manager.hook.before_dataset_loaded(dataset_name=name, node=node)
        data = catalog.load(name)
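        # Changed part: deep-copy pd.Series inputs so that the downstream numpy
        # conversion gets an array backed by writeable memory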
        if isinstance(data, pd.Series):
            inputs[name] = copy.deepcopy(data)
        else:
            inputs[name] = data
        hook_manager.hook.after_dataset_loaded(
            dataset_name=name, data=inputs[name], node=node
        )

Tested this locally and it works.

ElenaKhaustova avatar Apr 09 '24 14:04 ElenaKhaustova

Summary:

  • the problem relates to shared memory usage
  • the problem is not on our side; at least it's not a bug introduced by us
  • if we don't address it, the problem will most probably remain with all new scikit-learn versions, since their behaviour also seems valid on their side
  • there's a workaround described above, though it doesn't seem ideal either

@noklam, @ankatiyar, @merelcht, @astrojuanlu I need your thoughts here on whether we want to apply the suggested fix, though it might take some time to follow all my comments above 🙂

ElenaKhaustova avatar Apr 09 '24 14:04 ElenaKhaustova

@ElenaKhaustova Can you point to the changes that you have made?

I wonder if there is anything we can report upstream, and whether we can create an example that strips away the kedro-related context. From what I've read, the problem is not a bug in pandas or numpy, but rather that scikit-learn performs a validation and updates the flag. So maybe we should report this upstream to scikit-learn.

noklam avatar Apr 09 '24 14:04 noklam

cannot set WRITEABLE flag to True of this array

Google: https://www.google.com/search?q=cannot+set+writeable+flag+to+true+of+this+array

Searching for this bug turns up tons of reports everywhere; some are library compatibility issues.

Is this a scikit-learn problem? From your latest comment, it seems you can reproduce the same issue with just SharedMemoryDataset and numpy.

Can you also point me to the change that works?

noklam avatar Apr 09 '24 15:04 noklam

@ElenaKhaustova Can you point to the changes that you have made?

I wonder if there is anything we can report upstream, and whether we can create an example that strips away the kedro-related context. From what I've read, the problem is not a bug in pandas or numpy, but rather that scikit-learn performs a validation and updates the flag. So maybe we should report this upstream to scikit-learn.

These are the changes: https://github.com/kedro-org/kedro/issues/3674#issuecomment-2045291676. I'll open a draft PR as well for better visibility.

Yes, we can strip away the kedro-related context by creating a fake MemoryDataset with save and load methods. We might try to report this, though it doesn't seem like a bug on their side either.

I can create a fake dataset and add the scikit-learn logic to showcase the error if we want to report it to them.

ElenaKhaustova avatar Apr 09 '24 15:04 ElenaKhaustova

Oh sorry, I didn't notice that was the change. This reminds me of something. If you check MemoryDataset:

    if copy_mode == "deepcopy":
        copied_data = copy.deepcopy(data)
    elif copy_mode == "copy":
        copied_data = data.copy()
    elif copy_mode == "assign":
        copied_data = data

We already have something like this; maybe we just need to update _infer_copy_mode? (A sketch of what that might look like follows the function below.)

def _infer_copy_mode(data: Any) -> str:
    """Infers the copy mode to use given the data type.

    Args:
        data: The data whose type will be used to infer the copy mode.

    Returns:
        One of "copy", "assign" or "deepcopy" as the copy mode to use.
    """
    try:
        import pandas as pd
    except ImportError:  # pragma: no cover
        pd = None  # type: ignore[assignment]  # pragma: no cover
    try:
        import numpy as np
    except ImportError:  # pragma: no cover
        np = None  # type: ignore[assignment] # pragma: no cover

    if pd and isinstance(data, pd.DataFrame) or np and isinstance(data, np.ndarray):
        copy_mode = "copy"
    elif type(data).__name__ == "DataFrame":
        copy_mode = "assign"
    else:
        copy_mode = "deepcopy"
    return copy_mode
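
For illustration only, the tweak being floated here might look something like the sketch below: route pd.Series through the "copy" branch the same way as pd.DataFrame (the function name is made up for this sketch; as the reply below explains, this alone turned out not to help, because the copy has to happen after the data comes back from the catalog):

from typing import Any


def _infer_copy_mode_with_series(data: Any) -> str:
    """Sketch: same as ``_infer_copy_mode`` above, but pd.Series also gets "copy"."""
    try:
        import pandas as pd
    except ImportError:  # pragma: no cover
        pd = None  # type: ignore[assignment]
    try:
        import numpy as np
    except ImportError:  # pragma: no cover
        np = None  # type: ignore[assignment]

    # pd.Series added alongside pd.DataFrame so that loaded series would be copied too
    if (
        pd
        and isinstance(data, (pd.DataFrame, pd.Series))
        or np
        and isinstance(data, np.ndarray)
    ):
        copy_mode = "copy"
    elif type(data).__name__ == "DataFrame":
        copy_mode = "assign"
    else:
        copy_mode = "deepcopy"
    return copy_mode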

noklam avatar Apr 09 '24 15:04 noklam

(Quoting @noklam's suggestion above to update _infer_copy_mode.)

Here is the draft PR: https://github.com/kedro-org/kedro/pull/3795/files

The problem is that we have to make the copy after the data has been retrieved from the catalog (after load()), but the copy controlled by _infer_copy_mode happens inside the dataset before that, so changing it doesn't help. So we cannot make the copy there 🙁

ElenaKhaustova avatar Apr 09 '24 15:04 ElenaKhaustova

An example with the kedro logic stripped away that reproduces the error:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


def train_model(dataset: MemoryDataset) -> LinearRegression:
    regressor = LinearRegression()
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception as _:
        print(traceback.format_exc())
    return regressor


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))

    futures = set()

    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))

    with ProcessPoolExecutor(max_workers=1) as pool:
        futures.add(pool.submit(train_model, dataset))


if __name__ == "__main__":
    main()

ElenaKhaustova avatar Apr 26 '24 10:04 ElenaKhaustova

Looks like this is mostly an upstream bug and there's little we can do about it. Unfortunately this means that ParallelRunner is mostly broken for a good chunk of basic use cases.

Removing this from our sprints, for now.

astrojuanlu avatar May 27 '24 07:05 astrojuanlu

I can reproduce this issue with the SequentialRunner.

(Rough) steps:

  1. Create a project using spaceflights-pandas-viz
  2. Serialise X_test and y_test:
X_test:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_test.pkl

y_test:
  type: pickle.PickleDataset
  filepath: data/05_model_input/y_test.pkl
  3. Execute the pipeline until that point:
$ kedro run --to-outputs=X_test,y_test
  4. Run the pipeline from the inference node:
$ kedro run --from-nodes=evaluate_model_node

Full traceback:

Traceback (most recent call last):
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/bin/kedro", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/cli/cli.py", line 233, in main
    cli_collection()
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/cli/cli.py", line 130, in main
    super().main(
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/cli/project.py", line 225, in run
    session.run(
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/framework/session/session.py", line 395, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 117, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/sequential_runner.py", line 75, in _run
    run_node(node, catalog, hook_manager, self._is_async, session_id)
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 413, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 506, in _run_node_sequential
    outputs = _call_node_run(
              ^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 472, in _call_node_run
    raise exc
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/runner/runner.py", line 462, in _call_node_run
    outputs = node.run(inputs)
              ^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/pipeline/node.py", line 392, in run
    raise exc
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/pipeline/node.py", line 378, in run
    outputs = self._run_with_list(inputs, self._inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/pipeline/node.py", line 423, in _run_with_list
    return self._func(*(inputs[item] for item in node_inputs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/src/spaceflights_mlflow/pipelines/data_science/nodes.py", line 54, in evaluate_model
    mae = mean_absolute_error(y_test, y_pred)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/metrics/_regression.py", line 216, in mean_absolute_error
    y_type, y_true, y_pred, multioutput = _check_reg_targets(
                                          ^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/metrics/_regression.py", line 112, in _check_reg_targets
    y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1107, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array

astrojuanlu avatar Jun 04 '24 09:06 astrojuanlu

It's funny because the array is WRITEABLE already anyway.

❯ python -m pdb -m kedro run --from-nodes=evaluate_model_node --params mlflow_run_id=4cba849c8f2d403887e95dbef1091142 --runner=SequentialRunner
> /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/kedro/__main__.py(1)<module>()
-> """Entry point when invoked with python -m kedro."""  # pragma: no cover
(Pdb) b /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:1107
Breakpoint 1 at /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py:1107
(Pdb) c
[06/04/24 11:31:57] INFO     Using `conf/logging.yml` as logging configuration. You can change    __init__.py:249
                             this by setting the KEDRO_LOGGING_CONFIG environment variable                       
                             accordingly.                                                                        
[06/04/24 11:32:01] INFO     Kedro project spaceflights-mlflow                                     session.py:324
                    INFO     Registering new custom resolver: 'km.random_name'                  mlflow_hook.py:65
                    INFO     The 'tracking_uri' key in mlflow.yml is relative          kedro_mlflow_config.py:260
                             ('server.mlflow_(tracking|registry)_uri = mlflow_runs').                            
                             It is converted to a valid uri:                                                     
                             'file:///Users/juan_cano/Projects/QuantumBlackLabs/kedro-                           
                             mlflow-playground/spaceflights-mlflow/mlflow_runs'                                  
[06/04/24 11:32:08] INFO     Logging extra metadata to MLflow                                         hooks.py:13
                    INFO     Using synchronous mode for loading and saving data. Use the  sequential_runner.py:64
                             --async flag for potential performance gains.                                       
                             https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_p                        
                             ipeline.html#load-and-save-asynchronously                                           
                    INFO     Loading data from regressor (MlflowModelTrackingDataset)...      data_catalog.py:508
                    INFO     Loading data from X_test (PickleDataset)...                      data_catalog.py:508
                    INFO     Loading data from y_test (PickleDataset)...                      data_catalog.py:508
                    INFO     Running node: evaluate_model_node:                                       node.py:361
                             evaluate_model([regressor;X_test;y_test]) -> [metrics]                              
> /Users/juan_cano/Projects/QuantumBlackLabs/kedro-mlflow-playground/spaceflights-mlflow/.venv/lib/python3.11/site-packages/sklearn/utils/validation.py(1107)check_array()
-> array.flags.writeable = True
(Pdb) p array.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

astrojuanlu avatar Jun 04 '24 09:06 astrojuanlu

Seems to have nothing to do with Kedro:

import pickle

from sklearn.metrics import mean_absolute_error

with open("_data/X_test.pkl", "rb") as fh:
    X_test = pickle.load(fh)
with open("_data/y_test.pkl", "rb") as fh:
    y_test = pickle.load(fh)
with open("_data/regressor.pickle", "rb") as fh:
    regressor = pickle.load(fh)

y_pred = regressor.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)

Attaching the contents of _data.

_data.zip

And my uv pip freeze:

joblib==1.4.2
numpy==1.26.4
pandas==2.2.2
python-dateutil==2.9.0.post0
pytz==2024.1
scikit-learn==1.5.0
scipy==1.13.1
six==1.16.0
threadpoolctl==3.5.0
tzdata==2024.1

And Python version:

$ python -VV
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]

astrojuanlu avatar Jun 04 '24 10:06 astrojuanlu

Seems to have nothing to do with Kedro:

It's sad that it's still there in version 1.5.0. Maybe we can open one more issue on their side, since it is a completely different example causing the same behaviour?

There's a PR that can mitigate the problem but not solve it completely, since there's an ongoing discussion about whether setting writeable=True is correct in general: https://github.com/scikit-learn/scikit-learn/issues/28824

ElenaKhaustova avatar Jun 04 '24 10:06 ElenaKhaustova

Maybe we can open one more issue on their side since it is a completely different example causing the same behaviour?

I'd love to do it myself but I prefer to focus on other things, if you have a moment feel free!

astrojuanlu avatar Jun 04 '24 11:06 astrojuanlu

Maybe we can open one more issue on their side since it is a completely different example causing the same behaviour?

I'd love to do it myself but I prefer to focus on other things, if you have a moment feel free!

Done: https://github.com/scikit-learn/scikit-learn/issues/29182

ElenaKhaustova avatar Jun 04 '24 23:06 ElenaKhaustova

I confirm https://github.com/scikit-learn/scikit-learn/pull/29018 fixes this issue 🚀

astrojuanlu avatar Jun 14 '24 15:06 astrojuanlu