
Viz hook is broken with ParallelRunner [Blocked by Framework]

Open noklam opened this issue 1 year ago • 11 comments

Description

Running a project with the ParallelRunner triggers warnings from the Kedro-Viz hook. [Screenshot: warning output comparison.]

Context


Left: ParallelRunner. Right: SequentialRunner.

Steps to Reproduce

Create a new project and run kedro run --runner=ParallelRunner

Expected Result

No warnings from Viz.

Actual Result

Warnings that datasets do not exist.


Your Environment

Include as many relevant details as possible about the environment you experienced the bug in:

  • Kedro version: 0.19.3; Kedro-Viz version: 8.0.1


noklam avatar Mar 13 '24 16:03 noklam

This is very similar to #1797

rashidakanchwala avatar Mar 18 '24 10:03 rashidakanchwala

Leaving a comment here: this is a specific case of multiprocessing, so the ParallelRunner is affected. Fundamentally the problem is that the hook is not a process-/thread-safe implementation, so it breaks when the two are used together.
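
To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual Viz code) of why per-instance hook state is lost under multiprocessing: each worker mutates a pickled copy of the hook, and the parent's instance never sees those writes.

    from concurrent.futures import ProcessPoolExecutor

    class StatsHook:
        def __init__(self):
            self.datasets = {}  # plain dict: lives only in one process

        def record(self, name):
            self.datasets[name] = True

    def run_in_worker(hook, name):
        # The hook is pickled into the subprocess; this mutates the copy only.
        hook.record(name)
        return dict(hook.datasets)

    if __name__ == "__main__":
        hook = StatsHook()
        with ProcessPoolExecutor() as pool:
            copies = list(pool.map(run_in_worker, [hook, hook], ["a", "b"]))
        print(copies)         # [{'a': True}, {'b': True}] - one copy per task
        print(hook.datasets)  # {} - the parent's state was never updated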

noklam avatar Mar 25 '24 12:03 noklam

Fundamentally the problem is that the hook is not a process-/thread-safe implementation, so it breaks when the two are used together

To clarify, is this a fundamental limitation of pluggy, the way we implement our hooks, or Viz hook specifically?

astrojuanlu avatar Mar 25 '24 15:03 astrojuanlu

I don't think it's a pluggy-specific problem; it's more that you cannot simply implement an arbitrary class and expect it to work under multiprocessing. See ParallelRunner and SharedMemoryDataset.

So I'd say it's a hook implementation problem, but it's also the general case, because I think most Kedro plugins would break with the ParallelRunner. Maybe there is a nice way to make it work across all plugins, e.g. an AbstractHook class. I had some discussion with @merelcht before, and I think the ParallelRunner is less important than I thought.

So it's an interesting problem, we should probably fix it in kedro-viz since it's a first party plugin, but I don't know if we need a generic solution.

See also:

  • https://github.com/kedro-org/kedro/issues/3713

(edited: or hey! Let's wait for GIL removal and pray Python works well with multiprocessing in the future)

  • https://peps.python.org/pep-0703/

noklam avatar Mar 25 '24 15:03 noklam

Hi @noklam ,

Thanks for raising the issue. Regarding the steps to reproduce ("create a new project and run"): do you have a starter project where the pipeline runs to completion with kedro run --runner=ParallelRunner? I just tested spaceflights-pandas and spaceflights-pandas-viz with the Kedro-Viz hooks completely disabled via settings.py (DISABLE_HOOKS_FOR_PLUGINS = ("kedro-viz",)), and both failed to complete kedro run. This might not be a blocker for resolving the warning, but I would like to know if there is a starter available for the ParallelRunner.
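
For reference, this is the settings.py entry used to disable the plugin's hooks (the standard Kedro setting mentioned above):

    # src/<package_name>/settings.py
    # Disable all hooks auto-registered by the kedro-viz plugin:
    DISABLE_HOOKS_FOR_PLUGINS = ("kedro-viz",)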

                    INFO     Running node: train_model_node: train_model([X_train;y_train]) -> [regressor]   node.py:340
                    ERROR    Node train_model_node: train_model([X_train;y_train]) ->  failed with error:    node.py:365
                             cannot set WRITEABLE flag to True of this array
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/runner/parallel_runner.py", line 91, in _run_node_synchronization
    return run_node(node, catalog, hook_manager, is_async, session_id)
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/runner/runner.py", line 331, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/runner/runner.py", line 424, in _run_node_sequential
    outputs = _call_node_run(
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/runner/runner.py", line 390, in _call_node_run
    raise exc
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/runner/runner.py", line 380, in _call_node_run
    outputs = node.run(inputs)
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/pipeline/node.py", line 371, in run
    raise exc
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/pipeline/node.py", line 357, in run
    outputs = self._run_with_list(inputs, self._inputs)
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/pipeline/node.py", line 402, in _run_with_list
    return self._func(*(inputs[item] for item in node_inputs))
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/spaceflights-pandas/src/spaceflights_pandas/pipelines/data_science/nodes.py", line 38, in train_model
    regressor.fit(X_train, y_train)
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/sklearn/linear_model/_base.py", line 578, in fit
    X, y = self._validate_data(
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1279, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1289, in _check_y
    y = check_array(
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/sklearn/utils/validation.py", line 1097, in check_array
    array.flags.writeable = True
ValueError: cannot set WRITEABLE flag to True of this array
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/framework/cli/cli.py", line 233, in main
    cli_collection()
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/framework/cli/cli.py", line 130, in main
    super().main(
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/framework/cli/project.py", line 225, in run
    session.run(
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/framework/session/session.py", line 395, in run
    run_result = runner.run(
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/runner/runner.py", line 117, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
  File "/Users/Ravi_Kumar_Pilla/Library/CloudStorage/OneDrive-McKinsey&Company/Documents/Kedro/KedroOrg/kedro/kedro/runner/parallel_runner.py", line 314, in _run
    node = future.result()
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/concurrent/futures/_base.py", line 433, in result
    return self.__get_result()
  File "/Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/kedro-viz-py39/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
ValueError: cannot set WRITEABLE flag to True of this array

ravi-kumar-pilla avatar Mar 27 '24 21:03 ravi-kumar-pilla

There is no specific starter; it should work with any of them. I believe the CI also runs this as an end-to-end test.

This may be a scikit-learn problem; can you try downgrading the library?

noklam avatar Mar 28 '24 18:03 noklam

Hi @noklam

As discussed, I will be moving this ticket to the backlog, as we cannot access the SyncManager instance from the hooks to register a shared dict with the manager that ParallelRunner starts. So we need some way of exposing the manager (through either the catalog or the runner in Kedro) and making it available to custom hooks; a sketch of what that could look like follows below.
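
For illustration only, a hypothetical sketch of such an extension point; bind_manager is invented and not part of the current Kedro hook spec:

    from multiprocessing.managers import SyncManager

    class DatasetStatsHook:
        def __init__(self):
            self._stats = {}  # fallback: a plain, process-local dict

        def bind_manager(self, manager: SyncManager):
            # Hypothetical: Kedro would call this with the SyncManager that
            # ParallelRunner starts, before worker processes are created.
            # A manager.dict() proxy is picklable, and every subprocess
            # writes through it into one shared store.
            self._stats = manager.dict()

        def after_dataset_saved(self, dataset_name, data):
            self._stats[dataset_name] = {"type": type(data).__name__}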

Note: For now, the DatasetStatsHook in Kedro-Viz works with the SequentialRunner.

Thank you

ravi-kumar-pilla avatar Apr 03 '24 02:04 ravi-kumar-pilla

Opened a discussion about that https://github.com/kedro-org/kedro/discussions/3776

astrojuanlu avatar Apr 03 '24 16:04 astrojuanlu

We discussed a similar ticket in the framework grooming session (https://github.com/kedro-org/kedro/issues/4078). We decided that it requires more investigation on the framework side. For the time being, it was suggested that we lower the logging level to DEBUG and add a note in the docs that the ParallelRunner doesn't work with the Viz hook. cc: @rashidakanchwala

merelcht avatar Aug 12 '24 15:08 merelcht

Been reading the discussion here as well as https://github.com/kedro-org/kedro/issues/4078 and I have several questions:

  • There are at least two distinct problems described here: (a) Unable to write dataset statistics for the pipeline: 'DatasetStatsHook' object has no attribute 'datasets', and (b) ValueError: cannot set WRITEABLE flag to True of this array. The latter was reported in https://github.com/scikit-learn/scikit-learn/issues/28899 and already solved, right?
  • There were some comments by @noklam above,

I don't think it's a pluggy-specific problem; it's more that you cannot simply implement an arbitrary class and expect it to work under multiprocessing [...] I think most Kedro plugins would break with the ParallelRunner.

But upon re-reading, I don't know what to extract from them.

  • About the 'DatasetStatsHook' object has no attribute 'datasets' issue, I saw several observations of the problem but I couldn't find a root cause analysis.

The DatasetStatsHook.datasets property is not created upon initialisation, but later:

https://github.com/kedro-org/kedro-viz/blob/5653dbdf2ba44e6deaec546b6df811af8d8270f0/package/kedro_viz/integrations/kedro/hooks.py#L38-L39

Newer code has more complicated logic, but the essence is the same.

Without getting into deep debugging myself, is this because DatasetStatsHook gets copied across different threads/processes and that's why the property doesn't exist? I'm just speculating here.
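
To test that hypothesis in isolation, here is a minimal sketch (illustrative only, not the actual Viz code): an attribute created only inside a lifecycle method is missing on the fresh hook instance a worker builds, which reproduces the "object has no attribute" error.

    from concurrent.futures import ProcessPoolExecutor

    class LazyHook:
        def after_catalog_created(self):
            self.datasets = {}  # only created here, never in __init__

        def after_dataset_saved(self, name):
            self.datasets[name] = True  # fails if the setup hook never ran

    def worker(_):
        # Simulates a subprocess re-creating its own hook instance (as
        # ParallelRunner does): the lifecycle method that creates
        # `datasets` is never called there.
        hook = LazyHook()
        try:
            hook.after_dataset_saved("model_input")
        except AttributeError as exc:
            return str(exc)

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=1) as pool:
            print(list(pool.map(worker, [0])))
        # ["'LazyHook' object has no attribute 'datasets'"]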

  • Pasting some extra context from the proposed solution https://github.com/kedro-org/kedro/pull/4769#issuecomment-2918947677

Maybe the trigger of the hooks should be at a higher level so that we never have to even distribute the plugin or hook manager to the subprocesses or threads? [...] The current design avoids concurrent use of a single PluginManager instance across processes. Instead, ParallelRunner equips subprocesses with new PluginManager instances containing only picklable hook implementations. These specific hook implementations (like DatasetStatsHook) are then responsible for using process-safe mechanisms for any state that needs to be coordinated. The core pluggy PluginManager in the main process does not have its hooks called concurrently by different processes.
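
For context, a rough sketch of the design described in that quote, using pluggy directly; the spec, the hook class, and the helper name are all invented for illustration:

    import pickle
    import pluggy

    hookspec = pluggy.HookspecMarker("myproj")
    hookimpl = pluggy.HookimplMarker("myproj")

    class HookSpecs:
        @hookspec
        def after_dataset_saved(self, name): ...

    class StatsHook:
        @hookimpl
        def after_dataset_saved(self, name):
            print(f"stats recorded for: {name}")

    def create_worker_hook_manager(hook_instances):
        # Hypothetical: built inside each subprocess, a fresh PluginManager
        # that registers only the hook implementations that can be pickled
        # across the process boundary.
        pm = pluggy.PluginManager("myproj")
        pm.add_hookspecs(HookSpecs)
        for hook in hook_instances:
            try:
                pickle.dumps(hook)
            except Exception:
                continue  # skip unpicklable implementations
            pm.register(hook)
        return pm

    if __name__ == "__main__":
        pm = create_worker_hook_manager([StatsHook()])
        pm.hook.after_dataset_saved(name="model_input_table")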


Long story short, I think:

  • It's clear that a lot of work has been poured into trying to offer a solution (https://github.com/kedro-org/kedro/pull/4769), but given the discussion in that PR, we haven't arrived at a satisfactory one yet.
  • It's worth remembering that the ParallelRunner is important for users (see the discussion in https://github.com/kedro-org/kedro/issues/4291); if anything, it's likely that users would use it more if it weren't for the issues it has. A key advantage of organising computation as a DAG is being able to parallelize parts of it.
  • The discussion about the causes of this issue is somewhat lacking, as well as fragmented, and it's difficult (at least for me) to understand what's really happening and where the problem lies. It's probably in the heads of the people who have worked on this recently, but it would be useful to have a more thorough analysis posted somewhere.
  • In particular, it's not clear if this is just a Viz problem, or every plugin is potentially affected.

I think we should continue to pursue a solution to this problem, perhaps with more drastic changes to the architecture, or upstreaming some improvements to pluggy, or finding an alternative hook mechanism.

At the same time, if this is just a Kedro-Viz issue, declaring hooks a SequentialRunner-only feature would be an unfortunate outcome, but maybe there are other ways Viz can retrieve the information it needs. Connected to https://github.com/kedro-org/kedro-viz/issues/2310? https://github.com/kedro-org/kedro/issues/4363?

tl;dr: We've done a lot, but we're not ready to ship a fix in time for 1.0, and yet I still think we should continue exploring

astrojuanlu avatar Jun 10 '25 17:06 astrojuanlu

tl;dr: We've done a lot, but we're not ready to ship a fix in time for 1.0, and yet I still think we should continue exploring

Thank you @astrojuanlu for the thorough analysis. The tech design session today aligns well with your conclusion that "we've done a lot, but we're not ready to ship a fix in time for 1.0, and yet I still think we should continue exploring."

Tech Design Outcome

Voting Results

  • Option 1 (Implement Current PR): 0 votes
  • Option 2 (Accept Status Quo & Document): 3 votes (with concerns about clarity and a wish to continue research)
  • Option 3 (Research Better Solutions): 7 votes (Rashida, Dmitry, Laura, Jitendra, Merel, Ankita, Ravi)

Miro Board: https://miro.com/app/board/uXjVIq1ahh8=/

Decision: Research more and document

For Kedro 1.0:

  • Document that hooks don't work with ParallelRunner
  • Add clear warnings

Post 1.0:

  • Pursue research into better solutions
  • Address the root causes properly
  • Engage with the broader Python community

ParallelRunner is Important

The team acknowledged that ParallelRunner matters to users. Our decision to document limitations for 1.0 is pragmatic, not a permanent surrender.

Research Phase Actions

Based on today's vote for Option 3, we'll investigate:

  • How Dask, Ray, and Airflow handle similar challenges
  • Potential pluggy improvements or alternatives
  • Your suggestion about triggering hooks at a higher level

On the Current PR:

  • I will close the PR with an explanation that we're pursuing research first
  • Preserve the code as a reference for future solutions

SajidAlamQB avatar Jun 11 '25 13:06 SajidAlamQB