
[BUG] opaque Pipeline error messages due to Python `multiprocessing.pool` error callback

Open rasdani opened this issue 10 months ago • 3 comments

Describe the bug

I had trouble figuring out why my pipeline was failing because the error messages were not informative. I managed to obtain a much more useful error message by dropping into the Python debugger inside Pipeline's _run_steps_in_loop() and calling process_wrapper.run() from inside the debugger. The fix proposed in the comment there, step.pipeline = None, does not work for me.

To Reproduce

Set up any buggy task that will cause your pipeline to fail silently or cryptically, e.g. specify a wrong file name in your task's load().

from abc import ABC
from typing import List, Optional

import importlib_resources
from jinja2 import Template
from pydantic import PrivateAttr
from distilabel.steps.tasks import Task  # assuming distilabel 1.x's Task import path


class QueryFromDocBase(Task, ABC):

    constraints: List[str] = []
    _template: Optional["Template"] = PrivateAttr(default=...)

    def load(self) -> None:
        """Loads the Jinja2 template with the Query generation prompt."""
        super().load()
        # Intentionally wrong file name: this raises an exception inside load()
        _path = str(importlib_resources.files("ella") / "tasks" / "templates" / "THIS_FILE_DOES_NOT_EXIST.jinja2")

        self._template = Template(open(_path).read())

Then use the task in some Pipeline and run it.

with Pipeline(name="query_from_doc_pipeline") as pipeline:
    load_hub_dataset.connect(query_from_doc_step)
    output = pipeline.run(
        parameters={
            "load_dataset": {"repo_id": dataset_name}
        },
        use_cache=use_cache,
    )

Expected behaviour

This will fail with:

[04/12/24 10:38:56] ERROR    ['distilabel.pipeline.local'] ❌ Failed with an unhandled exception:      local.py:461
                             Error sending result: '<multiprocessing.pool.ExceptionWithTraceback
                             object at 0x1505a4dc0>'. Reason: 'TypeError("cannot pickle
                             '_thread.RLock' object")'
 

Screenshots

To debug and get a much more informative error message, drop into pdb here (first screenshot) and call process_wrapper.run() (second screenshot).
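Since the screenshots are not reproduced here, this is a rough sketch of the debugging approach described above; the exact file and variable names depend on your distilabel version, and process_wrapper is the name used in the issue text:

# Temporarily placed inside distilabel's local pipeline code, right before the
# step is handed to the multiprocessing pool (exact location varies by version):
import pdb; pdb.set_trace()

# Then, at the (Pdb) prompt, run the step synchronously in the current process,
# so its real exception surfaces instead of multiprocessing's pickling error:
# (Pdb) process_wrapper.run()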

Desktop (please complete the following information):

  • Package version: installed with poetry run pip install git+https://github.com/argilla-io/distilabel.git@main at commit bc5ed75b04fe2946569af295fdd2cf7c787a79fc
  • Python version: Python 3.10.13

Additional context

I don't know if this can be solved within distilabel, as I don't see the correct exception even inside Python's multiprocessing.pool.ApplyResult.

(See screenshot.) This is where the exception currently shown to the user is passed to your error_callback, so the callback itself is invoked correctly. It tries to catch _ProcessWrapperException but can't, because multiprocessing is already passing on the cryptic "cannot pickle" exception as self._value to your error_callback (second screenshot).
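For illustration, here is a minimal standalone sketch (plain multiprocessing, no distilabel) of why the callback sees the pickling error instead of the original exception: when the worker's exception cannot be pickled, the pool delivers a multiprocessing.pool.MaybeEncodingError ("Error sending result: ... Reason: ...") to error_callback instead. The names Unpicklable, worker and on_error are made up for this example:

import multiprocessing
import threading


class Unpicklable(Exception):
    """An exception carrying an RLock, which the pickle module rejects."""

    def __init__(self) -> None:
        super().__init__("the original, informative error message")
        self.lock = threading.RLock()


def worker(_: int) -> None:
    raise Unpicklable()


def on_error(exc: BaseException) -> None:
    # exc is NOT Unpicklable: the pool could not pickle it, so it delivers a
    # MaybeEncodingError whose message looks like
    # "Error sending result: ... Reason: TypeError("cannot pickle '_thread.RLock' object")"
    print(type(exc).__name__, "->", exc)


if __name__ == "__main__":
    with multiprocessing.Pool(processes=1) as pool:
        result = pool.apply_async(worker, (0,), error_callback=on_error)
        result.wait()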

On a side note: I have to kill the terminal, because _STOP_LOCK somewhere catches the terminal's interrupt signal and then waits for some batch job to finish up, which it never does.

rasdani avatar Apr 12 '24 10:04 rasdani

Hi @rasdani, I just tried with this pipeline:

import importlib_resources
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset, Step, StepInput


class ThisWillFail(Step):
    def load(self) -> None:
        super().load()
        _path = str(
            importlib_resources.files("distilabel")
            / "tasks"
            / "templates"
            / "THIS_FILE_DOES_NOT_EXIST.jinja2"
        )

        from jinja2 import Template

        Template(open(_path).read())

    def process(self, input: StepInput) -> None:  # type: ignore
        raise Exception


with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    this_will_fail = ThisWillFail(name="this_will_fail")

    load_dataset.connect(this_will_fail)


if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "HuggingFaceH4/instruction-dataset",
                "split": "test",
            }
        },
    )

but I'm not able to reproduce your error; the original exception message is displayed for me:

(screenshot of the output showing the original exception message)

We have seen some cannot pickle '_thread.RLock' object exceptions too, and they usually happened when pipeline.run was executed outside an if __name__ == "__main__": block.
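A generic sketch of that pattern (my_pipeline stands in for whatever pipeline you built in the with block, as in the example above):

# Fragile: a module-level pipeline.run() is re-executed when multiprocessing
# spawns worker processes by re-importing this module, which commonly leads to
# confusing errors such as the pickling failure above.
#
#     distiset = my_pipeline.run()
#
# Safer: only the directly-executed parent process runs the pipeline.
if __name__ == "__main__":
    distiset = my_pipeline.run()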

gabrielmbmb avatar Apr 15 '24 14:04 gabrielmbmb

That said, it's true that we can improve the traceback to provide more information, including the original point where the exception was raised. I will try to improve this before the 1.0.0 release.
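One generic way to do this (purely illustrative, not necessarily how the distilabel change implements it; StepLoadError and run_load_safely are made-up names) is to capture the traceback as a string in the worker process before the exception has to cross the process boundary, since plain strings always pickle:

import traceback


class StepLoadError(Exception):
    """Illustrative exception that carries a pre-formatted traceback string."""

    def __init__(self, message: str, formatted_traceback: str = "") -> None:
        super().__init__(message)
        self.formatted_traceback = formatted_traceback


def run_load_safely(load_fn) -> None:
    """Wrap a step's load() so the original traceback survives pickling."""
    try:
        load_fn()
    except Exception as exc:
        # traceback.format_exc() returns a plain string, which pickles cleanly
        # even when the original exception object does not.
        raise StepLoadError(str(exc), traceback.format_exc()) from None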

gabrielmbmb avatar Apr 15 '24 14:04 gabrielmbmb

Hi @rasdani, we have merged a PR to main that gives a much better traceback when a step's load fails. Could you give it a try?

gabrielmbmb avatar Apr 15 '24 15:04 gabrielmbmb