
[BUG] custom `Step` causes `Pipeline` to get stuck in loading within `.ipynb`/Google Colab

Open davidberenstein1957 opened this issue 10 months ago • 4 comments

Describe the bug I've defined a custom Step which seems to cause the Pipeline to get stuck in loading in Google Colab, as discussed with @gabrielmbmb. Apparently @frascuchon had the same issue during our offsite.

To Reproduce

import random

# Import paths as of distilabel 1.x.
from distilabel.steps import Step, StepInput
from distilabel.steps.typing import StepOutput


class CreateGoodOrBadAnswerPrompt(Step):
    """A step that builds a prompt asking for a good or a bad answer to an instruction."""

    # Phrases describing the kind of response to request, keyed by label.
    response_directions: dict = {
        True: [
            "makes sense and provide a correct answer to",
        ],
        False: [
            "makes sense but does not answer",
            "does not make sense but tries to",
            "wrongly answers",
            "partially answers",
        ],
    }

    def _get_prompt_template(self, direction, generation):
        return f"""I want you to act as a Response Generator.
Your goal is to create a response for a #Given Instruction#.
You are supposed to create a response that {random.choice(self.response_directions[direction])} the instruction.
Return the response and only the response without "Response:" or quotation marks.
#Given Instruction#:
{generation}
"""

    def process(self, inputs: StepInput) -> StepOutput:
        for input in inputs:
            # Randomly decide whether to ask for a good (True) or a bad (False) answer.
            label = random.choice(list(self.response_directions.keys()))
            input["label"] = label
            input["instruction"] = self._get_prompt_template(label, input["prompt"])
        yield inputs

Expected behaviour I would expect to be allowed to do this.

Screenshots N.A.

Desktop (please complete the following information):

  • Package version: 1.0
  • Python version: 3.10

Additional context N.A.

davidberenstein1957 · Apr 20 '24 05:04

Hi @davidberenstein1957, can you also share the Pipeline code, or the serialized Pipeline (i.e. the output of pipeline.dump()) if that's the only step that fails? Thanks!

alvarobartt · Apr 22 '24 06:04

Okay, Google Colab didn't show any additional information, but a regular Jupyter Notebook did:

Process ForkServerPoolWorker-2:
Traceback (most recent call last):
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'CustomStep' on <module '__main__' (built-in)>

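This error comes from pickle and how it works with Jupyter Notebooks and __main__: pickle stores a class by its module and name only, and the worker process's __main__ does not contain a class that was defined in a notebook cell. I'll try to find a fix.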
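A minimal sketch of the mechanism, using only the standard library. `CustomStep` here is a plain stand-in class, not a distilabel Step, and `notebook_main` is a hypothetical module name playing the role of the notebook's `__main__`. Deleting the class from that module simulates a worker process that never ran the notebook cell:

```python
import pickle
import sys
import types

# Stand-in for the notebook's __main__: a throwaway module registered in sys.modules.
notebook_main = types.ModuleType("notebook_main")
sys.modules["notebook_main"] = notebook_main


class CustomStep:
    """Plain stand-in for a custom Step defined in a notebook cell."""

    def __init__(self, name):
        self.name = name


# Make the class look as if it had been defined inside that module.
CustomStep.__module__ = "notebook_main"
CustomStep.__qualname__ = "CustomStep"
notebook_main.CustomStep = CustomStep

payload = pickle.dumps(CustomStep("my-step"))

# The payload holds only a reference to the class (module + name), not its code.
assert b"notebook_main" in payload and b"CustomStep" in payload

# Simulate the worker: the module exists, but the class was never defined there.
del notebook_main.CustomStep

try:
    pickle.loads(payload)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The lookup that fails here is the same kind of lookup that fails in the traceback above, where the worker cannot find the class on its own `__main__`.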

gabrielmbmb · Apr 22 '24 14:04

While I work on a solution (not easy) that allows defining custom Steps in a notebook cell, the workaround is to create a module, put the custom Steps there, and then import them from the notebook:

image
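A rough, standard-library-only sketch of that workaround. The file name `custom_steps.py` and the stand-in class are hypothetical; in practice the class would be the `Step` subclass from the report, saved in a file next to the notebook. A fresh interpreter plays the role of the worker process:

```python
import os
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical module file; in a notebook this would sit next to the .ipynb file.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "custom_steps.py").write_text(
    "class CustomStep:\n"
    "    # Stand-in: the real class would subclass distilabel's Step.\n"
    "    def __init__(self, name):\n"
    "        self.name = name\n"
)

# In the notebook: import the class instead of defining it in a cell.
sys.path.insert(0, str(module_dir))
from custom_steps import CustomStep

# The class now belongs to an importable module, not to __main__.
payload_file = module_dir / "step.pkl"
payload_file.write_bytes(pickle.dumps(CustomStep("my-step")))

# A fresh interpreter can unpickle the object because it can import
# custom_steps on its own (made visible here through PYTHONPATH).
worker_code = (
    "import pickle, sys; "
    "step = pickle.load(open(sys.argv[1], 'rb')); "
    "print(type(step).__module__, step.name)"
)
result = subprocess.run(
    [sys.executable, "-c", worker_code, str(payload_file)],
    env={**os.environ, "PYTHONPATH": str(module_dir)},
    capture_output=True,
    text=True,
)
worker_output = result.stdout.strip()
print(worker_output)
```

With distilabel the same idea applies: the Step subclass lives in the module file and is imported in the notebook before the Pipeline is built.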

gabrielmbmb · Apr 22 '24 15:04

The solution would come from using dill, but there's an issue between dill and pydantic.BaseModel subclasses declared in __main__ that is preventing us from using it: https://github.com/uqfoundation/dill/issues/650

I've also opened a discussion in pydantic to see if we can get some help from there too: https://github.com/pydantic/pydantic/discussions/9299

gabrielmbmb · Apr 22 '24 15:04