distilabel
[BUG] custom `Step` causes `Pipeline` to get stuck in loading within `.ipynb`/Google Colab
Describe the bug
I've defined a custom `Step` which seems to cause the `Pipeline` to get stuck while loading in Google Colab, as discussed with @gabrielmbmb. Apparently @frascuchon had the same issue during our offsite.
To Reproduce
# `random` is imported at module level so `_get_prompt_template` can use it too.
import random

# Import paths assumed for distilabel 1.0.
from distilabel.steps import Step, StepInput
from distilabel.steps.typing import StepOutput


class CreateGoodOrBadAnswerPrompt(Step):
    """A step to clean the numbered list of questions."""

    # Maps the label (True = good answer, False = bad answer) to prompt phrasings.
    response_directions: dict = {
        True: [
            "makes sense and provide a correct answer to",
        ],
        False: [
            "makes sense but does not answer",
            "does not make sense but tries to",
            "wrongly answers",
            "partially answers",
        ],
    }

    def _get_prompt_template(self, direction, generation):
        return f"""I want you to act as a Response Generator.
Your goal is to create a response for a #Given Instruction#.
You are supposed to create a response that {random.choice(self.response_directions[direction])} the instruction.
Return the response and only the response without "Response:" or quotation marks.
#Given Instruction#:
{generation}
"""

    def process(self, inputs: StepInput) -> StepOutput:
        for input in inputs:
            # Pick a random label and build the matching instruction.
            label = random.choice(list(self.response_directions.keys()))
            input["label"] = label
            input["instruction"] = self._get_prompt_template(label, input["prompt"])
        yield inputs
Expected behaviour
I would expect to be able to define and use a custom `Step` like this.
Screenshots
N.A.
Desktop (please complete the following information):
- Package version: 1.0
- Python version: 3.10
Additional context
N.A.
Hi @davidberenstein1957, can you also share the code for the `Pipeline`, or the serialized `Pipeline` instead if that's the only step that fails? That is, could you call `pipeline.dump()` and share the output? Thanks!
Okay, Google Colab didn't give any additional information, but a regular Jupyter Notebook did:
Process ForkServerPoolWorker-2:
Traceback (most recent call last):
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/Users/gabrielmbmb/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'CustomStep' on <module '__main__' (built-in)>
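This error comes from `pickle` and how it works with Jupyter Notebooks and `__main__`. A class defined in a notebook cell lives in `__main__`, and `pickle` serializes classes by reference (module name plus class name), so the worker process fails when its own `__main__` does not contain the class. I'll try to find a fix.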
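To illustrate the mechanism, here is a minimal standard-library sketch. It is not distilabel code: the module name `fake_main` is hypothetical and stands in for the notebook's `__main__`, and the plain `CustomStep` class stands in for a custom `Step`. Deleting the class before loading simulates a worker process whose `__main__` never defined it.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the notebook's __main__ module.
fake_main = types.ModuleType("fake_main")
exec("class CustomStep:\n    pass\n", fake_main.__dict__)
sys.modules["fake_main"] = fake_main

# Pickling an instance stores the module and class *names*, not the class body.
payload = pickle.dumps(fake_main.CustomStep())
print(b"fake_main" in payload, b"CustomStep" in payload)

# Simulate a worker process whose module does not define the class.
del fake_main.CustomStep

error_message = None
try:
    pickle.loads(payload)
except AttributeError as exc:
    # Same kind of failure as in the traceback above.
    error_message = str(exc)

print(error_message)
```

The lookup by name is why the worker reports that it cannot get the attribute on `__main__`: the class body never travels with the pickled data.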
While I work on a solution (not easy) that allows defining custom `Step`s in a notebook cell, the workaround is to create a module, put the custom `Step`s there, and then import them from the notebook.
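A rough sketch of that workaround, using only the standard library so it runs anywhere. The module name `custom_steps.py` and the plain `CustomStep` class are hypothetical; in a real notebook the file would sit next to the notebook and the class would subclass distilabel's `Step`. The point is only that a class imported from a module can be pickled and restored, because it is no longer attached to `__main__`.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents; a real version would subclass distilabel's Step.
MODULE_SOURCE = '''
class CustomStep:
    """Stand-in for a custom Step defined outside the notebook."""

    def process(self, inputs):
        yield inputs
'''

# Write the module to a temporary directory and make it importable.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "custom_steps.py").write_text(MODULE_SOURCE, encoding="utf-8")
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# This is what the notebook cell would do instead of defining the class inline.
from custom_steps import CustomStep

# The class now belongs to `custom_steps`, so pickle can resolve it by name.
restored = pickle.loads(pickle.dumps(CustomStep()))
print(CustomStep.__module__, type(restored).__name__)
```

Worker processes can import `custom_steps` themselves, which is what makes the by-name lookup succeed there as well, provided the module is on their import path.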
The solution would come from using `dill`, but there's an issue between `dill` and `pydantic.BaseModel`s declared in `__main__` that is preventing us from using it: https://github.com/uqfoundation/dill/issues/650
I've also opened a discussion in `pydantic` to see if we can get some help from there too: https://github.com/pydantic/pydantic/discussions/9299