Ali Sabet
@itamarst no worries. Do you have a dockerfile I can use to reproduce?
@ShaheedHaque @itamarst yes! In particular, Celery multiprocessing doesn't play well with CUDA. Know any fixes?
@adampl interesting, can you share a link so I can read up further on that? Do I disable CUDA in the parent by running torch code only if `os.getpid() != os.getppid()`?
> We're deploying celery on both windows and linux nodes and our code turned out to be not fork-safe. On windows, multiprocessing in python only supports spawn, so it worked...
I'm experiencing the same issue.
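If the quoted observation is right that things work on Windows because `spawn` is the only option there, one generic workaround (sketched here with plain `multiprocessing`, not Celery-specific) is to force the `spawn` start method on Linux too, so workers get a fresh interpreter instead of inheriting forked parent state such as an initialized CUDA context:

```python
import multiprocessing as mp


def square(x):
    # Worker function; with 'spawn' it must live at module top level
    # so child processes can re-import it.
    return x * x


if __name__ == "__main__":
    # Use a 'spawn' context (the Windows default) instead of the Linux
    # default 'fork', avoiding fork-safety issues with CUDA state.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        results = pool.map(square, [1, 2, 3])
    print(results)  # [1, 4, 9]
```

Using `mp.get_context("spawn")` rather than the global `mp.set_start_method(...)` keeps the choice local, so it doesn't fight with whatever start method the surrounding framework has already configured.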
Hey @nate-wandb, sure! Here's the [workspace](https://wandb.ai/asabet/huggingface?workspace=user-asabet), and [example run](https://wandb.ai/asabet/huggingface/runs/n4kn17fi?workspace=user-asabet). Logging to wandb is handled with HF [Trainer](https://docs.wandb.ai/guides/integrations/huggingface).
@nate-wandb any luck?
@nate-wandb please help! 🙏 😭
Sorry for delay, will send tomorrow 🙏.
Hey @raj-swype I got the model to train, but the weights aren't fully saved during checkpointing. According to the hf [deepspeed docs](https://huggingface.co/transformers/v4.7.0/main_classes/deepspeed.html), the model state is supposed to be saved...