Zach Mueller
@clam004 another thing to watch out for, which @lewtun pointed out to me: make sure none of the libraries you are using try to initialize CUDA later on in...
@Liweixin22 ensure that nothing has initialized CUDA (for example, by moving a model or tensor to the GPU) before you run `notebook_launcher`. If there are still issues, please let us know which notebook you are trying to run.
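To make that check concrete, here is a minimal sketch of a pre-launch guard, assuming PyTorch is the library in play. `cuda_already_initialized` and `launch_if_cuda_untouched` are hypothetical helper names, not part of accelerate; the only real API relied on is `torch.cuda.is_initialized()`.

```python
import sys


def cuda_already_initialized() -> bool:
    """Report whether PyTorch has already initialized CUDA in this process."""
    # Look torch up in sys.modules so this check never imports torch itself.
    torch = sys.modules.get("torch")
    if torch is None:
        return False
    return bool(torch.cuda.is_initialized())


def launch_if_cuda_untouched(launcher, fn, args=(), num_processes=2):
    """Call `launcher` only if CUDA has not been initialized yet.

    `launcher` is expected to be `accelerate.notebook_launcher`; it is passed
    in as an argument so this sketch has no hard dependency on accelerate.
    """
    if cuda_already_initialized():
        raise RuntimeError(
            "CUDA is already initialized in this notebook; restart the kernel "
            "and keep all GPU work inside the launched function."
        )
    return launcher(fn, args, num_processes=num_processes)
```

In a notebook this would be called as `launch_if_cuda_untouched(notebook_launcher, training_function, args)`, with everything that touches the GPU kept inside `training_function`.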
Try installing from git via `pip install git+https://github.com/huggingface/accelerate`. I believe this is fixed on main
We'll have a release out this week with the non-breaking version, thanks @Sewens @ghadiaravi13!
@infamix I need a copy of your code to help.
@csaroff ran just fine for me:
```python
import os
import torch
from accelerate import notebook_launcher
from fastai.test_utils import synth_learner
import fastai.distributed

os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

def test_nb_launcher():
    learn = synth_learner()
    with learn.distrib_ctx(in_notebook=True):
        ...
```
Great work @yuvalkirstain! We likely wouldn't want to use submitit: their last commit was 6 months ago, which doesn't inspire confidence. Do you know of any other SLURM management...
CC @sgugger
Thanks to @lvwerra, here's a template script that can be used for launching on SLURM:
```bash
#!/bin/bash
#SBATCH --job-name=XYZ
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist...
```
@surak re: your last point, you could probably just write a collection of config.yaml files that store the config for each node in a single folder, and pass that in, perhaps?...
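A rough sketch of what that could look like, under some assumptions: the `node_<id>.yaml` naming scheme and both helper functions are hypothetical, and the node index is read from SLURM's `SLURM_NODEID` environment variable. The only accelerate feature used is the `--config_file` flag of `accelerate launch`.

```python
import os
from pathlib import Path


def config_for_node(config_dir, node_id=None):
    """Return the config path for one node (hypothetical node_<id>.yaml scheme)."""
    if node_id is None:
        # SLURM sets SLURM_NODEID to the relative index of the node in the job.
        node_id = int(os.environ.get("SLURM_NODEID", "0"))
    return Path(config_dir) / f"node_{node_id}.yaml"


def build_launch_command(config_dir, script, node_id=None):
    """Build the `accelerate launch` argument list for this node's config."""
    cfg = config_for_node(config_dir, node_id)
    return ["accelerate", "launch", "--config_file", str(cfg), script]
```

Each node would then run the resulting command, for example via `subprocess.run(build_launch_command("configs", "train.py"), check=True)`, so a single folder of configs covers the whole job.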