CUDA initialization

Open Afera672 opened this issue 2 years ago • 21 comments

System Info

Hello everybody. I keep encountering the same issue. I use PyTorch '1.12.1+cu102' and FastAI '2.7.9'.
I need to use the multiple GPUs in our server to train deeper networks with more images.
___
accelerate env

Traceback (most recent call last):
  File "/home/andrea/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/env.py", line 34, in env_command
    accelerate_config = load_config_from_file(args.config_file).to_dict()
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 63, in load_config_from_file
    return config_class.from_yaml_file(yaml_file=config_file)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 116, in from_yaml_file
    return cls(**config_dict)
TypeError: __init__() got an unexpected keyword argument 'command_file'
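
The traceback ends in `cls(**config_dict)`, which points at the saved config file containing a key (`command_file`) that the installed version's config class does not accept. A minimal sketch of that failure mode, using a hypothetical stand-in dataclass (`DemoConfig` and its fields are assumptions for illustration, not Accelerate's real class):

```python
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:
    # Hypothetical fields; the real config class has a different set.
    compute_environment: str = "LOCAL_MACHINE"
    num_processes: int = 1


def unknown_keys(config_cls, config_dict):
    """Return the keys in config_dict that config_cls's __init__ would reject."""
    accepted = {f.name for f in fields(config_cls)}
    return sorted(set(config_dict) - accepted)


# A config dict as it might be loaded from a YAML file written by another version.
config_dict = {"num_processes": 4, "command_file": None}

# Report stale keys before constructing the object.
print(unknown_keys(DemoConfig, config_dict))

# Constructing directly reproduces the same kind of TypeError as in the traceback.
try:
    DemoConfig(**config_dict)
except TypeError as err:
    print(err)
```

If the real file does contain such a stale key, regenerating it with `accelerate config` is one common way to clear the mismatch, though whether that applies here is an assumption.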

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

Here is the script that I am using:


from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model
from accelerate import notebook_launcher

def get_msk(o):
    return path_Rflbl+fr'/RfM_{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes=[i for i in range(0,16)] #as I am labeling 16 categories in the data
print('numeral codes ', numeral_codes)
file = open(path+'/codes.txt', "w+")

# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8,
        fnames = get_image_files(path+'/Impng'),
        label_func = get_msk,
        codes = np.loadtxt(path+'/codes.txt', dtype=str)
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)


It all works until I use notebook_launcher. Then it comes up with:

ValueError                                Traceback (most recent call last)
Input In [46], in <cell line: 24>()
     19     with learn.distrib_ctx(in_notebook=True, sync_bn=False):
     20         learn.fit(10)
---> 24 notebook_launcher(train, num_processes=4)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:102, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
     95     raise ValueError(
     96         "To launch a multi-GPU training from your notebook, the Accelerator should only be initialized "
     97         "inside your training function. Restart your notebook and make sure no cells initializes an "
     98         "Accelerator."
     99     )
    101 if torch.cuda.is_initialized():
--> 102     raise ValueError(
    103         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    104         "using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA "
    105         "function."
    106     )
    108 try:
    109     mixed_precision = PrecisionType(mixed_precision.lower())

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA function.


Yet, I have no CUDA instructions. And I need the notebook launcher in order to train on multiple GPUs (I would have 6).

Do you have any ideas? Do I need to update some version of something?

Expected behavior

If, instead of my data, I use
path = untar_data(URLs.CAMVID_TINY)

I can train on up to 4 GPUs, also using xresnet50. The processes seem to run on 4 independent GPUs, but I am not yet sure that each one handles a chunk of the total and that the calculation runs in parallel as I intend. For instance, I am not sure that the memory available for the whole calculation is the sum of the GPUs' memory.

Anyhow, could you please help me in executing this calculation on multiple GPUs?

Afera672 avatar Dec 09 '22 02:12 Afera672

cc @muellerzr

sgugger avatar Dec 09 '22 14:12 sgugger

Thank you, Sylvain! I cannot read what you are thinking, but I understand from your message that this is something you will look at next. Perfect! I hope to be able to solve it very soon with Zachary's help. Thanks again!


PS: That was just me thanking Sylvain for his quick answer pointing me to Zachary, whom I thank in advance for taking the time to help me here. All in all, if this starts working out, this company may be interested in setting up a partnership with you so that you can dedicate some professional time to helping us set this up for our clients.

Afera672 avatar Dec 09 '22 14:12 Afera672

@sgugger @muellerzr

Afera672 avatar Dec 19 '22 16:12 Afera672

@sgugger @muellerzr OK: the news is that this company is now offering a budget to get help from you on a professional basis. I think a few hours should be enough for someone expert in these issues. I do not understand why the dataloader and/or the datablock do not behave as they do with the CAMVID dataset, even though the data are of the same kind: pictures and masks. Could you please help?

Afera672 avatar Dec 19 '22 16:12 Afera672

@Afera672 what version of Accelerate are you using? And can you do echo ~/.cache/huggingface/accelerate/default_config.yml and tell me what it outputs?

muellerzr avatar Dec 19 '22 17:12 muellerzr

And also please look at the examples in the fastai docs that showcase how to use this functionality:

https://docs.fast.ai/tutorial.distributed.html

There is an important note there:

It is important to not build the DataLoaders outside of the function, as absolutely nothing can be loaded onto CUDA beforehand.
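
The launcher's guard shown in the traceback earlier in this thread is `torch.cuda.is_initialized()`. As a rough sketch (the helper name `cuda_state` is hypothetical and not part of Accelerate or fastai), a cell like the following can be run just before `notebook_launcher` to see whether an earlier cell has already touched CUDA:

```python
def cuda_state():
    """Describe whether CUDA has been initialized in the current process."""
    try:
        import torch  # importing torch by itself does not initialize CUDA
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_initialized():
        return "CUDA is already initialized; restart the kernel before launching"
    return "CUDA is not initialized"


print(cuda_state())
```

If it reports that CUDA is already initialized, a likely place to look is any DataLoaders or Learner built in an earlier cell, in line with the note quoted above.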

muellerzr avatar Dec 19 '22 17:12 muellerzr

@muellerzr Thank you for the reply.

The command echo ~/.cache/huggingface/accelerate/default_config.yml outputs: /home/andrea/.cache/huggingface/accelerate/default_config.yml

I should have accelerate 2.0 installed, but I have not found a way to confirm this.
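
Note that `echo` only prints the path itself, not the file's contents. A small sketch, using only the Python standard library, of one way to confirm the installed version and read the config file (the helper names `installed_version` and `read_config` are illustrative, not part of Accelerate):

```python
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def read_config(path):
    """Return the text of the config file, or None if it does not exist."""
    p = Path(path).expanduser()
    return p.read_text() if p.is_file() else None


print("accelerate version:", installed_version("accelerate"))
print(read_config("~/.cache/huggingface/accelerate/default_config.yml"))
```

Both values can then be pasted into the issue as requested.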

It seems to load the datablock, which I simply have like this:

##3 Start training on multiple GPUs on a parallel thread

from accelerate import notebook_launcher

def get_msk(o):
    return path_Rflbl+fr'/RfM_{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes=[i for i in range(0,16)]
print('numeral codes ', numeral_codes) #numeral codes understood by FastAI

file = open(path+'/codes.txt', "w+")

# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8,
        fnames = get_image_files(path+'/Impng'),
        label_func = get_msk,
        after_item=ToTensor(),
        codes = np.loadtxt(path+'/codes.txt', dtype=str)
    )
    learn = unet_learner(resnet34,dls, dls=TfmdDL(after_item=ToTensor(4,80,80), after_batch=[IntToFloatTensor(), *aug_transforms()], bs=8))
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)


Then, when I run (in the next cell or in the same one):

notebook_launcher(train, num_processes=2)

it raises an exception:

Launching training on 2 GPUs.


ProcessRaisedException                    Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 notebook_launcher(train, num_processes=2)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:127, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
    124     launcher = PrepareForLaunch(function, distributed_type="MULTI_GPU")
    126     print(f"Launching training on {num_processes} GPUs.")
--> 127     start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    129 else:
    130     # No need for a distributed launch otherwise as it's either CPU or one GPU.
    131     if torch.cuda.is_available():

File ~/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:198, in start_processes(fn, args, nprocs, join, daemon, start_method)
    195     return context
    197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
    199     pass

File ~/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:160, in ProcessContext.join(self, timeout)
    158 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    159 msg += original_trace
--> 160 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/utils/launch.py", line 72, in __call__
    self.launcher(*args)
  File "/tmp/ipykernel_2495035/3417734586.py", line 21, in train
    dls = SegmentationDataLoaders.from_label_func(
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/vision/data.py", line 216, in from_label_func
    res = cls.from_dblock(dblock, fnames, path=path, **kwargs)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/data/core.py", line 281, in from_dblock
    return dblock.dataloaders(source, path=path, bs=bs, val_bs=val_bs, shuffle=shuffle, device=device, **kwargs)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/fastai/data/block.py", line 157, in dataloaders
    return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
TypeError: fastai.data.core.FilteredBase.dataloaders() got multiple values for keyword argument 'after_item'

I cannot find how to implement 'after_item', which should resize all images to the same dimensions. I thought this was already done in the datablock definition, no?

any ideas?
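
The last frame of the traceback shows fastai calling `dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)`. An `after_item=` argument passed to `from_label_func` therefore travels down inside `**kwargs` and collides with the `after_item` that fastai already sets from `item_tfms`. A minimal pure-Python sketch of the same collision (both functions below are simplified stand-ins, not fastai code):

```python
def dataloaders(path=None, after_item=None, after_batch=None, **kwargs):
    """Stand-in for the innermost call in the traceback."""
    return {"after_item": after_item, "after_batch": after_batch}


def from_label_func(path, item_tfms=None, batch_tfms=None, **kwargs):
    """Stand-in wrapper: forwards item_tfms as after_item, plus any extra kwargs."""
    return dataloaders(path=path, after_item=item_tfms, after_batch=batch_tfms, **kwargs)


# Works: the transform is passed under the wrapper's own parameter name.
ok = from_label_func("data", item_tfms=["ToTensor"])
print(ok)

# Fails: after_item arrives twice, once explicitly and once through **kwargs.
try:
    from_label_func("data", after_item=["ToTensor"])
except TypeError as err:
    print(err)
```

Under that reading, item-level transforms would go in through `item_tfms` (and batch-level ones through `batch_tfms`) rather than `after_item`.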

In passing, the reason why I do not want to implement 'distributed learning' is that when I tried it, it opened multiple threads on the same GPU. I need instead to have multiple GPUs working together so that I can train and use deeper networks (ResNet50 or higher) with many hundreds of images. Right now ResNet34 only just fits in memory. This is a segmentation problem: we segment images from satellites.

This company is also offering a fee for you to consult with me/us on how to use this library efficiently, since we do not have much time and we have already been trying for a few weeks. I hope you have time; let me know if you can take this offer. HR will send you (and whoever you would like to work with you) a contract involving a non-disclosure agreement.

Anyhow, thank you for answering me. Looking forward to your reply. Andrea Fera

Afera672 avatar Dec 19 '22 17:12 Afera672

Your fastai code looks wrong to me. Also it would be helpful if you could wrap the code in code ticks (`) so that the code gets preformatted properly. Try using the code such that:

def train():
  dls = SegmentationDataLoaders.from_label_func(
    path, 
    bs=8, 
    fnames = get_image_files(path+'/Impng'),
    label_func = get_msk, 
    item_tfms=[ToTensor()],
    batch_tfms=[IntToFloatTensor(), *aug_transforms()],
    codes = np.loadtxt(path+'/codes.txt', dtype=str)
  )
  learn = unet_learner(dls, resnet34)
  with learn.distrib_ctx(in_notebook=True, sync_bn=False):
    learn.fit(10)

muellerzr avatar Dec 19 '22 20:12 muellerzr

@Zachary

Thank you for your insight. I also feared that part was wrong. Yet now it replies like this, regardless of how I indent line 28. I also varied the indentation of line 19, but made no progress. What do you think the problem is?

[Graphical user interface, text, application Description automatically generated]

Afera672 avatar Dec 19 '22 20:12 Afera672

Hi @Afera672, would it be possible to upload the notebook you're using as a github gist so I can follow along exactly and clearly how things are going? Thanks!

muellerzr avatar Dec 20 '22 00:12 muellerzr

@Zachary Hi Zachary,

Of course I can send you the notebook. Can you simply send me an email address to send it to you?

Thanks!!! Andrea

PS our offer to pay a fee for your consulting services is still open as well.

Afera672 avatar Dec 20 '22 16:12 Afera672

@muellerzr sorry...this is me trying to send you the file...

Afera672 avatar Dec 21 '22 02:12 Afera672

@muellerzr I beg your pardon, Zach. I am not very well versed with this interface. I will try to send it again now. No, it does not allow attaching .ipynb notebooks. I am sorry: it attaches this script only as a PDF. Here it is. I hope it is clear enough...

Multi-GPUs not working ASCI.pdf

Afera672 avatar Dec 21 '22 03:12 Afera672

Thanks @Afera672, your issue is this line, you shouldn't be re-making dataloaders:

learn = unet_learner(resnet34,dls,
    dls=TfmdDL(after_item=ToTensor(4,80,80),
               after_batch=[IntToFloatTensor(), *aug_transforms()],
               bs=8))

To fix (or at least get further), change that code to be:

learn = unet_learner(dls, resnet34)

muellerzr avatar Dec 21 '22 03:12 muellerzr

Thanks @muellerzr .

Yes, I am sorry: I actually posted the file before finding that it is indeed wrong. I found another issue as well, with the codes, and I am trying to fix it. My problem with such a simple dls is that I do not know how to apply the transform, or how to tell it to make all images a certain size. This is why I used a datablock before. Which form does a datablock need to have here? The regular one? I'll try. Thanks!!

Afera672 avatar Dec 21 '22 04:12 Afera672

I'd recommend opening a thread on the fastai forums for more help, since the issue is with the framework more than Accelerate specifically :)

https://forums.fast.ai

muellerzr avatar Dec 21 '22 04:12 muellerzr

Thank you for the insightful suggestion, @muellerzr, but I believe I have a strange problem. After I run:

Screenshot 2022-12-21 at 11 53 57

The problem is that now Accelerate needs me to use SegmentationDataLoaders. I need to insert transformations, but I do not know how to do it. Can you send me an example with SegmentationDataLoaders where you insert the 'Items_transform' or 'after_item' in order to standardize the images seen by the algorithm? Here is what it says when I run notebook_launcher:

Screenshot 2022-12-21 at 11 55 08

Thanks for your help!

Afera672 avatar Dec 21 '22 17:12 Afera672

@muellerzr

Hi Zack, I have an important update. I realized that SegmentationDataLoaders.from_label_func() is a function that evidently embeds both datablock and dataloader characteristics, so I inserted size standardization of the images. And it worked, AT FIRST:

Afera672 avatar Dec 21 '22 19:12 Afera672

Screenshot 2022-12-21 at 14 24 09

Afera672 avatar Dec 21 '22 19:12 Afera672

@muellerzr But if I start the calculation on more than 2 GPUs, it crashes with out-of-memory errors:

Screenshot 2022-12-21 at 14 34 20

Now, the reason why I want to use many GPUs is exactly to avoid this sort of error. Do you have any idea how I could manage the memory and/or ask Accelerate to do it for us? We plan to have MANY images to train on, and to use at least ResNet50, while now I am confined to ResNet34. Which is not bad but... Thank you for your time!

Afera672 avatar Dec 21 '22 19:12 Afera672

I also encountered this problem and don't know how to solve it. I know that CUDA must not be initialized before running notebook_launcher. But none of my previous code initialized CUDA or called torch.cuda. What should we do?

ckpt_path = 'baichuan13b_ner'

optimizer = bnb.optim.adamw.AdamW(peft_model.parameters(), lr=6e-05,is_paged=True) #'paged_adamw'

# Initialize the KerasModel
keras_model = KerasModel(peft_model, loss_fn =None, optimizer=optimizer)

# Load the fine-tuned weights
keras_model.load_ckpt(ckpt_path)

# Train on multiple GPUs
keras_model.fit_ddp(num_processes=2,
                    train_data=dl_train,
                    val_data=dl_val,
                    epochs=100,
                    patience=10,
                    monitor='val_loss',
                    mode='min',
                    ckpt_path=ckpt_path)

> ValueError                                Traceback (most recent call last)
> Cell In[30], line 12
>       9 keras_model.load_ckpt(ckpt_path)
>      11 # Train on multiple GPUs
> ---> 12 keras_model.fit_ddp(num_processes=2,
>      13                     train_data=dl_train,
>      14                     val_data=dl_val,
>      15                     epochs=100,
>      16                     patience=10,
>      17                     monitor='val_loss',
>      18                     mode='min',
>      19                     ckpt_path=ckpt_path)
> 
> File ~/anaconda3/envs/baichuan13b/lib/python3.9/site-packages/torchkeras/kerasmodel.py:282, in KerasModel.fit_ddp(self, num_processes, train_data, val_data, epochs, ckpt_path, patience, monitor, mode, callbacks, plot, wandb, quiet, mixed_precision, cpu, gradient_accumulation_steps)
>     279 from accelerate import notebook_launcher
>     280 args = (train_data,val_data,epochs,ckpt_path,patience,monitor,mode,
>     281     callbacks,plot,wandb,quiet,mixed_precision,cpu,gradient_accumulation_steps)
> --> 282 notebook_launcher(self.fit, args, num_processes=num_processes)
> 
> File ~/anaconda3/envs/baichuan13b/lib/python3.9/site-packages/accelerate/launchers.py:116, in notebook_launcher(function, args, num_processes, mixed_precision, use_port)
>     113 from torch.multiprocessing.spawn import ProcessRaisedException
>     115 if len(AcceleratorState._shared_state) > 0:
> --> 116     raise ValueError(
>     117         "To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized "
>     118         "inside your training function. Restart your notebook and make sure no cells initializes an "
>     119         "`Accelerator`."
>     120     )
>     122 if torch.cuda.is_initialized():
>     123     raise ValueError(
>     124         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
>     125         "using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "
>     126         "function."
>     127     )
> 
> ValueError: To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized inside your training function. Restart your notebook and make sure no cells initializes an `Accelerator`.
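
The check that raised this error is visible in the traceback: `len(AcceleratorState._shared_state) > 0`. A minimal sketch, assuming the installed Accelerate version still exposes that private attribute (it is not a public API and may change between releases), reports whether an Accelerator state already exists in the current kernel. The helper name is hypothetical:

```python
def accelerator_already_initialized():
    """Return True or False, or None when accelerate cannot be imported."""
    try:
        from accelerate.state import AcceleratorState
    except ImportError:
        return None
    # _shared_state is the private attribute seen in the traceback above;
    # fall back to an empty dict if a given release does not have it.
    shared = getattr(AcceleratorState, "_shared_state", {})
    return len(shared) > 0


print(accelerator_already_initialized())
```

If this prints `True` before `notebook_launcher` is called, something in an earlier cell has already created an `Accelerator`, possibly indirectly through a wrapper library.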

looperEit avatar Aug 02 '23 18:08 looperEit