
ERROR:torch.distributed.elastic.multiprocessing.api:failed

Open ZhuJD-China opened this issue 1 year ago • 8 comments

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2995886) of binary: /usr/bin/python3

@dl:~/llama$ CUDA_VISIBLE_DEVICES="5,6,7" torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 22.04 seconds
Traceback (most recent call last):
  File "example_chat_completion.py", line 89, in <module>
    fire.Fire(main)
  File "/8T/zjd/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/8T/zjd/.local/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/8T/zjd/.local/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_chat_completion.py", line 72, in main
    results = generator.chat_completion(
  File "/8T/zjd/llama/llama/generation.py", line 268, in chat_completion
    generation_tokens, generation_logprobs = self.generate(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/8T/zjd/llama/llama/generation.py", line 117, in generate
    assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)
AssertionError: (6, 4)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2995886) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED


Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-08-23_10:10:21
  host       : dl
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 2995886)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

ZhuJD-China avatar Aug 23 '23 02:08 ZhuJD-China

Did you try --max_batch_size 6?

EmanuelaBoros avatar Aug 23 '23 10:08 EmanuelaBoros
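For reference, the `AssertionError: (6, 4)` above comes from a batch-size check in `llama/generation.py`: the example script submits 6 sample dialogs in one `chat_completion` call, while the model's cache is only allocated for `max_batch_size` sequences. Below is a minimal sketch of that check, using the values from the traceback; the names are simplified stand-ins, not the actual library code.

```python
# Simplified reproduction of the failing check (values taken from the traceback above).
max_batch_size = 4                           # value passed via --max_batch_size
dialogs = [f"dialog_{i}" for i in range(6)]  # the example script batches 6 sample dialogs

bsz = len(dialogs)
# This deliberately raises, mirroring the report: AssertionError: (6, 4)
assert bsz <= max_batch_size, (bsz, max_batch_size)
```

So the fix is either passing `--max_batch_size 6` (or larger), as suggested above, or trimming the dialogs list in `example_chat_completion.py` to at most 4 entries.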

Hey, I was getting the same error. When I tried running with --max_batch_size 6, I got this:

```
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 156.47 seconds                                                                                                                                                                                    
Traceback (most recent call last):                                                                                                                                                                          
  File "/efs/users/manjunathan/OpenLLMs/llama/example_chat_completion.py", line 89, in <module>                                                                                                             
    fire.Fire(main)                                                                                                                                                                                         
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire                                                                                                
    component_trace = _Fire(component, args, parsed_flag_args, context, name)                                                                                                                               
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire                                                                                               
    component, remaining_args = _CallAndUpdateTrace(                                                                                                                                                        
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace                                                                                 
    component = fn(*varargs, **kwargs)                                                                                                                                                                      
  File "/efs/users/manjunathan/OpenLLMs/llama/example_chat_completion.py", line 72, in main                                                                                                                 
    results = generator.chat_completion(                                                                                                                                                                    
  File "/efs/users/manjunathan/OpenLLMs/llama/llama/generation.py", line 271, in chat_completion                                                              
    generation_tokens, generation_logprobs = self.generate(                                                                                                   
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                        
    return func(*args, **kwargs)                                                                                                                              
  File "/efs/users/manjunathan/OpenLLMs/llama/llama/generation.py", line 138, in generate                                                                     
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)                                                                                        
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                        
    return func(*args, **kwargs)                                                                                                                              
  File "/efs/users/manjunathan/OpenLLMs/llama/llama/model.py", line 285, in forward                                                                           
    h = layer(h, start_pos, freqs_cis, mask)                                                                                                                  
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                             
    return forward_call(*args, **kwargs)                                                                                                                      
  File "/efs/users/manjunathan/OpenLLMs/llama/llama/model.py", line 239, in forward                                                                           
    h = x + self.attention.forward(                                                                                                                           
  File "/efs/users/manjunathan/OpenLLMs/llama/llama/model.py", line 153, in forward                                                                                               
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)                                                                                                           
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                             
    return forward_call(*args, **kwargs)                                                                                                                      
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/fairscale/nn/model_parallel/layers.py", line 290, in forward                                          
    output_parallel = F.linear(input_parallel, self.weight, self.bias)                                                                                        
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`                                                                   
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4055) of binary: /efs/users/manjunathan/OpenLLMs/llmenv/bin/python3                                            
Traceback (most recent call last):                                                                                                                            
  File "/efs/users/manjunathan/OpenLLMs/llmenv/bin/torchrun", line 8, in <module>                                                                             
    sys.exit(main())                                                                                                                                          
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper                                             
    return f(*args, **kwargs)                                                                                                                                 
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main                                      
    run(args)                                                                  
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run                                                                                     
    elastic_launch(                                                                                                                                           
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__                                                                       
    return launch_agent(self._config, self._entrypoint, list(args))                                                                                           
  File "/efs/users/manjunathan/OpenLLMs/llmenv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent                                                                   
    raise ChildFailedError(                                                              
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:                                                                                            
============================================================                                          
example_chat_completion.py FAILED
```

cc: @EmanuelaBoros 

manjunathan-msd avatar Aug 23 '23 11:08 manjunathan-msd

Thank you! I solved it.

ZhuJD-China avatar Aug 23 '23 12:08 ZhuJD-China

I ran into the same problem and tried setting --max_batch_size 6, but it didn't work. How did you solve it? Thanks very much! @ZhuJD-China

joeyz0z avatar Aug 23 '23 14:08 joeyz0z

@ZhuJD-China Great that you solved it. The second error is CUDA-related (`RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)`).

@joeyz0z what is the error stack?

EmanuelaBoros avatar Aug 23 '23 16:08 EmanuelaBoros
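On that second error: `CUBLAS_STATUS_NOT_INITIALIZED` at `cublasCreate(handle)` usually means cuBLAS could not allocate its workspace on the selected GPU, most often because the device is already out of memory or not actually visible to the process. A small diagnostic sketch follows (an assumption, not from this thread; it needs a reasonably recent PyTorch in the same environment that torchrun uses).

```python
# Quick GPU sanity check to run before torchrun; if free memory on the target
# device is close to zero, cublasCreate() will typically fail as shown above.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
    print(f"GPU {i} ({torch.cuda.get_device_name(i)}): "
          f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```

The llama-2-7b weights alone take roughly 14 GB in fp16, so a card with less free memory than that will typically fail at the first linear layer, much like the trace above.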

I have the same error. I'm running this: torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

And the output:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 333202) of binary: /home/akemid/llama/venv/bin/python3
Traceback (most recent call last):
  File "/home/akemid/llama/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/akemid/llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/akemid/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/akemid/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/akemid/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/akemid/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_chat_completion.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-08-31_23:12:13
  host       : LAPTOP-ISPM103M.
  rank       : 0 (local_rank: 0)
  exitcode   : -9 (pid: 333202)
  error_file : <N/A>
  traceback  : Signal 9 (SIGKILL) received by PID 333202

Akemid avatar Sep 01 '23 04:09 Akemid
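Note that exit code -9 is a different failure from the earlier ones: the worker was killed by SIGKILL from outside Python, which on Linux is most often the kernel's OOM killer reaping the process while the checkpoint (roughly 14 GB for llama-2-7b) is loaded into host RAM. A quick way to check available memory before launching, sketched below with the standard library only (the sysconf names are Linux-specific).

```python
# Report available host RAM; falling well short of the checkpoint size
# usually ends in the loader process being killed with SIGKILL (-9).
import os

page = os.sysconf("SC_PAGE_SIZE")      # bytes per page
avail = os.sysconf("SC_AVPHYS_PAGES")  # pages currently available
print(f"Available RAM: {page * avail / 2**30:.1f} GiB")
```

If RAM is indeed the limit, lowering `--max_seq_len` and `--max_batch_size` helps only marginally since the checkpoint load dominates; freeing memory, adding swap, or using a machine with more RAM is usually the real fix. An `Out of memory: Killed process <pid>` line in `dmesg` would confirm the diagnosis.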

> Thank you! I solved it.

How?

outOFFspace avatar Nov 01 '23 22:11 outOFFspace

> Thank you! I solved it.

How did you solve it?

lsm140 avatar Dec 07 '23 08:12 lsm140