
merge_mp_partitions.py fails with an exception

Open gcooper-isi opened this issue 4 years ago • 9 comments

When I run tools/merge_mp_partitions.py, it fails with an exception:

Traceback (most recent call last):
  File "merge_mp_partitions.py", line 286, in <module>
    main()
  File "merge_mp_partitions.py", line 212, in main
    merged_model = get_model(model_type)
  File "merge_mp_partitions.py", line 125, in get_model
    model = model_provider()
  File "/data/gcooper/nlg-evaluation/Megatron-LM/pretrain_gpt2.py", line 35, in model_provider
    model = GPT2Model(num_tokentypes=0, parallel_output=True)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/gpt2_model.py", line 51, in __init__
    args.num_layers))
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 62, in get_language_model
    add_pooler=add_pooler)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 283, in __init__
    self.num_tokentypes)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 123, in __init__
    vocab_size, self.hidden_size, init_method=self.init_method)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 145, in __init__
    partition_dim=0, stride=1)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 58, in _initialize_affine_weight_gpu
    with get_cuda_rng_tracker().fork():
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/random.py", line 183, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added

When training, the RNG state gets set in initialize_megatron(), but that is not called in this case.
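
A minimal sketch of a possible workaround, assuming the megatron.mpu API of this era (megatron/mpu/random.py exposes the tracker and its add() method), would be to register the missing state by hand before get_model() builds the layers; the seed value here is arbitrary, since the merged weights are loaded from the checkpoints afterwards anyway:

from megatron import mpu

# Hypothetical workaround sketch: register the 'model-parallel-rng' CUDA RNG
# state that _initialize_affine_weight_gpu() forks into. During training this
# is normally done by initialize_megatron() via model_parallel_cuda_manual_seed().
mpu.get_cuda_rng_tracker().add('model-parallel-rng', 1234)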

gcooper-isi avatar Nov 20 '20 22:11 gcooper-isi

Hello, has anybody solved this problem? Is there a workaround? Thanks.

hejjack avatar Jan 09 '21 14:01 hejjack

Same problem here

Lavenderjiang avatar Sep 27 '21 00:09 Lavenderjiang

I have a similar problem.

Traceback (most recent call last):
  File "/data/liuguang/Sailing/tests/test_trainer_deepspeed.py", line 193, in <module>
    print(model(**batch))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1589, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/glm_model.py", line 305, in forward
    model_out= self.model(input_ids, position_ids, attention_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/glm_model_mpu.py", line 122, in forward
    transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 655, in forward
    hidden_states = layer(*args, mem=mem_i)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 402, in forward
    attention_output = self.attention(layernorm_output, ltor_mask, position_embeddings, r_w_bias, r_r_bias, mem)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/liuguang/Sailing/easybigmodel/model/layers/attentions_mpu.py", line 394, in forward
    with get_cuda_rng_tracker().fork():
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 174, in fork
    raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added

Besides, what does the with get_cuda_rng_tracker().fork(): context manager actually do?

marscrazy avatar Mar 15 '22 08:03 marscrazy
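
For context, the tracker behind that call works roughly like the simplified sketch below (based on the CUDA RNG state trackers in megatron/mpu/random.py and DeepSpeed's activation checkpointing; the real code uses a lower-level _set_cuda_rng_state helper, but the behavior is the same). fork() temporarily swaps the CUDA RNG over to a separately seeded 'model-parallel-rng' stream so that dropout and weight initialization inside tensor-parallel regions stay reproducible per rank, and it raises the exception above if that state was never registered:

import contextlib
import torch

class CudaRNGStatesTracker:
    """Simplified sketch of the CUDA RNG state tracker."""

    def __init__(self):
        self.states_ = {}

    def add(self, name, seed):
        # Seed the CUDA RNG, capture the resulting state under `name`,
        # then restore whatever state was active before.
        orig_state = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states_[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig_state)

    @contextlib.contextmanager
    def fork(self, name='model-parallel-rng'):
        if name not in self.states_:
            raise Exception('cuda rng state {} is not added'.format(name))
        # Swap in the tracked state for the duration of the `with` block...
        orig_state = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states_[name])
        try:
            yield
        finally:
            # ...remember where it ended up, then restore the original state.
            self.states_[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig_state)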

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 10 '23 18:07 github-actions[bot]

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Sep 19 '23 18:09 github-actions[bot]

I got the same problem. Have you solved it ?

ZhangEnmao avatar Feb 06 '24 12:02 ZhangEnmao

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Apr 07 '24 18:04 github-actions[bot]

It is caused by the RNG state not being initialized. The code below should work:

import torch
import torch.distributed as dist
from megatron.core import mpu, tensor_parallel

# Needs the usual torch.distributed environment (RANK, WORLD_SIZE,
# MASTER_ADDR, MASTER_PORT), e.g. as set up by torchrun.
dist.init_process_group()
torch.cuda.set_device(dist.get_rank())
# Set up the model-parallel process groups (pass your parallel sizes here).
mpu.initialize_model_parallel(xxxx)
# Seed the CUDA RNGs and register the 'model-parallel-rng' tracker state.
tensor_parallel.random.model_parallel_cuda_manual_seed(xxx)
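
For a single-process merge (tensor-parallel size 1), this presumably has to be launched through a tool like torchrun so that dist.init_process_group() finds RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment (the script name below is just a placeholder):

torchrun --nproc_per_node=1 your_merge_script.py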

tlogn avatar May 07 '24 13:05 tlogn

There is a related suggestion to extend the script so it merges both tensor and pipeline parallelism, and also to provide a script for splitting a checkpoint back into separate partitions. That may be worth looking into as well.

felipeliliti avatar May 07 '24 13:05 felipeliliti

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jul 06 '24 18:07 github-actions[bot]