Megatron-LM
merge_mp_partitions.py fails with an exception
When I run tools/merge_mp_partitions.py, it fails with an exception:
Traceback (most recent call last):
File "merge_mp_partitions.py", line 286, in <module>
main()
File "merge_mp_partitions.py", line 212, in main
merged_model = get_model(model_type)
File "merge_mp_partitions.py", line 125, in get_model
model = model_provider()
File "/data/gcooper/nlg-evaluation/Megatron-LM/pretrain_gpt2.py", line 35, in model_provider
model = GPT2Model(num_tokentypes=0, parallel_output=True)
File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/gpt2_model.py", line 51, in __init__
args.num_layers))
File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 62, in get_language_model
add_pooler=add_pooler)
File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 283, in __init__
self.num_tokentypes)
File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/model/language_model.py", line 123, in __init__
vocab_size, self.hidden_size, init_method=self.init_method)
File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 145, in __init__
partition_dim=0, stride=1)
File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/layers.py", line 58, in _initialize_affine_weight_gpu
with get_cuda_rng_tracker().fork():
File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/data/gcooper/nlg-evaluation/Megatron-LM/megatron/mpu/random.py", line 183, in fork
raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added
When training, the RNG state gets set in initialize_megatron(), but that is not called in this case.
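A possible workaround, sketched against the older megatron.mpu API (the exact insertion point in tools/merge_mp_partitions.py and the seed value are assumptions, not code from the repo): register the 'model-parallel-rng' state yourself before the model is built, e.g.

from megatron import mpu

# Register the 'model-parallel-rng' CUDA RNG state that
# _initialize_affine_weight_gpu() forks into. The seed is arbitrary here
# (an assumed value): the randomly initialized weights are overwritten
# anyway when the merged checkpoint is loaded.
mpu.model_parallel_cuda_manual_seed(1234)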
Hello, has anybody solved this problem? Is there a workaround? Thanks.
Same problem here
I have a similar problem.
Traceback (most recent call last):
File "/data/liuguang/Sailing/tests/test_trainer_deepspeed.py", line 193, in <module>
print(model(**batch))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1589, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/liuguang/Sailing/easybigmodel/model/glm_model.py", line 305, in forward
model_out = self.model(input_ids, position_ids, attention_mask)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/liuguang/Sailing/easybigmodel/model/glm_model_mpu.py", line 122, in forward
transformer_output = self.transformer(embeddings, position_ids, attention_mask, mems,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 655, in forward
hidden_states = layer(*args, mem=mem_i)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/liuguang/Sailing/easybigmodel/model/blocks/transformer_mpu.py", line 402, in forward
attention_output = self.attention(layernorm_output, ltor_mask, position_embeddings, r_w_bias, r_r_bias, mem)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/liuguang/Sailing/easybigmodel/model/layers/attentions_mpu.py", line 394, in forward
with get_cuda_rng_tracker().fork():
File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 174, in fork
raise Exception('cuda rng state {} is not added'.format(name))
Exception: cuda rng state model-parallel-rng is not added
Besides, what is the with get_cuda_rng_tracker().fork(): block actually doing?
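For what it is worth, here is a rough sketch of the pattern behind get_cuda_rng_tracker().fork() (a simplified illustration, not the actual Megatron/DeepSpeed code): the tracker keeps named CUDA RNG states, and fork() temporarily swaps the tracked 'model-parallel-rng' state in so that ops inside the block (e.g. dropout in tensor-parallel regions) draw from a separate, reproducible stream, restoring the default state on exit. The exception above is raised because that named state was never add()-ed.

import contextlib
import torch

class SimpleCudaRNGTracker:
    # Simplified stand-in for Megatron's CudaRNGStatesTracker (illustration only).
    def __init__(self):
        self.states = {}

    def add(self, name, seed):
        # Seed the device RNG, capture that state under `name`, then restore
        # whatever state was active before.
        orig = torch.cuda.get_rng_state()
        torch.cuda.manual_seed(seed)
        self.states[name] = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(orig)

    @contextlib.contextmanager
    def fork(self, name='model-parallel-rng'):
        if name not in self.states:
            raise Exception('cuda rng state {} is not added'.format(name))
        orig = torch.cuda.get_rng_state()
        torch.cuda.set_rng_state(self.states[name])
        try:
            yield
        finally:
            # Persist the advanced tracked state and restore the default stream.
            self.states[name] = torch.cuda.get_rng_state()
            torch.cuda.set_rng_state(orig)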
Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.
I got the same problem. Have you solved it ?
This is caused by the RNG state not being initialized. The code below should work:
import torch
import torch.distributed as dist
from megatron.core import mpu, tensor_parallel

# Set up the default process group; rank/world size come from the launcher env.
dist.init_process_group()
torch.cuda.set_device(dist.get_rank())

# Create the model-parallel groups, then seed the 'model-parallel-rng' CUDA
# state that get_cuda_rng_tracker().fork() expects to find.
mpu.initialize_model_parallel(xxxx)  # xxxx = tensor model parallel size
tensor_parallel.random.model_parallel_cuda_manual_seed(xxx)  # xxx = RNG seed
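One note on running this (the launcher command and script name are just an example, not from this thread): dist.init_process_group() with no arguments reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment, so launching the script with torchrun sets everything up, e.g. torchrun --nproc_per_node=8 your_script.py.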
There is also a related suggestion to extend the script to merge both tensor and pipeline parallelism, and to provide a script for splitting a checkpoint back into separate partitions. That may be worth looking into as well.