Question about using galvatron AssertionError: 50257 is not divisible by 4
Sorry if I previously cloned the wrong branch. When I tried to run the train_dist.sh script on this branch as described in the README, it failed with an error saying that a module named gpt was missing. I then switched to the other repository, where I hit the following error:
Traceback (most recent call last):
  File "train_dist.py", line 87, in <module>
    train(args)
  File "train_dist.py", line 33, in train
    model = construct_hybrid_parallel_model(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/models/gpt_hf/GPTModel_hybrid_parallel.py", line 12, in construct_hybrid_parallel_model
    hp_model = construct_hybrid_parallel_model_api(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/core/hybrid_parallel_model.py", line 114, in construct_hybrid_parallel_model_api
    model = construct_tensor_parallel_model(model, config, tp_groups_whole) # get_enc_groups(tp_groups_whole, module_types))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/models/gpt_hf/GPTModel_tensor_parallel.py", line 75, in construct_tensor_parallel_model
    setattr(model.transformer, 'wte', VocabParallelEmbedding(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py", line 194, in __init__
    ) = VocabUtility.vocab_range_from_global_vocab_size(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/tensor_parallel/utils.py", line 110, in vocab_range_from_global_vocab_size
    per_partition_vocab_size = divide(global_vocab_size, world_size)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/utils.py", line 22, in divide
    ensure_divisibility(numerator, denominator)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/utils.py", line 16, in ensure_divisibility
    assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
AssertionError: 50257 is not divisible by 4
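From the stack, the assertion comes from Megatron's VocabParallelEmbedding, which shards the embedding table across the 4 tensor-parallel ranks and therefore needs vocab_size to be a multiple of the TP degree. Below is a minimal sketch of the padding computation I believe Megatron-style frameworks apply (pad_vocab_size and the default of 128 are my own illustrative names/values, not Galvatron's actual API):

    # Sketch only: round the vocabulary up so VocabParallelEmbedding can
    # split it evenly across tensor-parallel ranks. Names are hypothetical.
    def pad_vocab_size(vocab_size: int, tp_size: int, divisible_by: int = 128) -> int:
        """Round vocab_size up to the next multiple of divisible_by * tp_size."""
        multiple = divisible_by * tp_size
        return ((vocab_size + multiple - 1) // multiple) * multiple

    # GPT-2's 50257-token vocabulary is not divisible by tp_size=4;
    # padding it gives a shardable size:
    print(pad_vocab_size(50257, tp_size=4))  # 50688 == 512 * 99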
I tried modifying vocab_size, but that did not fix it. I then tried the stable version in this repository and ran into the following error (every rank raised the same traceback; I have deduplicated the interleaved output below):
Traceback (most recent call last):
  File "train_dist.py", line 99, in <module>
    train(args)
  File "train_dist.py", line 42, in train
    gpt_model = GPTLMHeadModel(config, device='meta' if args.initialize_on_meta else 'cpu')
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 582, in __init__
    self.transformer = GPTModel(config, process_group=process_group, **factory_kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 466, in __init__
    [
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 467, in <listcomp>
    create_block(config, layer_idx=i, process_group=process_group, **factory_kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 279, in create_block
    block = Block(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/modules/block.py", line 68, in __init__
    self.mixer = mixer_cls(dim)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/modules/mha.py", line 456, in __init__
    raise ImportError("fused_dense is not installed")
ImportError: fused_dense is not installed
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2890126) of binary: /home/wyr/anaconda3/envs/galvatron/bin/python3
Traceback (most recent call last):
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2890127)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2890128)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2890129)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2890130)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2890132)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2890134)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2890137)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2890126)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
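Regarding the fused_dense ImportError: as far as I can tell, flash_attn's tensor-parallel MHA path requires the optional fused_dense_lib CUDA extension from the flash-attention source tree, which the prebuilt flash-attn wheel does not always include. I used this small probe to confirm the extension is missing from my environment (the build commands in the comment are my guess at the usual fix, not something from the Galvatron README):

    # Probe whether flash-attn's optional fused dense CUDA extension exists.
    try:
        import fused_dense_lib  # extension behind flash_attn.ops.fused_dense
        print("fused_dense_lib found:", fused_dense_lib.__file__)
    except ImportError as exc:
        print("fused_dense_lib missing:", exc)
        # Presumed fix: build the extension from the flash-attention sources:
        #   git clone https://github.com/Dao-AILab/flash-attention
        #   cd flash-attention/csrc/fused_dense_lib && pip install .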
Could you please help me resolve these issues, or suggest some possible solutions? Thank you for your help and support!
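P.S. Since every error_file above shows <N/A>, for the next run I plan to wrap the entrypoint in the @record decorator from torch.distributed.elastic (the mechanism the linked elastic errors page documents) so the per-rank tracebacks get written to the error files. A minimal sketch:

    # Sketch: decorating the launcher entrypoint makes torch.distributed.elastic
    # write the full per-rank traceback into error_file instead of <N/A>.
    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main() -> None:
        # ... call into the existing train(args) from train_dist.py here ...
        pass

    if __name__ == "__main__":
        main()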