
Question about using Galvatron: AssertionError: 50257 is not divisible by 4

Open CannonWWW opened this issue 8 months ago • 1 comment

Sorry if I cloned the wrong branch previously. When I ran the train_dist.sh script on that branch following the README, it failed with an error saying a module named gpt was missing. I then switched to another repository, but ran into the following error:

Traceback (most recent call last):
  File "train_dist.py", line 87, in <module>
    train(args)
  File "train_dist.py", line 33, in train
    model = construct_hybrid_parallel_model(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/models/gpt_hf/GPTModel_hybrid_parallel.py", line 12, in construct_hybrid_parallel_model
    hp_model = construct_hybrid_parallel_model_api(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/core/hybrid_parallel_model.py", line 114, in construct_hybrid_parallel_model_api
    model = construct_tensor_parallel_model(model, config, tp_groups_whole) # get_enc_groups(tp_groups_whole, module_types))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/models/gpt_hf/GPTModel_tensor_parallel.py", line 75, in construct_tensor_parallel_model
    setattr(model.transformer, 'wte', VocabParallelEmbedding(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py", line 194, in __init__
    ) = VocabUtility.vocab_range_from_global_vocab_size(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/tensor_parallel/utils.py", line 110, in vocab_range_from_global_vocab_size
    per_partition_vocab_size = divide(global_vocab_size, world_size)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/utils.py", line 22, in divide
    ensure_divisibility(numerator, denominator)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/galvatron/site_package/megatron/core/utils.py", line 16, in ensure_divisibility
    assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
AssertionError: 50257 is not divisible by 4
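For reference, my understanding of the assertion: the tensor-parallel embedding (VocabParallelEmbedding) splits the vocabulary evenly across the TP group, so vocab_size must be a multiple of the TP degree (4 here), and GPT-2's 50257 is not. A minimal padding sketch to illustrate what I think is needed (my own illustration, not Galvatron's or Megatron's actual API; Megatron itself pads using a make-vocab-size-divisible-by factor):

def pad_vocab_size(vocab_size: int, tp_size: int) -> int:
    # Round vocab_size up to the next multiple of tp_size so that
    # the embedding table can be split evenly across the TP ranks.
    remainder = vocab_size % tp_size
    return vocab_size if remainder == 0 else vocab_size + (tp_size - remainder)

print(pad_vocab_size(50257, 4))  # 50260, which is divisible by 4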

I tried modifying the vocab_size, but it still didn't work. I then tried the stable version in this repository and encountered the following error (stderr from all eight ranks was interleaved; below is the deduplicated traceback, which is identical on every rank):

Traceback (most recent call last):
  File "train_dist.py", line 99, in <module>
    train(args)
  File "train_dist.py", line 42, in train
    gpt_model = GPTLMHeadModel(config, device='meta' if args.initialize_on_meta else 'cpu')
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 582, in __init__
    self.transformer = GPTModel(config, process_group=process_group, **factory_kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 466, in __init__
    [
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 467, in <listcomp>
    create_block(config, layer_idx=i, process_group=process_group, **factory_kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/models/gpt.py", line 279, in create_block
    block = Block(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/modules/block.py", line 68, in __init__
    self.mixer = mixer_cls(dim)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/flash_attn/modules/mha.py", line 456, in __init__
    raise ImportError("fused_dense is not installed")
ImportError: fused_dense is not installed
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2890126) of binary: /home/wyr/anaconda3/envs/galvatron/bin/python3
Traceback (most recent call last):
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2890127)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2890128)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2890129)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2890130)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2890132)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2890134)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2890137)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-21_23:09:56
  host      : SYS-4029GP-TRTC-ZY001
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2890126)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
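
If it helps narrow things down, the second failure looks like flash_attn's optional fused_dense CUDA extension is missing in my environment. A quick check I used (the module name and install path follow my reading of the flash-attention source tree, so treat them as assumptions):

# Check whether flash-attn's optional fused_dense CUDA extension is importable;
# flash_attn/modules/mha.py raises "fused_dense is not installed" when it is not.
try:
    import fused_dense_lib  # extension built from flash-attention's csrc/fused_dense_lib
    print("fused_dense_lib is available")
except ImportError:
    # Building it from the flash-attention source tree is presumably something
    # like: cd flash-attention/csrc/fused_dense_lib && pip install .
    print("fused_dense_lib is missing")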

Could you please help me resolve this issue, or suggest some possible solutions? Thank you for your help and support!

CannonWWW · Jun 06 '24 11:06