[BUG] Config mesh_device None
I am using DeepSpeed 0.15.1 on two A6000 GPUs, following the Hugging Face Non-Trainer DeepSpeed integration guide, and I get an assertion error:
guanhua@guanhua-Lambda:~/DiscQuant$ deepspeed test_hf_ds.py
[2024-09-06 15:53:29,210] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:29,660] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:30,664] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-06 15:53:30,664] [INFO] [runner.py:585:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test_hf_ds.py
[2024-09-06 15:53:32,031] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:32,476] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:33,468] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-06 15:53:33,468] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-06 15:53:33,468] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-06 15:53:33,468] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-06 15:53:33,468] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-06 15:53:33,469] [INFO] [launch.py:256:main] process 513898 spawned with command: ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=0']
[2024-09-06 15:53:33,469] [INFO] [launch.py:256:main] process 513899 spawned with command: ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=1']
[2024-09-06 15:53:34,951] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:34,990] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:35,366] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:53:35,401] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-06 15:54:00,929] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2024-09-06 15:54:00,930] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
Traceback (most recent call last):
  File "/home/guanhua/DiscQuant/test_hf_ds.py", line 47, in <module>
    model = AutoModel.from_pretrained("openai-community/gpt2")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3821, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 933, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 798, in __init__
    self._configure_train_batch_size()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 981, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/guanhua/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 929, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1
(the same traceback is printed, interleaved, by the second rank)
[2024-09-06 15:54:01,510] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 513898
[2024-09-06 15:54:01,532] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 513899
[2024-09-06 15:54:01,532] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'test_hf_ds.py', '--local_rank=1'] exits with return code = 1
I think the root cause is the [config.py:733:__init__] Config mesh_device None world_size = 1 line: somehow the DeepSpeed config init is not passed the correct mesh_device argument, so it computes world_size = 1 when it should be 2.
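For reference, the failing check is just this arithmetic, and with world_size wrongly computed as 1 it cannot hold (a minimal sketch paraphrasing the assertion in deepspeed/runtime/config.py; variable names are illustrative, values come from the config below and the log above):

# Paraphrase of DeepSpeed's batch-size consistency check.
train_batch = 2   # "train_batch_size" in the config
micro_batch = 1   # "train_micro_batch_size_per_gpu"
grad_acc = 1      # "gradient_accumulation_steps"
world_size = 1    # computed by DeepSpeedConfig; should be 2 on two GPUs
assert train_batch == micro_batch * grad_acc * world_size  # 2 != 1 * 1 * 1, raises AssertionError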
To reproduce, below is the Python script I am using; the command is deepspeed --num_gpus 2 test_hf_ds.py
from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel
import deepspeed

ds_config = {
    # "fp16": {
    #     "enabled": "auto",
    #     "loss_scale": 0,
    #     "loss_scale_window": 1000,
    #     "initial_scale_power": 16,
    #     "hysteresis": 2,
    #     "min_loss_scale": 1
    # },
    "bf16": {
        "enabled": "auto"
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "gather_16bit_weights_on_model_save": True
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "train_batch_size": 2,
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1e5,
    "wall_clock_breakdown": False,
    "data_parallel_size": 2
}

# HfDeepSpeedConfig must be created (and kept alive) before from_pretrained
# so that the model is loaded with ZeRO-3 partitioning.
ds_cf = HfDeepSpeedConfig(ds_config)
model = AutoModel.from_pretrained("openai-community/gpt2")
# deepspeed.initialize returns an (engine, optimizer, dataloader, lr_scheduler) tuple
engine, *_ = deepspeed.initialize(model=model, config_params=ds_config, dist_init_required=True)
Did you manage to fix this?
Got the same error here :( Any updates on this? Thanks!
+1
Looks like deepspeed.init_distributed() must be called at the beginning.
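For example (a sketch of this suggestion applied to the repro above; the config is trimmed for brevity, and I haven't verified it fixes the issue):

import deepspeed
from transformers import AutoModel
from transformers.integrations import HfDeepSpeedConfig

# Initialize torch.distributed before anything reads the world size.
deepspeed.init_distributed()

# Trimmed version of the config from the repro above.
ds_config = {
    "zero_optimization": {"stage": 3},
    "train_batch_size": 2,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

ds_cf = HfDeepSpeedConfig(ds_config)  # keep alive for ZeRO-3 init
model = AutoModel.from_pretrained("openai-community/gpt2")
engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)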
> Looks like deepspeed.init_distributed() must be called at the beginning.

That doesn't seem to work for me.
I'm also facing the same issue, but the workaround below worked for me. I'm using accelerate with a DeepSpeed config.
import os

import deepspeed
from transformers import AutoModelForCausalLM

# Initialize torch.distributed before loading the model.
deepspeed.init_distributed()

# Pin the model to this rank's GPU instead of device_map="auto".
current_device = "cuda:{}".format(os.environ.get("LOCAL_RANK", "0"))
model = AutoModelForCausalLM.from_pretrained(
    model_path,  # your model path
    torch_dtype="auto",
    device_map=current_device,
)
The key change was computing current_device and passing it as device_map instead of using "auto". This led to the warning below:

UserWarning: No device id is provided via init_process_group or barrier. Using the current device set by the user.

but training started.
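If that warning bothers you, one common approach (an untested sketch, not from this thread) is to set the current CUDA device for each rank before initializing the process group, so "the current device set by the user" is the right one:

import os

import torch
import deepspeed

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)  # make this rank's GPU the "current device"
deepspeed.init_distributed()       # the process group now uses the device set above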