
FileNotFoundError: [Errno 2] No such file or directory: 'numactl'

zhiwentian opened this issue 1 year ago • 4 comments

After finishing the install successfully, I got this error when running this command: python e2e_rlhf.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu


How can I fix this?

zhiwentian avatar Aug 28 '24 14:08 zhiwentian

Hi @zhiwentian - you will need to install numactl. Depending on your Linux distro, it might be apt-get install numactl - could you try that?
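For reference, a minimal sketch of installing and then verifying numactl (package names below are the common ones, but check your distro's repositories; DeepSpeed's launcher invokes the numactl binary by name, which is why a missing binary surfaces as FileNotFoundError):

```shell
# Install numactl -- pick the line for your distro (names may vary):
#   Debian/Ubuntu : sudo apt-get install -y numactl
#   RHEL/Fedora   : sudo dnf install -y numactl
#   Arch          : sudo pacman -S numactl
# Then confirm the binary is actually on PATH:
if command -v numactl >/dev/null 2>&1; then
  echo "numactl available"
else
  echo "numactl still missing"
fi
```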

loadams avatar Aug 28 '24 20:08 loadams


Thank you very much, I solved the problem by installing numactl. But now there is a new error - how can I fix it?

DeepSpeed general environment info:
deepspeed ................ 0.15.0
torch .................... 1.13.1+cu116

zhiwentian avatar Aug 29 '24 03:08 zhiwentian

Hi @zhiwentian - we would need to see more of your repro script, but it looks like fp16 isn't supported. Can you share what device you are running on?

loadams avatar Aug 29 '24 16:08 loadams


DeepSpeed general environment info:
deepspeed ................ 0.15.0
accelerate ............... 0.33.0
torch .................... 1.13.1+cu116
GPU ...................... NVIDIA A100 80GB

training.log:

```
[2024-08-30 20:21:17,019] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:17,027] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2024-08-30 20:21:18,175] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-08-30 20:21:18,178] [INFO] [runner.py:585:main] cmd = /mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage 0 --enable_tensorboard --tensorboard_path /mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b --deepspeed --output_dir /mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2024-08-30 20:21:19,380] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:19,389] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2024-08-30 20:21:20,540] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0]}
[2024-08-30 20:21:20,540] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-08-30 20:21:20,540] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-08-30 20:21:20,540] [INFO] [launch.py:164:main] dist_world_size=1
[2024-08-30 20:21:20,540] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-08-30 20:21:20,541] [INFO] [launch.py:256:main] process 3047354 spawned with command: ['/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b', '--deepspeed', '--output_dir', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b']
[2024-08-30 20:21:22,397] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:22,398] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
step1_supervised_finetuning_main()
[2024-08-30 20:21:23,754] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-30 20:21:23,754] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
Using /home/datagroup/.cache/torch_extensions/py310_cpu as PyTorch extensions root...
Emitting ninja build file /home/datagroup/.cache/torch_extensions/py310_cpu/deepspeed_shm_comm/build.ninja...
Building extension module deepspeed_shm_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_shm_comm...
Time to load deepspeed_shm_comm op: 0.09181809425354004 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_shm_comm_op built successfully
/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Using /home/datagroup/.cache/torch_extensions/py310_cpu as PyTorch extensions root...
Emitting ninja build file /home/datagroup/.cache/torch_extensions/py310_cpu/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 2.5335588455200195 seconds
[2024-08-30 20:21:35,513] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.5, git-hash=f447b18, git-branch=HEAD
[2024-08-30 20:21:35,514] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 398, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 333, in main
[rank0]:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
[rank0]:   File "/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank0]:     engine = DeepSpeedEngine(args=args,
[rank0]:   File "/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 242, in __init__
[rank0]:     self._do_sanity_check()
[rank0]:   File "/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1072, in _do_sanity_check
[rank0]:     raise ValueError("Type fp16 is not supported.")
[rank0]: ValueError: Type fp16 is not supported.
[2024-08-30 20:21:37,558] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3047354
[2024-08-30 20:21:37,559] [ERROR] [launch.py:325:sigkill_handler] ['/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b', '--deepspeed', '--output_dir', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
```

zhiwentian avatar Aug 30 '24 12:08 zhiwentian

@zhiwentian - are you still hitting this issue? It looks like you're hitting this check, which is weird since fp16 should be supported on your accelerator - could you check that?

https://github.com/microsoft/DeepSpeed/blob/c7f58c899f6f099a35d968bdad973f24b842c8c6/deepspeed/runtime/engine.py#L1079
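One clue in the log above is `Setting ds_accelerator to cpu (auto detect)`: PyTorch did not see the A100, so DeepSpeed selected its CPU backend, and that backend's sanity check is what rejects fp16. A minimal triage sketch to find which layer lost sight of the GPU (only assumes a POSIX shell; the `python -c` step is only meaningful if torch is installed in the active environment):

```shell
# Step 1: is the NVIDIA driver visible at all on this machine?
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "driver layer: nvidia-smi found"
else
  echo "driver layer: nvidia-smi not on PATH"
fi

# Step 2: does the torch build in this env see CUDA? A CPU-only wheel or a
# CUDA/driver mismatch here makes DeepSpeed fall back to the CPU backend.
python -c 'import torch; print("torch sees CUDA:", torch.cuda.is_available())' \
  2>/dev/null || echo "torch layer: torch not importable in this shell"
```

If step 1 succeeds but step 2 prints False, the usual suspects are a `+cpu` torch wheel or a driver/CUDA version mismatch with the installed torch build.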

loadams avatar Oct 31 '24 15:10 loadams

Hi @zhiwentian - closing this for now. If you are still hitting this, or someone hits it in the future, please comment and we can debug more/re-open.

loadams avatar Nov 06 '24 18:11 loadams