FileNotFoundError: [Errno 2] No such file or directory: 'numactl'
After finishing the install successfully, I got this error when I ran this command: python e2e_rlhf.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
How can I fix it, please?
Hi @zhiwentian - you will need to install numactl. Depending on your Linux distro it might be apt-get install numactl - could you try that?
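As a quick sanity check (a minimal sketch, not part of DeepSpeed itself), you can confirm numactl is on PATH from the same Python environment before re-running the launcher; the distro package commands mentioned below are assumptions for common distros.

```python
# Minimal sanity check (sketch): confirm numactl is on PATH before re-running
# the DeepSpeed launcher. The install commands below are common-distro
# assumptions (apt for Debian/Ubuntu, dnf/yum for RHEL/Fedora).
import shutil

numactl_path = shutil.which("numactl")
if numactl_path is None:
    raise RuntimeError(
        "numactl not found on PATH; install it first, e.g. "
        "'apt-get install numactl' or 'dnf install numactl'."
    )
print("numactl found at:", numactl_path)
```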
Thank you very much, I solved that problem by installing numactl. But now there is a new error; how can I fix it?
DeepSpeed general environment info:
deepspeed................................0.15.0
torch............................................1.13.1+cu116
Hi @zhiwentian - we would need to see more of your repro script, but it looks like fp16 isn't supported. Can you share what device you are running on?
DeepSpeed general environment info:
deepspeed................................0.15.0
accelerate...............................0.33.0
torch....................................1.13.1+cu116
GPU......................................NVIDIA A100 80GB
training.log:
```
[2024-08-30 20:21:17,019] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:17,027] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2024-08-30 20:21:18,175] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-08-30 20:21:18,178] [INFO] [runner.py:585:main] cmd = /mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage 0 --enable_tensorboard --tensorboard_path /mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b --deepspeed --output_dir /mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2024-08-30 20:21:19,380] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:19,389] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2024-08-30 20:21:20,540] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0]}
[2024-08-30 20:21:20,540] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-08-30 20:21:20,540] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-08-30 20:21:20,540] [INFO] [launch.py:164:main] dist_world_size=1
[2024-08-30 20:21:20,540] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-08-30 20:21:20,541] [INFO] [launch.py:256:main] process 3047354 spawned with command: ['/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b', '--deepspeed', '--output_dir', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b']
[2024-08-30 20:21:22,397] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:22,398] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
step1_supervised_finetuning_main()
[2024-08-30 20:21:23,754] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-30 20:21:23,754] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
Using /home/datagroup/.cache/torch_extensions/py310_cpu as PyTorch extensions root...
Emitting ninja build file /home/datagroup/.cache/torch_extensions/py310_cpu/deepspeed_shm_comm/build.ninja...
Building extension module deepspeed_shm_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_shm_comm...
Time to load deepspeed_shm_comm op: 0.09181809425354004 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_shm_comm_op built successfully
/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Using /home/datagroup/.cache/torch_extensions/py310_cpu as PyTorch extensions root...
Emitting ninja build file /home/datagroup/.cache/torch_extensions/py310_cpu/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 2.5335588455200195 seconds
[2024-08-30 20:21:35,513] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.5, git-hash=f447b18, git-branch=HEAD
[2024-08-30 20:21:35,514] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 398, in
```
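The log above shows DeepSpeed falling back to the CPU accelerator even though the machine has an A100, so it is worth confirming that this torch build can actually see the GPU. A minimal check, assuming it is run in the same conda env used to launch training:

```python
# Minimal check (sketch): confirm the installed torch build can see the GPU.
# If torch.cuda.is_available() is False, DeepSpeed's auto-detection falls
# back to CPU, which matches the "Setting accelerator to CPU" warning above.
import torch

print(torch.__version__, torch.version.cuda)       # expected 1.13.1+cu116 per the env info above
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))  # expected: NVIDIA A100 80GB
```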
@zhiwentian - are you still hitting this issue? It looks like you're hitting this check, and it's weird that fp16 wouldn't be supported on your accelerator - could you check that?
https://github.com/microsoft/DeepSpeed/blob/c7f58c899f6f099a35d968bdad973f24b842c8c6/deepspeed/runtime/engine.py#L1079
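A quick way to see what that check observes is to query DeepSpeed's accelerator directly (a minimal sketch using deepspeed.accelerator.get_accelerator(); run it in the same environment used for training):

```python
# Minimal diagnostic sketch: ask DeepSpeed which accelerator it selected and
# whether that accelerator reports fp16 support (the condition engine.py checks).
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print("accelerator:", acc.device_name())          # 'cuda' expected on an A100 node; 'cpu' means the GPU was not detected
print("fp16 supported:", acc.is_fp16_supported())
print("bf16 supported:", acc.is_bf16_supported())
```

If this prints 'cpu', the fp16 failure is likely a symptom of the GPU not being detected rather than a real fp16 limitation of the A100.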
Hi @zhiwentian - closing this for now, if you are still hitting this or someone hits this in the future please comment and we can debug more/re-open.