
FileNotFoundError: [Errno 2] No such file or directory: 'numactl'

zhiwentian opened this issue 1 year ago • 4 comments

After finishing the install successfully, I got this error when running this command: python e2e_rlhf.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu


How can I fix this?

zhiwentian avatar Aug 28 '24 14:08 zhiwentian

Hi @zhiwentian - you will need to install numactl. Depending on your Linux distro, it might be apt-get install numactl - could you try that?
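For reference, a minimal sketch of installing and then verifying numactl (package names below are the common ones, but check your distro's repositories; DeepSpeed's launcher invokes the numactl binary by name, which is why a missing binary surfaces as FileNotFoundError):

```shell
# Install numactl -- pick the line for your distro (names may vary):
#   Debian/Ubuntu : sudo apt-get install -y numactl
#   RHEL/Fedora   : sudo dnf install -y numactl
#   Arch          : sudo pacman -S numactl
# Then confirm the binary is actually on PATH:
if command -v numactl >/dev/null 2>&1; then
  echo "numactl available"
else
  echo "numactl still missing"
fi
```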

loadams avatar Aug 28 '24 20:08 loadams


Thank you very much, I solved the problem by installing numactl. But now there is a new error - how can I fix it?

DeepSpeed general environment info:
deepspeed ................ 0.15.0
torch .................... 1.13.1+cu116

zhiwentian avatar Aug 29 '24 03:08 zhiwentian

Hi @zhiwentian - we would need to see more of your repro script, but it looks like fp16 isn't supported. Can you share what device you are running on?

loadams avatar Aug 29 '24 16:08 loadams


DeepSpeed general environment info:
deepspeed ................ 0.15.0
accelerate ............... 0.33.0
torch .................... 1.13.1+cu116
GPU ...................... NVIDIA A100 80GB

training.log:

```
[2024-08-30 20:21:17,019] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:17,027] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2024-08-30 20:21:18,175] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-08-30 20:21:18,178] [INFO] [runner.py:585:main] cmd = /mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage 0 --enable_tensorboard --tensorboard_path /mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b --deepspeed --output_dir /mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2024-08-30 20:21:19,380] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:19,389] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
[2024-08-30 20:21:20,540] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0]}
[2024-08-30 20:21:20,540] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-08-30 20:21:20,540] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-08-30 20:21:20,540] [INFO] [launch.py:164:main] dist_world_size=1
[2024-08-30 20:21:20,540] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-08-30 20:21:20,541] [INFO] [launch.py:256:main] process 3047354 spawned with command: ['/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b', '--deepspeed', '--output_dir', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b']
[2024-08-30 20:21:22,397] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-30 20:21:22,398] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
step1_supervised_finetuning_main()
[2024-08-30 20:21:23,754] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-30 20:21:23,754] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
Using /home/datagroup/.cache/torch_extensions/py310_cpu as PyTorch extensions root...
Emitting ninja build file /home/datagroup/.cache/torch_extensions/py310_cpu/deepspeed_shm_comm/build.ninja...
Building extension module deepspeed_shm_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_shm_comm...
Time to load deepspeed_shm_comm op: 0.09181809425354004 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_shm_comm_op built successfully
/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
Using /home/datagroup/.cache/torch_extensions/py310_cpu as PyTorch extensions root...
Emitting ninja build file /home/datagroup/.cache/torch_extensions/py310_cpu/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 2.5335588455200195 seconds
[2024-08-30 20:21:35,513] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.5, git-hash=f447b18, git-branch=HEAD
[2024-08-30 20:21:35,514] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 398, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 333, in main
[rank0]:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
[rank0]:   File "/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank0]:     engine = DeepSpeedEngine(args=args,
[rank0]:   File "/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 242, in __init__
[rank0]:     self._do_sanity_check()
[rank0]:   File "/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1072, in _do_sanity_check
[rank0]:     raise ValueError("Type fp16 is not supported.")
[rank0]: ValueError: Type fp16 is not supported.
[2024-08-30 20:21:37,558] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3047354
[2024-08-30 20:21:37,559] [ERROR] [launch.py:325:sigkill_handler] ['/mnt/ssd/datagroup/anaconda3/envs/tzw_py310/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b', '--deepspeed', '--output_dir', '/mnt/ssd/datagroup/tzw/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
```

zhiwentian avatar Aug 30 '24 12:08 zhiwentian

@zhiwentian - are you still hitting this issue? It looks like you're hitting this check, which is weird since fp16 should be supported on your accelerator - could you check that?

https://github.com/microsoft/DeepSpeed/blob/c7f58c899f6f099a35d968bdad973f24b842c8c6/deepspeed/runtime/engine.py#L1079
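One clue in the log above is `Setting ds_accelerator to cpu (auto detect)`: PyTorch did not see the A100, so DeepSpeed selected its CPU backend, and that backend's sanity check is what rejects fp16. A minimal triage sketch to find which layer lost sight of the GPU (only assumes a POSIX shell; the `python -c` step is only meaningful if torch is installed in the active environment):

```shell
# Step 1: is the NVIDIA driver visible at all on this machine?
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "driver layer: nvidia-smi found"
else
  echo "driver layer: nvidia-smi not on PATH"
fi

# Step 2: does the torch build in this env see CUDA? A CPU-only wheel or a
# CUDA/driver mismatch here makes DeepSpeed fall back to the CPU backend.
python -c 'import torch; print("torch sees CUDA:", torch.cuda.is_available())' \
  2>/dev/null || echo "torch layer: torch not importable in this shell"
```

If step 1 succeeds but step 2 prints False, the usual suspects are a `+cpu` torch wheel or a driver/CUDA version mismatch with the installed torch build.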

loadams avatar Oct 31 '24 15:10 loadams

Hi @zhiwentian - closing this for now. If you are still hitting this, or someone hits it in the future, please comment and we can debug more/re-open.

loadams avatar Nov 06 '24 18:11 loadams