
RuntimeError: Error building extension 'cpu_adam'; AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

Open mycelium-networks opened this issue 2 years ago • 0 comments

I can't figure out how to fix this error. I am trying to run the example from example_run.txt here: https://github.com/mallorbc/Finetune_GPTNEO_GPTJ6B/blob/main/finetuning_repo/example_run.txt

When I run it, the build of the cpu_adam extension fails with this error:

`RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f37056321f0>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 97, in __del__`
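The "Exception ignored in" message in the snippet above looks like a follow-on symptom rather than a separate bug: if a constructor raises before it assigns an attribute, Python still runs the destructor on the half-built object, and the destructor then fails on the missing attribute. The sketch below reproduces that pattern with hypothetical stand-ins (`load_extension`, `Unguarded`, `Guarded` are invented names for illustration, not DeepSpeed code).

```python
def load_extension(build_ok):
    """Hypothetical stand-in for loading a compiled extension."""
    if not build_ok:
        raise RuntimeError("Error building extension 'cpu_adam'")
    return ["handle"]


class Unguarded:
    def __init__(self, build_ok=True):
        # If load_extension raises, self.ext is never assigned.
        self.ext = load_extension(build_ok)

    def __del__(self):
        # Still runs on the half-built object; raises AttributeError there,
        # which Python reports on stderr as "Exception ignored in ...".
        self.ext.clear()


class Guarded:
    def __init__(self, build_ok=True):
        self.ext = load_extension(build_ok)

    def __del__(self):
        # Tolerates a constructor that failed before self.ext existed.
        ext = getattr(self, "ext", None)
        if ext is not None:
            ext.clear()
```

In both classes the caller sees only the `RuntimeError`; the `AttributeError` from `Unguarded.__del__` is printed and discarded, so the build failure is the error worth chasing.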

Here is the full traceback:

`Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /usr/lib/python3/dist-packages/torch/include -isystem /usr/lib/python3/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/lib/python3/dist-packages/torch/include/TH -isystem /usr/lib/python3/dist-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256 -c /home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1013" -I/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /usr/lib/python3/dist-packages/torch/include -isystem /usr/lib/python3/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/lib/python3/dist-packages/torch/include/TH -isystem /usr/lib/python3/dist-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256 -c /home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /usr/lib/python3/dist-packages/torch/include/torch/csrc/api/include/torch/python.h:12,
  from /usr/lib/python3/dist-packages/torch/include/torch/extension.h:6,
  from /home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:5:
/usr/lib/python3/dist-packages/torch/include/torch/csrc/utils/pybind.h:7:10: fatal error: pybind11/pybind11.h: No such file or directory
  7 | #include <pybind11/pybind11.h>
    |          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_clm.py", line 485, in <module>
    main()
  File "run_clm.py", line 448, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1165, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py", line 426, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 120, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1098, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1186, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 83, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 463, in load
    return self.jit_load(verbose)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 505, in jit_load
    op_module = load(
  File "/usr/lib/python3/dist-packages/torch/utils/cpp_extension.py", line 1144, in load
    return _jit_compile(
  File "/usr/lib/python3/dist-packages/torch/utils/cpp_extension.py", line 1357, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/lib/python3/dist-packages/torch/utils/cpp_extension.py", line 1469, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/lib/python3/dist-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f33ab3681f0>
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 97, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
[2022-06-09 21:12:00,293] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 61288
[2022-06-09 21:12:00,293] [ERROR] [launch.py:184:sigkill_handler] ['/usr/bin/python3', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config_gptj6b.json', '--model_name_or_path', 'EleutherAI/gpt-j-6B', '--train_file', 'train.csv', '--validation_file', 'validation.csv', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--num_train_epochs', '12', '--eval_steps', '1', '--gradient_accumulation_steps', '32', '--per_device_train_batch_size', '1', '--use_fast_tokenizer', 'False', '--learning_rate', '5e-06', '--warmup_steps', '10', '--save_total_limit', '20', '--save_steps', '2', '--save_strategy', 'steps', '--tokenizer_name', 'gpt2'] exits with return code = 1
ubuntu@104-171-200-151:~/ai-storage/Finetune_GPTNEO_GPTJ6B/fin`

mycelium-networks, Jun 09 '22 21:06
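The first failure in the build log is `fatal error: pybind11/pybind11.h: No such file or directory`, which suggests that none of the include directories on the `c++` command line contains the pybind11 headers. A minimal sketch for checking that, assuming the failing command is pasted in as a string: the helper names `include_dirs` and `find_header` are hypothetical, and only the standard library is used.

```python
import os
import shlex


def include_dirs(compile_command):
    """Collect directories passed as -I<dir>, -I <dir>, or -isystem <dir>."""
    tokens = shlex.split(compile_command)
    dirs = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-I", "-isystem") and i + 1 < len(tokens):
            # Flag and directory are separate tokens.
            dirs.append(tokens[i + 1])
            i += 2
            continue
        if tok.startswith("-I") and len(tok) > 2:
            # Directory is glued to the flag, e.g. -I/usr/include.
            dirs.append(tok[2:])
        i += 1
    return dirs


def find_header(dirs, header="pybind11/pybind11.h"):
    """Return the directories in `dirs` that actually contain `header`."""
    return [d for d in dirs if os.path.isfile(os.path.join(d, header))]
```

Calling `find_header(include_dirs(cmd))` with the `c++ ...` line from the log as `cmd` returns an empty list when the header is missing from every search path, which would match the compiler error. This only covers the explicit `-I` and `-isystem` paths, not the compiler's built-in defaults. If the list is empty, one thing to check is whether pybind11's headers are installed anywhere on the machine; that is an untested guess about this setup, not a confirmed fix.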