[BUG] RuntimeError: Ninja is required to load C++ extensions

Open ShivamSharma2705 opened this issue 3 years ago • 15 comments

Hi,

I am getting the following error when running pretrain_gpt.sh


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]

DeepSpeed general environment info: torch install path ............... ['/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch'] torch version .................... 1.8.2+cu111 torch cuda version ............... 11.1 nvcc version ..................... 11.1 deepspeed install path ........... ['/qfs/people/shar703/scripts/mega_ai/deepspeed_megatron/DeepSpeed/deepspeed'] deepspeed info ................... 0.5.9+1d295ff, 1d295ff, master deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1 **** Git info for Megatron: git_hash=1ac4a44 git_branch=main **** using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 using torch.float16 for parameters ... ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. False adam_beta1 ...................................... 0.9 adam_beta2 ...................................... 0.999 adam_eps ........................................ 1e-08 adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 apply_query_key_layer_scaling ................... True apply_residual_connection_post_layernorm ........ False attention_dropout ............................... 0.1 attention_softmax_in_fp32 ....................... False bert_binary_head ................................ True bert_load ....................................... None bf16 ............................................ False bias_dropout_fusion ............................. True bias_gelu_fusion ................................ True biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None checkpoint_activations .......................... True checkpoint_in_cpu ............................... False checkpoint_num_layers ........................... 1 clip_grad ....................................... 
1.0 consumed_train_samples .......................... 0 consumed_train_tokens ........................... 0 consumed_valid_samples .......................... 0 contigious_checkpointing ........................ False cpu_optimizer ................................... False cpu_torch_adam .................................. False curriculum_learning ............................. False data_impl ....................................... infer data_parallel_size .............................. 1 data_path ....................................... ['cord19/chemistry_cord19_abstract_document'] dataloader_type ................................. single DDP_impl ........................................ local decoder_seq_length .............................. None deepscale ....................................... False deepscale_config ................................ None deepspeed ....................................... False deepspeed_activation_checkpointing .............. False deepspeed_config ................................ None deepspeed_mpi ................................... False distribute_checkpointed_activations ............. False distributed_backend ............................. nccl embedding_path .................................. None encoder_seq_length .............................. 1024 eod_mask_loss ................................... False eval_interval ................................... 100 eval_iters ...................................... 10 evidence_data_path .............................. None exit_duration_in_mins ........................... None exit_interval ................................... None ffn_hidden_size ................................. 4096 finetune ........................................ False fp16 ............................................ True fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False global_batch_size ............................... 
8 hidden_dropout .................................. 0.1 hidden_size ..................................... 1024 hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_dim ......................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 init_method_std ................................. 0.02 init_method_xavier_uniform ...................... False initial_loss_scale .............................. 4294967296 kv_channels ..................................... 64 layernorm_epsilon ............................... 1e-05 lazy_mpu_init ................................... None load ............................................ checkpoints/gpt2_345m local_rank ...................................... None log_batch_size_to_tensorboard ................... False log_interval .................................... 10 log_learning_rate_to_tensorboard ................ True log_loss_scale_to_tensorboard ................... True log_num_zeros_in_grad ........................... False log_params_norm ................................. False log_timers_to_tensorboard ....................... False log_validation_ppl_to_tensorboard ............... False loss_scale ...................................... None loss_scale_window ............................... 1000 lr .............................................. 0.00015 lr_decay_iters .................................. 320000 lr_decay_samples ................................ None lr_decay_style .................................. cosine lr_decay_tokens ................................. None lr_warmup_fraction .............................. 0.01 lr_warmup_iters ................................. 0 lr_warmup_samples ............................... 0 make_vocab_size_divisible_by .................... 128 mask_prob ....................................... 
0.15 masked_softmax_fusion ........................... True max_position_embeddings ......................... 1024 memory_centric_tiled_linear ..................... False merge_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-merges.txt micro_batch_size ................................ 4 min_loss_scale .................................. 1.0 min_lr .......................................... 0.0 mmap_warmup ..................................... False no_load_optim ................................... None no_load_rng ..................................... None no_save_optim ................................... None no_save_rng ..................................... None num_attention_heads ............................. 16 num_channels .................................... 3 num_classes ..................................... 1000 num_layers ...................................... 24 num_layers_per_virtual_pipeline_stage ........... None num_workers ..................................... 2 onnx_safe ....................................... None openai_gelu ..................................... False optimizer ....................................... adam override_lr_scheduler ........................... False params_dtype .................................... torch.float16 partition_activations ........................... False patch_dim ....................................... 16 pipeline_model_parallel_size .................... 1 profile_backward ................................ False query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None rank ............................................ 0 remote_device ................................... none reset_attention_mask ............................ False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... 
False retriever_seq_length ............................ 256 sample_rate ..................................... 1.0 save ............................................ checkpoints/gpt2_345m save_interval ................................... 500 scatter_gather_tensors_in_pipeline .............. True scattered_embeddings ............................ False seed ............................................ 1234 seq_length ...................................... 1024 sgd_momentum .................................... 0.9 short_seq_prob .................................. 0.1 split ........................................... 969, 30, 1 split_transformers .............................. False synchronize_each_layer .......................... False tensor_model_parallel_size ...................... 1 tensorboard_dir ................................. None tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 tile_factor ..................................... 1 titles_data_path ................................ None tokenizer_type .................................. GPT2BPETokenizer train_iters ..................................... 500000 train_samples ................................... None train_tokens .................................... None use_checkpoint_lr_scheduler ..................... False use_contiguous_buffers_in_ddp ................... False use_cpu_initialization .......................... None use_one_sent_docs ............................... False use_pin_memory .................................. False virtual_pipeline_model_parallel_size ............ None vocab_extra_ids ................................. 0 vocab_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-vocab.json weight_decay .................................... 0.01 world_size ...................................... 1 zero_allgather_bucket_size ...................... 0.0 zero_contigious_gradients ....................... 
False zero_reduce_bucket_size ......................... 0.0 zero_reduce_scatter ............................. False zero_stage ...................................... 1.0 -------------------- end of arguments --------------------- setting number of micro-batches to constant 2

building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
compiling dataset index builder ...
make: Entering directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for `default'.
make: Leaving directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'

done with dataset index builder. Compilation time: 0.051 seconds
compiling and loading fused kernels ...
Traceback (most recent call last):
  File "/people/shar703/anaconda3/envs/deepspeed/bin/ninja", line 33, in <module>
    sys.exit(load_entry_point('ninja', 'console_scripts', 'ninja')())
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/__init__.py", line 51, in ninja
    raise SystemExit(_program('ninja', sys.argv[1:]))
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/__init__.py", line 47, in _program
    return subprocess.call([os.path.join(BIN_DIR, name)] + args)
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: '/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/data/bin/ninja'
Traceback (most recent call last):
  File "pretrain_gpt.py", line 231, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/training.py", line 96, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 137, in _compile_dependencies
    fused_kernels.load(args)
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 71, in load
    scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 47, in _cpp_extention_load_helper
    return cpp_extension.load(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

ShivamSharma2705 avatar Jan 10 '22 16:01 ShivamSharma2705

Do you have ninja installed? The command from pytorch that is raising this RuntimeError is attempting to run ninja --version. Does this command work for you?
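
For anyone who wants to reproduce that check from inside the same Python process that launches training, here is a minimal sketch. It assumes the check amounts to running ninja --version and treating any failure as "not available"; the function name ninja_is_available is made up for this example and is not a PyTorch API.

```python
import subprocess


def ninja_is_available(command=("ninja", "--version")):
    """Return True if `command` runs and exits with status 0.

    Hypothetical helper for illustration; it only approximates the kind of
    check described above.
    """
    try:
        subprocess.check_output(list(command), stderr=subprocess.DEVNULL)
    except Exception:
        # Covers a binary missing from PATH, a permission error on the
        # binary, and a non-zero exit code.
        return False
    return True


if __name__ == "__main__":
    print("ninja usable from this process:", ninja_is_available())
```

Running this with the same interpreter and the same launcher (for example under the deepspeed command) matters, because PATH can differ from your interactive shell.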

jeffra avatar Jan 10 '22 18:01 jeffra

@jeffra Hi, the same problem occurs when I run on two machines in parallel, but not on a single machine. Do you have any tips?

XiaoqingNLP avatar Jan 11 '22 12:01 XiaoqingNLP

Hey @jeffra, running ninja --version gave me a permission error. The workaround I found was to chmod 777 the folder it was trying to access, and then it worked. I was wondering if there is any other way.
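
A narrower alternative to chmod 777, sketched under the assumption that the only problem is a missing execute bit on the ninja binary named in the PermissionError (and that you own that file): add just the owner-execute bit. The helper name ensure_user_executable is hypothetical. If the cause is instead a directory without search permission or a noexec mount, this will not help.

```python
import os
import stat


def ensure_user_executable(path):
    """Add the owner-execute bit to `path` if missing.

    Returns True if the mode was changed, False if it was already executable
    by the owner. Hypothetical helper for illustration.
    """
    mode = os.stat(path).st_mode
    if mode & stat.S_IXUSR:
        return False
    # Keep every existing permission bit and add only u+x.
    os.chmod(path, mode | stat.S_IXUSR)
    return True
```

You would call it with the path from the error message, i.e. the file ending in ninja/data/bin/ninja inside your environment's site-packages.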

ShivamSharma2705 avatar Jan 11 '22 15:01 ShivamSharma2705

I can run ninja --version, but still get this error..

JiyangZhang avatar Feb 03 '22 21:02 JiyangZhang

I had the same problem when I ran deepspeed with tmux/screen

chinoll avatar May 16 '22 02:05 chinoll

deepspeed doesn't seem to load the anaconda environment variable correctly in the case of multiple nodes. For example, my ninja path is /home/xxx/anaconda3/envs/NLP/bin/ninja, but deepspeed does not add this path to the PATH environment variable.

chinoll avatar May 16 '22 02:05 chinoll

A temporary solution is to manually add the path of ninja to the PATH environment variable in the torch/utils/cpp_extension.py file
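
A less invasive variant of the same idea is to fix PATH at the top of your own training script rather than inside the installed torch package. The sketch below assumes ninja was installed into the same environment as the running interpreter, so the ninja launcher sits next to the python binary; the function name prepend_interpreter_bin_to_path is made up for this example.

```python
import os
import sys


def prepend_interpreter_bin_to_path(environ=os.environ, executable=sys.executable):
    """Put the directory containing `executable` first on PATH.

    Hypothetical helper: assumes the ninja launcher lives in the same bin
    directory as the Python interpreter (typical for conda/pip installs).
    Returns the directory that was ensured to be on PATH.
    """
    # abspath (not realpath) so a symlinked interpreter still maps to the
    # environment's own bin directory.
    bin_dir = os.path.dirname(os.path.abspath(executable))
    current = environ.get("PATH", "")
    parts = current.split(os.pathsep) if current else []
    if bin_dir not in parts:
        environ["PATH"] = os.pathsep.join([bin_dir] + parts)
    return bin_dir
```

Call it before anything triggers a JIT build of a C++/CUDA extension, since child processes started afterwards inherit the updated PATH.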

chinoll avatar May 16 '22 02:05 chinoll

@chinoll I have a similar problem. Where exactly do you add the .../bin/ninja path in the torch/utils/cpp_extension.py file?

joanrod avatar May 17 '22 21:05 joanrod

@chinoll I have a similar problem. Where exactly do you add the .../bin/ninja path in the torch/utils/cpp_extension.py file?

image

chinoll avatar May 18 '22 00:05 chinoll

That worked, thanks a lot @chinoll

joanrod avatar May 18 '22 07:05 joanrod

This is just a conjecture for my scenario: it seems DeepSpeed was using cached torch extensions that pointed to files in an old conda environment I no longer have access to. I deleted the cache with rm -rf /home/ubuntu/.cache/torch_extensions/py310_cu116/, which forced DeepSpeed to rebuild the extensions, and it works again.
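
If you are not sure where that cache lives on your machine, here is a rough sketch for locating and clearing it. It assumes the cache root is taken from the TORCH_EXTENSIONS_DIR environment variable when set, and otherwise defaults to a torch_extensions folder under the user cache directory on Linux; both function names are hypothetical. Note that it removes the whole cache root, not just one py310_cu116-style subfolder.

```python
import os
import shutil


def torch_extensions_cache_root(environ=os.environ):
    """Best-guess location of the JIT extension cache (assumption, Linux)."""
    override = environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return override
    cache_home = environ.get("XDG_CACHE_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache"
    )
    return os.path.join(cache_home, "torch_extensions")


def clear_torch_extensions_cache(root=None):
    """Delete the cache directory if it exists; return True if removed."""
    root = root or torch_extensions_cache_root()
    if os.path.isdir(root):
        shutil.rmtree(root)
        return True
    return False
```

The extensions are rebuilt on the next run, so expect the first launch afterwards to take longer.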

tnq177 avatar Dec 20 '22 16:12 tnq177

deepspeed doesn't seem to load the anaconda environment variable correctly in the case of multiple nodes. For example, my ninja path is /home/xxx/anaconda3/envs/NLP/bin/ninja, but deepspeed does not add this path to the PATH environment variable.

This hypothesis makes sense to me. In my case, I'm using a conda environment, and directly calling the deepspeed binary from that conda env. I guess that way the path isn't set properly. My fix is to use these lines:

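# ENV_PATH is assumed to be the root directory of the conda environment.
ENV_PATH=/path/to/env
# Put the environment's bin directory (where ninja is installed) first on PATH.
export PATH="${ENV_PATH}/bin:$PATH"
"${ENV_PATH}/bin/deepspeed" your_script_here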

This should be less intrusive imo than modifying torch/utils/cpp_extension.py.

manestay avatar Aug 01 '23 22:08 manestay

I can run ninja --version, but still get this error..

Note that some versions of ninja have a bug: they print the version correctly but return exit code 245, which causes an exception when ninja is detected. Check with

ninja --version
echo $?

In this case, try installing another version.
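
The same check can be done from Python, which is handy when the failure only shows up under a launcher. This is a small sketch with a made-up helper name, run_and_report; it reports the exit code and the printed version separately so the "prints fine but exits non-zero" case is easy to spot.

```python
import subprocess


def run_and_report(command=("ninja", "--version")):
    """Run `command`; return (exit_code, stdout_text).

    exit_code is None when the process could not be started at all
    (for example: not on PATH, or permission denied). Hypothetical helper.
    """
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
    except OSError as exc:
        return None, str(exc)
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    code, output = run_and_report()
    print("exit code:", code)
    print("output:", output)
```

An exit code other than 0 alongside a sensible version string points at the ninja build itself rather than at PATH or permissions.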

Dixeran avatar Sep 06 '23 02:09 Dixeran

@chinoll I have a similar problem. Where exactly do you add the .../bin/ninja path in the torch/utils/cpp_extension.py file?

image

it works, thanks a lot

Songqiw avatar Jan 16 '24 09:01 Songqiw

I can run ninja --version, but still get this error..

I finally fixed my error by running pip install ninja outside of my virtual environment!

xingyouxin avatar May 17 '24 13:05 xingyouxin

@chinoll I have a similar problem. Where exactly do you add the .../bin/ninja path in the torch/utils/cpp_extension.py file?

image

it works for me, thx

ywb2018 avatar May 31 '24 07:05 ywb2018