[BUG] RuntimeError: Ninja is required to load C++ extensions
Hi,
I am getting the following error when running pretrain_gpt.sh
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch']
torch version .................... 1.8.2+cu111
torch cuda version ............... 11.1
nvcc version ..................... 11.1
deepspeed install path ........... ['/qfs/people/shar703/scripts/mega_ai/deepspeed_megatron/DeepSpeed/deepspeed']
deepspeed info ................... 0.5.9+1d295ff, 1d295ff, master
deepspeed wheel compiled w. ...... torch 1.8, cuda 11.1
**** Git info for Megatron: git_hash=1ac4a44 git_branch=main ****
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
checkpoint_activations .......................... True
checkpoint_in_cpu ............................... False
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_train_tokens ........................... 0
consumed_valid_samples .......................... 0
contigious_checkpointing ........................ False
cpu_optimizer ................................... False
cpu_torch_adam .................................. False
curriculum_learning ............................. False
data_impl ....................................... infer
data_parallel_size .............................. 1
data_path ....................................... ['cord19/chemistry_cord19_abstract_document']
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
deepscale ....................................... False
deepscale_config ................................ None
deepspeed ....................................... False
deepspeed_activation_checkpointing .............. False
deepspeed_config ................................ None
deepspeed_mpi ................................... False
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
embedding_path .................................. None
encoder_seq_length .............................. 1024
eod_mask_loss ................................... False
eval_interval ................................... 100
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
ffn_hidden_size ................................. 4096
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 8
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_dim ......................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
kv_channels ..................................... 64
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ checkpoints/gpt2_345m
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_decay_tokens ................................. None
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
memory_centric_tiled_linear ..................... False
merge_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-merges.txt
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_lr_scheduler ........................... False
params_dtype .................................... torch.float16
partition_activations ........................... False
patch_dim ....................................... 16
pipeline_model_parallel_size .................... 1
profile_backward ................................ False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
remote_device ................................... none
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ checkpoints/gpt2_345m
save_interval ................................... 500
scatter_gather_tensors_in_pipeline .............. True
scattered_embeddings ............................ False
seed ............................................ 1234
seq_length ...................................... 1024
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 969, 30, 1
split_transformers .............................. False
synchronize_each_layer .......................... False
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
tile_factor ..................................... 1
titles_data_path ................................ None
tokenizer_type .................................. GPT2BPETokenizer
train_iters ..................................... 500000
train_samples ................................... None
train_tokens .................................... None
use_checkpoint_lr_scheduler ..................... False
use_contiguous_buffers_in_ddp ................... False
use_cpu_initialization .......................... None
use_one_sent_docs ............................... False
use_pin_memory .................................. False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_file ...................................... ../deepspeed_megatron/gpt_files/gpt2-vocab.json
weight_decay .................................... 0.01
world_size ...................................... 1
zero_allgather_bucket_size ...................... 0.0
zero_contigious_gradients ....................... False
zero_reduce_bucket_size ......................... 0.0
zero_reduce_scatter ............................. False
zero_stage ...................................... 1.0
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2
building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
compiling dataset index builder ...
make: Entering directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for `default'.
make: Leaving directory `/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/data'
done with dataset index builder. Compilation time: 0.051 seconds
compiling and loading fused kernels ...
Traceback (most recent call last):
  File "/people/shar703/anaconda3/envs/deepspeed/bin/ninja", line 33, in <module>
    sys.exit(load_entry_point('ninja', 'console_scripts', 'ninja')())
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/__init__.py", line 51, in ninja
    raise SystemExit(_program('ninja', sys.argv[1:]))
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/__init__.py", line 47, in _program
    return subprocess.call([os.path.join(BIN_DIR, name)] + args)
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: '/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/ninja-1.10.2.3-py3.8-linux-x86_64.egg/ninja/data/bin/ninja'
Traceback (most recent call last):
  File "pretrain_gpt.py", line 231, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/training.py", line 96, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/initialize.py", line 137, in _compile_dependencies
    fused_kernels.load(args)
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 71, in load
    scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
  File "/qfs/people/shar703/scripts/mega_ai/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 47, in _cpp_extention_load_helper
    return cpp_extension.load(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/people/shar703/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
Do you have ninja installed? The command from pytorch that is raising this RuntimeError is attempting to run ninja --version. Does this command work for you?
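A quick way to answer that from inside the same Python environment is sketched below. It only approximates the check PyTorch performs, and the helper name `probe_command` is made up for this example. It resolves the command on the current PATH, runs it with --version, and reports the exit code, since a missing executable, a permission error, or a non-zero exit code could each plausibly make the check fail.

```python
import shutil
import subprocess


def probe_command(cmd="ninja"):
    """Return (resolved_path, exit_code, output) for `<cmd> --version`.

    resolved_path is None when the command is not found on PATH;
    exit_code is None when the command could not be started at all.
    """
    path = shutil.which(cmd)
    if path is None:
        return None, None, ""
    try:
        proc = subprocess.run([path, "--version"], capture_output=True, text=True)
    except OSError as exc:  # e.g. PermissionError when the file cannot be executed
        return path, None, str(exc)
    return path, proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    print(probe_command("ninja"))
```

Running this with the same interpreter and launcher that start the training job (rather than from an interactive shell) shows which ninja, if any, that process actually sees.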
@jeffra Hi, I hit the same problem when running on two machines in parallel, but not on a single machine. Do you have any tips for me?
Hey @jeffra, when I ran ninja --version, I got a permission error. The workaround I found was to chmod 777 the folder it was trying to access, and then it worked. I was wondering if there is any other way.
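If the root cause is a missing execute bit on the ninja binary named in the PermissionError, a narrower change than chmod 777 may be enough. This is a minimal sketch, assuming you own the file and the filesystem allows execution; `make_user_executable` is a hypothetical helper, not part of DeepSpeed or PyTorch.

```python
import os
import stat


def make_user_executable(path):
    """Add only the owner-execute bit to `path`, leaving other bits unchanged.

    Returns True if the current user can execute the file afterwards.
    """
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
    return os.access(path, os.X_OK)
```

Call it with the path reported in your own PermissionError message.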
I can run ninja --version, but still get this error..
I had the same problem when I ran deepspeed with tmux/screen
deepspeed doesn't seem to load the anaconda environment variable correctly in the case of multiple nodes. For example, my ninja path is /home/xxx/anaconda3/envs/NLP/bin/ninja, but deepspeed does not add this path to the PATH environment variable.
A temporary solution is to manually add the path of ninja to the PATH environment variable in the torch/utils/cpp_extension.py file
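A less invasive variant of the same idea, offered as a sketch rather than a confirmed fix, is to prepend the directory of the running Python interpreter to PATH at the top of the training script, before any extension is JIT-compiled. It assumes ninja is installed next to the interpreter (as in .../envs/NLP/bin/ninja); `prepend_interpreter_bin_to_path` is a name invented for this example.

```python
import os
import sys


def prepend_interpreter_bin_to_path(environ=os.environ):
    """Put the directory containing the current Python executable first on PATH.

    Does nothing if that directory is already listed. Returns the resulting PATH.
    """
    bin_dir = os.path.dirname(sys.executable)
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in parts:
        environ["PATH"] = os.pathsep.join([bin_dir] + parts)
    return environ.get("PATH", "")


if __name__ == "__main__":
    # Assumption: this runs before torch tries to build any C++ extension.
    print(prepend_interpreter_bin_to_path())
```

Because it changes only the environment of the current process, nothing inside the installed torch package has to be edited.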
@chinoll I have a similar problem. Where exactly do you add the .../bin/ninja path in the torch/utils/cpp_extension.py file?

That worked, thanks a lot @chinoll
Just my conjecture for my scenario: DeepSpeed seems to be using cached torch extensions that point to files in an old conda environment I no longer have access to. Deleting the cache with rm -rf /home/ubuntu/.cache/torch_extensions/py310_cu116/ forced DS to rebuild the extensions, and it works again.
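The same cleanup can be scripted. The sketch below assumes the cache lives where the TORCH_EXTENSIONS_DIR environment variable points, falling back to ~/.cache/torch_extensions as in the path above. Both function names are hypothetical, and the default location may differ on other platforms or PyTorch versions.

```python
import os
import shutil


def torch_extensions_root(environ=os.environ):
    """Guess the JIT extension cache directory (assumed layout, see lead-in)."""
    override = environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return override
    return os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions")


def clear_cached_extensions(root):
    """Delete the cache directory so extensions are rebuilt on the next run.

    Returns True if something was removed, False if the directory was absent.
    """
    if os.path.isdir(root):
        shutil.rmtree(root)
        return True
    return False


if __name__ == "__main__":
    root = torch_extensions_root()
    print(root, "removed" if clear_cached_extensions(root) else "not present")
```

Removing the directory only costs a rebuild, so the next run will spend extra time recompiling the extensions.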
deepspeed doesn't seem to load the anaconda environment variable correctly in the case of multiple nodes. For example, my ninja path is /home/xxx/anaconda3/envs/NLP/bin/ninja, but deepspeed does not add this path to the PATH environment variable.
This hypothesis makes sense to me. In my case, I'm using a conda environment, and directly calling the deepspeed binary from that conda env. I guess that way the path isn't set properly. My fix is to use these lines:
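# ENV_PATH is the root of the conda environment; its executables (deepspeed, ninja) live in bin/
ENV_PATH=/path/to/env
export PATH="${ENV_PATH}/bin:$PATH"
"${ENV_PATH}/bin/deepspeed" your_script_here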
This should be less intrusive imo than modifying torch/utils/cpp_extension.py.
I can run ninja --version, but still get this error..
Note that some versions of ninja have a bug: they print the version correctly but return exit code 245, which causes an exception when ninja is detected. Check with
ninja --version
echo $?
If the exit code is non-zero, try installing a different version of ninja.
@chinoll I have a similar problem. Where exactly do you add the .../bin/ninja path in the torch/utils/cpp_extension.py file?
it works, thanks a lot
I can run ninja --version, but still get this error..
I finally fixed my error by running pip install ninja outside of my virtual environment!
@chinoll I have a similar problem. Where exactly do you add the .../bin/ninja path in the torch/utils/cpp_extension.py file?
it works for me, thx