DeepSpeed Error building extension 'cpu

Hey guys, I'm having a problem getting DeepSpeed working with XLM-Roberta. I'm trying to run it on an Amazon Linux machine, which is based on Red Hat. Here are a some versions of packages/dependencies I'm using:

cuda version: 10.2 transformers: 4.4.2 pytorch: 1.7.1 deepspeed: 0.3.13 gcc/c++/g++: (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)

I must admit I had some issues upgrading the CUDA version from the default 10.0 on the instance to 10.2 and GCC from 4.8.5 to 7.2.1 but since I don't get the error that the torch and installed CUDA versions are different and that GCC has a version lower than 5, I'd assume I'm in the clear.

Here's the essential part of the code I'm running (from a notebook):

import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

from transformers import Trainer, TrainingArguments, XLMRobertaForSequenceClassification, XLMRobertaTokenizer

model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base')

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    save_steps=500,
    save_total_limit=2,
    deepspeed="my_ds_config.json"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

Here's the content of my config file:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "optimizer": {
        "type": "Adam",
        "params": {
            "adam_w_mode": true,
            "lr": 3e-5,
            "betas": [ 0.9, 0.999 ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    }
}

Here's the output of my ds_config:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/sh: line 0: type: llvm-config: not found
/bin/sh: line 0: type: llvm-config-9: not found
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.13+22d5a1f, 22d5a1f, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.2

And finally, here's the stack trace:

[2021-03-24 15:29:36,478] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-24 15:29:36,494] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/ec2-user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-131-cc14ac05ecbb> in <module>
     30 )
     31 
---> 32 trainer.train()

~/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
    901         delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
    902         if self.args.deepspeed:
--> 903             model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
    904             self.model = model.module
    905             self.model_wrapped = model  # will get further wrapped in DDP

~/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
    416         model=model,
    417         model_parameters=model_parameters,
--> 418         config_params=config,
    419     )
    420 

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    123                                  dist_init_required=dist_init_required,
    124                                  collate_fn=collate_fn,
--> 125                                  config_params=config_params)
    126     else:
    127         assert mpu is None, "mpu must be None with pipeline parallelism"

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
    181         self.lr_scheduler = None
    182         if model_parameters or optimizer:
--> 183             self._configure_optimizer(optimizer, model_parameters)
    184             self._configure_lr_scheduler(lr_scheduler)
    185             self._report_progress(0)

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
    596                 logger.info('Using client Optimizer as basic optimizer')
    597         else:
--> 598             basic_optimizer = self._configure_basic_optimizer(model_parameters)
    599             if self.global_rank == 0:
    600                 logger.info(

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
    665                     optimizer = DeepSpeedCPUAdam(model_parameters,
    666                                                  **optimizer_parameters,
--> 667                                                  adamw_mode=effective_adam_w_mode)
    668                 else:
    669                     from deepspeed.ops.adam import FusedAdam

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
     76         DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
     77         self.adam_w_mode = adamw_mode
---> 78         self.ds_opt_adam = CPUAdamBuilder().load()
     79 
     80         self.ds_opt_adam.create_adam(self.opt_id,

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    213             return importlib.import_module(self.absolute_name())
    214         else:
--> 215             return self.jit_load(verbose)
    216 
    217     def jit_load(self, verbose=True):

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    250             extra_cuda_cflags=self.nvcc_args(),
    251             extra_ldflags=self.extra_ldflags(),
--> 252             verbose=verbose)
    253         build_duration = time.time() - start_build
    254         if verbose:

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1089     if isinstance(cuda_sources, str):
   1090         cuda_sources = [cuda_sources]
-> 1091 
   1092     cpp_sources.insert(0, '#include <torch/extension.h>')
   1093 

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1315 
   1316 
-> 1317 def verify_ninja_availability():
   1318     r'''
   1319     Raises ``RuntimeError`` if `ninja <https://ninja-build.org/>`_ build system is not

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _import_module_from_library(module_name, path, is_python_module)
   1697                       sources,
   1698                       objects,
-> 1699                       ldflags,
   1700                       library_target,
   1701                       with_cuda) -> None:

~/anaconda3/envs/python3/lib/python3.6/imp.py in find_module(name, path)
    295         break  # Break out of outer loop when breaking out of inner loop.
    296     else:
--> 297         raise ImportError(_ERR_MSG.format(name), name=name)
    298 
    299     encoding = None

ImportError: No module named 'cpu_adam'

Thanks in advance for your help!

Mar 24 '21 17:03 arthur-morgan-712

Hi @arthur-morgan-712

I am not sure what exactly is happening here, since in the trace log I am seeing it says that it is trying to load the extension module cpu-adam, however the ds-report says it is not installed! I am thinking maybe this might be a problem of the system caching this module somehow! Can you remove the /tmp/torch_extensions folder (rm -rf /tmp/torch_extentions/*) and retry this?

Thanks. Reza

Mar 24 '21 23:03 RezaYazdaniAminabadi

Hi Reza,

Thanks for your reply! I tried removing it, it did have cpu_adam as a subfolder, however the issue still persists.

Funnily enough, I tried running this on Colab and it seemed to have loaded it:

Loading extension module cpu_adam...
Time to load cpu_adam op: 33.38163924217224 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1

And Colab's ds_report also states that it hasn't been installed:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
 [WARNING]  sparse_attn requires CUDA version 10.1+, does not currently support >=11 or <10.1
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.7/dist-packages/torch']
torch version .................... 1.7.1+cu110
torch cuda version ............... 11.0
nvcc version ..................... 11.0
deepspeed install path ........... ['/usr/local/lib/python3.7/dist-packages/deepspeed']
deepspeed info ................... 0.3.13+22d5a1f, 22d5a1f, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 11.0

Albeit, I did get another error on Colab (tcmalloc: large alloc 1200709632 bytes == .... Killing subprocess 346) but I assume it's because the 11 GB video card was still rather small for this task. So maybe cpu_adam not showing up as installed isn't out of the ordinary?

Mar 25 '21 09:03 arthur-morgan-712

Quick update, I ran the command from the notebook (running run_seq2seq.py) on the Amazon Linux instance and it seems to have provided a more detailed stack trace, I'm pasting it here:

Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/TH -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
FAILED: cpu_adam.o 
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/TH -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:4:10: fatal error: omp.h: No such file or directory
 #include <omp.h>
          ^~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "examples/seq2seq/run_seq2seq.py", line 650, in <module>
    main()
  File "examples/seq2seq/run_seq2seq.py", line 590, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ec2-user/SageMaker/checkstep-research/t5/transformers/src/transformers/trainer.py", line 892, in train
    model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
  File "/home/ec2-user/SageMaker/checkstep-research/t5/transformers/src/transformers/integrations.py", line 417, in init_deepspeed
    config_params=config,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/__init__.py", line 125, in initialize
    config_params=config_params)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 183, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 598, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 667, in _configure_basic_optimizer
    adamw_mode=effective_adam_w_mode)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py", line 78, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 215, in load
    return self.jit_load(verbose)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 252, in jit_load
    verbose=verbose)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Killing subprocess 6428
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ec2-user/anaconda3/envs/python3/bin/python3.6', '-u', 'examples/seq2seq/run_seq2seq.py', '--local_rank=0', '--model_name_or_path', 'google/mt5-small', '--output_dir', 'output_dir', '--adam_eps', '1e-06', '--evaluation_strategy=steps', '--do_train', '--label_smoothing', '0.1', '--learning_rate', '3e-5', '--logging_first_step', '--logging_steps', '1000', '--max_source_length', '128', '--max_target_length', '128', '--num_train_epochs', '1', '--overwrite_output_dir', '--per_device_train_batch_size', '16', '--predict_with_generate', '--sortish_sampler', '--val_max_target_length', '128', '--warmup_steps', '500', '--max_train_samples', '2000', '--max_val_samples', '500', '--task', 'translation_en_to_ro', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_prefix', 'translate English to Romanian: ', '--deepspeed', 'ds_config.json', '--fp16']' returned non-zero exit status 1.

Mar 25 '21 10:03 arthur-morgan-712

I was able to get the example from the notebook going after I downgraded the DeepSpeed version to 0.3.10. I do have a follow-up question though: Correct me if I'm wrong, but the only way to use DeepSpeed would be to use the HuggingFace Trainer class? At least that's what I can find on HuggingFace (also trying to implement a custom script without using Trainer resulted in an unrecognized arguments: --local_rank=0 error, even though I wasn't passing that argument). If that's the case, does this mean I cannot use DeepSpeed to train a classification model passing different class weights to each class, because there's no such parameter to pass to Trainer?

Mar 25 '21 15:03 arthur-morgan-712

We already have examples for running for some transformer networks. For this argument, I think you might just add local_rank to your parser arguments the same as here.

Mar 25 '21 16:03 RezaYazdaniAminabadi

I was able to get the example from the notebook going after I downgraded the DeepSpeed version to 0.3.10.

So is this the issue with 0.3.13 of DeepSpeed? Because I'm facing the same issue as well with 0.3.13.

Also, are you able to run HuggingFace-4.4.2 with DeepSpeed-0.3.10 ? I think you should've downgraded to HuggingFace-4.3.x.

Mar 26 '21 09:03 saichandrapandraju

Hi @arthur-morgan-712 ,

Could you try building deepspeed ops while installing as suggested here?

Mar 26 '21 12:03 saichandrapandraju

We already have examples for running for some transformer networks. For this argument, I think you might just add local_rank to your parser arguments the same as here.

This is no longer needed in deepspeed since https://github.com/microsoft/DeepSpeed/pull/825 and transformers master has been adjusted accordingly. You just need to have env LOCAL_RANK to be set.

I do have a follow-up question though: Correct me if I'm wrong, but the only way to use DeepSpeed would be to use the HuggingFace Trainer class?

Not at all. You can do your own integration and not rely on the HF Trainer.

If you do use transformers Trainer for a time being while this is all new you must use the transformers master branch as frequent deepspeed-related updates are made.

If you have build problems please make sure you read: https://huggingface.co/transformers/main_classes/trainer.html#installation-notes though looking at OP I think you have all the right components. Just check that PATH/LD_LIBRARY_PATH are good.

Perhaps try to pre-build deepspeed: https://github.com/microsoft/DeepSpeed/issues/885#issuecomment-808339237

Mar 26 '21 16:03 stas00

Thanks @stas00 for clarifying this : )

Mar 26 '21 16:03 RezaYazdaniAminabadi

And the error is right there in your report: https://github.com/microsoft/DeepSpeed/issues/889#issuecomment-806526657

c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/TH -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda-10.2/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256 -c /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:4:10:

fatal error: omp.h: No such file or directory #include <omp.h>

You're missing the right build tools. omp.h is missing. It should be in your gcc dev package. e.g. on my machine it's under:

/usr/lib/gcc/x86_64-linux-gnu/6/include/omp.h
/usr/lib/gcc/x86_64-linux-gnu/7/include/omp.h
/usr/lib/gcc/x86_64-linux-gnu/9/include/omp.h

@RezaYazdaniAminabadi, perhaps ds_report could somehow check that all the build components are there? e.g. it could attempt to build some dummy very basic extension when run? not sure on the details. Clearly here gcc-dev tools are either missing or misconfigured.

Mar 26 '21 16:03 stas00

Still running into: ImportError: No module named 'cpu_adam'

CUDA 10.2 deepspeed 0.4.3 pytorch 1.8.1

Tried down grading deepspeed to 0.3.10 and ran into: "deepspeed>=0.4.3 is required for a normal functioning of this module, but found deepspeed==0.3.10."

Any other potential solutions to date or still open??

Jul 15 '21 03:07 Alx-AI

I also occur that. before AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam' The error show: cannot make a dir in /tmp/torch_extensions/build for cpu_adam. So I change the DEFAULT_TORCH_EXTENSION_PATH in the file /anaconda3/envs/XXXXX/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py then it works

Nov 17 '21 04:11 ziweiji

This sounds like a permission issue. Try to set TMPDIR to another dir that is writable by you?

e.g.:

mkdir ~/tmp
export TMPDIR=~/tmp
... do the build here ...

Nov 17 '21 04:11 stas00

Try to set TMPDIR to another dir that is writable by you?

That won't work because deepspeed hardcodes the default extension path to be /tmp/torch_extensions

The default is not used if TORCH_EXTENSIONS_DIR is set in the environment, but it would certainly be an improvement for it to follow TMPDIR (especially as TORCH is not in the whitelist of environment variable prefixes which the deepspeed distributed launcher automatically plumbs through to worker processes)

Jul 01 '22 09:07 rowanworth

@stas00 @RezaYazdaniAminabadi huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:

Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) [INFO|trainer.py:414] 2023-01-09 19:06:58,180 >> Using amp fp16 backend [2023-01-09 19:06:58,187] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.7, git-hash=unknown, git-branch=unknown [2023-01-09 19:06:58,191] [WARNING] [config_utils.py:67:process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2023-01-09 19:07:05,242] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Creating extension directory /root/.cache/torch_extensions/py38_cu117/cpu_adam... huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
Avoid using tokenizers before the fork if possible
Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) [1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/root/anaconda3/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -DAVX256 -c /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o FAILED: cpu_adam.o c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/root/anaconda3/envs/bitten/lib64 -lcudart -lcublas -g -march=native -fopenmp -DAVX256 -c /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o In file included from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:11:0, from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16, from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:11, from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:1: /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/gemm_test.h:6:10: fatal error: cuda_profiler_api.h: No such file or directory #include <cuda_profiler_api.h> ^~~~~~~~~ compilation terminated. [2/3] /root/anaconda3/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -D_CUDA_NO_HALF_CONVERSIONS -D_CUDA_NO_BFLOAT16_CONVERSIONS -D_CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U_CUDA_NO_HALF_OPERATORS -U_CUDA_NO_HALF_CONVERSIONS_ -U_CUDA_NO_HALF2_OPERATORS_ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o FAILED: custom_cuda_kernel.cuda.o /root/anaconda3/envs/bitten/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/TH -isystem /root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/include/THC -isystem /root/anaconda3/envs/bitten/include -isystem /root/anaconda3/envs/bitten/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -D_CUDA_NO_HALF_CONVERSIONS -D_CUDA_NO_BFLOAT16_CONVERSIONS_ -D_CUDA_NO_HALF2_OPERATORS_ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U_CUDA_NO_HALF_OPERATORS_ -U_CUDA_NO_HALF_CONVERSIONS_ -U_CUDA_NO_HALF2_OPERATORS_ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o In file included from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/context.h:11:0, from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:16, from /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:1: /root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/gemm_test.h:6:10: fatal error: cuda_profiler_api.h: No such file or directory #include <cuda_profiler_api.h> ^~~~~~~~~ compilation terminated. ninja: build stopped: subcommand failed. Traceback (most recent call last): File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build subprocess.run( File "/root/anaconda3/envs/bitten/lib/python3.8/subprocess.py", line 512, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "run_clm.py", line 478, in main() File "run_clm.py", line 441, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/transformers/trainer.py", line 1112, in train deepspeed_engine, optimizer, lr_scheduler = deepspeed_init( File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/transformers/deepspeed.py", line 355, in deepspeed_init model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/init.py", line 125, in initialize engine = DeepSpeedEngine(args=args, File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 330, in init self._configure_optimizer(optimizer, model_parameters) File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1195, in _configure_optimizer basic_optimizer = self._configure_basic_optimizer(model_parameters) File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1266, in _configure_basic_optimizer optimizer = DeepSpeedCPUAdam(model_parameters, File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in init self.ds_opt_adam = CPUAdamBuilder().load() File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 460, in load return self.jit_load(verbose) File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 495, in jit_load op_module = load( File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1284, in load return _jit_compile( File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile _write_ninja_file_and_build_library( File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library _run_ninja_build( File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build raise RuntimeError(message) from e RuntimeError: Error building extension 'cpu_adam' Exception ignored in: <function DeepSpeedCPUAdam.del at 0x7efc99bbd1f0> Traceback (most recent call last): File "/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/adam/cpu_adam.py", line 108, in del AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam' [2023-01-09 19:07:12,824] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 7728 [2023-01-09 19:07:12,826] [ERROR] [launch.py:324:sigkill_handler] AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

Jan 09 '23 20:01 arain60gb

@arthur-morgan-712, you have a problem with your cuda environment:

/root/anaconda3/envs/bitten/lib/python3.8/site-packages/deepspeed/ops/csrc/includes/gemm_test.h:6:10:
 fatal error: cuda_profiler_api.h: No such file or directory
#include <cuda_profiler_api.h>

properly install the cuda environment, including all dev header files and do the run again and it'll work.

on ubuntu I usually recommend the nvidia cuda pre-packaged .deb files, but be careful the latest cuda is already in 12.x so you make sure you're installing the same major version as pytorch - most likely cuda-11.x - I think 11.8 is the latest in that line.

e.g. on my system the missing file is here:

/usr/local/cuda-11.8/targets/x86_64-linux/include/cuda_profiler_api.h

Jan 09 '23 22:01 stas00

In my case, this BUG is due to ninja compile error, you can change directory to ~/.cache/torch_extensions/cpu_adam, then run ninja -v to see the error details.

Apr 14 '23 10:04 Chiang97912

Closing this issue as the original issue was resolved. If anyone is having issues with this, please open a new issue and link this one and we would be happy to take a look.

Aug 18 '23 20:08 loadams

DeepSpeed DeepSpeed copied to clipboard

Error building extension 'cpu_adam'

DeepSpeed
DeepSpeed copied to clipboard