
[BUG] `max_in_cpu` seems to be ignored?

Open andre-bauer opened this issue 1 year ago • 5 comments

Describe the bug
I evaluate OPT-66B with ZeRO-3 and offload parameters to NVMe, which works fine, but I also increased `max_in_cpu` to 100G,

which is printed as:

DeepSpeedZeroOffloadParamConfig(device='nvme', nvme_path=PosixPath('/tmp'), buffer_count=5, buffer_size=100000000, max_in_cpu=100000000000, pin_memory=True)

Since `max_in_cpu` is a count of parameter elements, in float16 I would expect up to 200 GB of host memory to be used, but I only see ~16 GB of my available 300 GB in use. When I set the device to "cpu" instead of "nvme" I get the same behavior for models that exceed 300 GB, like BLOOM-176B. Am I missing something? How can nvme and cpu offloading be used together properly?
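
A quick sanity check of that expectation (just arithmetic on the values from my config, assuming `max_in_cpu` counts parameter elements rather than bytes):

max_in_cpu_elements = 1e11      # value from ds_config below
bytes_per_element = 2           # float16
expected_host_gb = max_in_cpu_elements * bytes_per_element / 1e9
print(f"expected up to ~{expected_host_gb:.0f} GB of host memory")  # ~200 GB, far above the ~16 GB observed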

ds_config

{
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/tmp",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e11
        }
    },
    "load_from_fp32_weights":false,
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "fp16": {
        "enabled": true
    }
}
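
To rule out a typo in the JSON, the offload_param section can also be constructed directly and printed (a minimal sketch; the import path is an assumption based on the class name shown in the repr quoted above):

# assumed import path for the config class printed above
from deepspeed.runtime.zero.config import DeepSpeedZeroOffloadParamConfig

cfg = DeepSpeedZeroOffloadParamConfig(
    device="nvme",
    nvme_path="/tmp",
    pin_memory=True,
    buffer_count=5,
    buffer_size=int(1e8),
    max_in_cpu=int(1e11),
)
print(cfg)  # should match the DeepSpeedZeroOffloadParamConfig repr quoted above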

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
spatial_inference ...... [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  On Ampere and higher architectures please use CUDA 11+
transformer_inference .. [NO] ....... [NO]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/pyenv-root/versions/3.9.16/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/opt/pyenv-root/versions/3.9.16/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.10.1, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 10.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 50.00 GB

andre-bauer · Aug 25 '23, 15:08