DeepSpeed
[BUG] [ERROR] [autotuner.py:699:model_info_profile_run] The model is not runnable with DeepSpeed with error = (
auto.json:
{
"train_micro_batch_size_per_gpu": "auto",
"fp16": {
"enabled": true
},
"autotuning": {
"enabled": true,
"fast": false,
"overwrite": true
}
}
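For context, the minimal auto.json above can also carry a few more autotuning knobs. The extra keys below follow the DeepSpeed autotuning documentation, but treat the exact names and values as assumptions that may differ across versions; this is a sketch, not a verified config:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "fp16": {
    "enabled": true
  },
  "autotuning": {
    "enabled": true,
    "fast": false,
    "overwrite": true,
    "metric": "throughput",
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps"
  }
}
```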
To Reproduce, run:
deepspeed --autotuning run \
/workspaces/hf/script/run_classification.py \
--model_name_or_path ckip-joint/bloom-1b1-zh \
--do_train \
--do_eval \
--output_dir /workspaces/hf/bloom \
--train_file /workspaces/hf/data/train.csv \
--validation_file /workspaces/hf/data/test.csv \
--text_column_names sentence \
--label_column_name label \
--overwrite_output_dir \
--fp16 \
--torch_compile \
--deepspeed /workspaces/hf/cfg/auto.json
result:
[2023-12-02 13:51:46,927] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-02 13:51:47,930] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-02 13:51:47,930] [INFO] [autotuner.py:71:__init__] Created autotuning experiments directory: autotuning_exps
[2023-12-02 13:51:47,931] [INFO] [autotuner.py:84:__init__] Created autotuning results directory: autotuning_exps
[2023-12-02 13:51:47,931] [INFO] [autotuner.py:200:_get_resource_manager] active_resources = OrderedDict([('localhost', [0])])
[2023-12-02 13:51:47,931] [INFO] [runner.py:362:run_autotuning] [Start] Running autotuning
[2023-12-02 13:51:47,931] [INFO] [autotuner.py:669:model_info_profile_run] Starting model info profile run.
0%| | 0/1 [00:00<?, ?it/s][2023-12-02 13:51:47,933] [INFO] [scheduler.py:344:run_experiment] Scheduler wrote ds_config to autotuning_results/profile_model_info/ds_config.json, /workspaces/hf/autotuning_results/profile_model_info/ds_config.json
[2023-12-02 13:51:47,934] [INFO] [scheduler.py:351:run_experiment] Scheduler wrote exp to autotuning_results/profile_model_info/exp.json, /workspaces/hf/autotuning_results/profile_model_info/exp.json
[2023-12-02 13:51:47,934] [INFO] [scheduler.py:378:run_experiment] Launching exp_id = 0, exp_name = profile_model_info, with resource = localhost:0, and ds_config = /workspaces/hf/autotuning_results/profile_model_info/ds_config.json
localhost: ssh: connect to host localhost port 22: Cannot assign requested address
pdsh@dd68ccaa0e3d: localhost: ssh exited with exit code 255
[2023-12-02 13:52:03,391] [INFO] [scheduler.py:430:clean_up] Done cleaning up exp_id = 0 on the following workers: localhost
[2023-12-02 13:52:03,391] [INFO] [scheduler.py:393:run_experiment] Done running exp_id = 0, exp_name = profile_model_info, with resource = localhost:0
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:25<00:00, 25.01s/it]
[2023-12-02 13:52:12,946] [ERROR] [autotuner.py:699:model_info_profile_run] The model is not runnable with DeepSpeed with error = (
[2023-12-02 13:52:12,946] [INFO] [runner.py:367:run_autotuning] [End] Running autotuning
[2023-12-02 13:52:12,946] [INFO] [autotuner.py:1110:run_after_tuning] No optimal DeepSpeed configuration found by autotuning.
ds_report output
[2023-12-02 13:57:38,018] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 2.1.0+cu118
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.12.3, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 15.59 GB
System info (please complete the following information):
- docker image: huggingface/transformers-pytorch-deepspeed-latest-gpu
- host: Ubuntu 22.04, RTX 4060 Ti 16 GB
- Python 3.8.10
I got the same [NO] entries as you did. Have you figured it out? Thanks.
Me too, any solution here?
In the end, I changed the Linux version and switched the mirror. DeepSpeed still isn't completely OK, but the program can run.
Changing the Linux version ... that doesn't sound easy to do. Maybe I'll have to dive into the source code to fix this error.