
DeepSpeed initialization with GNN-like model

Open buttercutter opened this issue 2 years ago • 20 comments

My code is quite similar to a GNN structure: NN_output = graph.forward(NN_input, types="f")

So outputs = model_engine(inputs) does not seem to really fit my case? The args handling also does not follow that code style.

Any ideas?
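
For reference, a rough sketch of the call I have in mind (args_, graph_, NN_input, and trainset are variables from my own script, not from DeepSpeed):

import deepspeed

# graph_ is my own nn.Module whose forward takes extra keyword arguments,
# i.e. NN_output = graph_.forward(NN_input, types="f")
model_engine_, optimizer, trainloader, _ = deepspeed.initialize(
    args=args_,
    model=graph_,
    model_parameters=graph_.parameters(),
    training_data=trainset)

# Since the engine's forward() is supposed to pass *args/**kwargs through to the
# wrapped module, is this the intended way to call it for a GNN-like forward?
NN_output = model_engine_(NN_input, types="f")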

buttercutter avatar Jun 18 '22 19:06 buttercutter

I made some code modifications; however, I still could not initialize DeepSpeed properly.

/home/phung/PycharmProjects/venv/py39/bin/python /home/phung/PycharmProjects/beginner_tutorial/gdas.py
Files already downloaded and verified
Files already downloaded and verified
[2022-07-13 17:00:25,770] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 17:00:25,782] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2022-07-13 17:00:27,782] [INFO] [distributed.py:85:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=archlinux, master_port=29500
[2022-07-13 17:00:27,782] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
  File "/home/phung/PycharmProjects/beginner_tutorial/gdas.py", line 936, in <module>
    model_engine_, optimizer, trainloader, __ = deepspeed.initialize(args=args_, model=graph_,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 120, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 238, in __init__
    self._do_args_sanity_check(args)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 900, in _do_args_sanity_check
    assert (
AssertionError: DeepSpeed requires --deepspeed_config to specify configuration file

Process finished with exit code 1


buttercutter avatar Jul 13 '22 09:07 buttercutter

@buttercutter, you are missing a deepspeed config file on the command line, passed via --deepspeed_config.

Alternatively, you can pass a dict as config_params to deepspeed.initialize()
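
For example, a minimal sketch using the variable names visible in your traceback (the learning rate and batch size are placeholders, and it assumes graph_ is an nn.Module):

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-3}  # placeholder learning rate
    }
}

# config_params accepts either a path to a JSON file or a plain dict like this one
model_engine_, optimizer, trainloader, _ = deepspeed.initialize(
    args=args_,
    model=graph_,
    model_parameters=graph_.parameters(),
    training_data=trainset,
    config_params=ds_config)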

tjruwase avatar Jul 13 '22 10:07 tjruwase

Do you have a recommended DeepSpeed configuration file?

Note: the DeepSpeed configuration for training a transformer-like network structure might be different from that for a GNN-like network structure.

buttercutter avatar Jul 13 '22 11:07 buttercutter

If I use the above configuration file from HuggingFace, I get the following error:

model_engine_, optimizer, trainloader, __ = deepspeed.initialize(args=args_, model=graph_, model_parameters=parameters, training_data=trainset, config_params='./ds_config.json')

/home/phung/PycharmProjects/venv/py39/bin/python /home/phung/PycharmProjects/beginner_tutorial/gdas.py
Files already downloaded and verified
Files already downloaded and verified
[2022-07-13 19:10:10,635] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 19:10:10,648] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2022-07-13 19:10:12,517] [INFO] [distributed.py:85:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=archlinux, master_port=29500
[2022-07-13 19:10:12,517] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
Traceback (most recent call last):
  File "/home/phung/PycharmProjects/beginner_tutorial/gdas.py", line 936, in <module>
    model_engine_, optimizer, trainloader, __ = deepspeed.initialize(args=args_, model=graph_,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/__init__.py", line 120, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 239, in __init__
    self._configure_with_arguments(args, mpu)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments
    self._config = DeepSpeedConfig(self.config, mpu)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 874, in __init__
    self._initialize_params(self._param_dict)
  File "/home/phung/PycharmProjects/venv/py39/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 903, in _initialize_params
    assert not (self.fp16_enabled and self.bfloat16_enabled), 'bfloat16 and fp16 modes cannot be simultaneously enabled'
AssertionError: bfloat16 and fp16 modes cannot be simultaneously enabled

Process finished with exit code 1

In addition, the IDE also complains about the following two issues.

Cannot find reference 'parse_args' in 'parser.pyi' at line 917

Expected type 'Optional[Module]', got 'filter[Parameter]' instead at line 939

buttercutter avatar Jul 13 '22 11:07 buttercutter

DeepSpeed configuration is meant to be network-agnostic, so that configuration file would in fact work, except for the auto fields, which are defined for the HuggingFace frontend. The configuration file is used to enable/disable different features of the DeepSpeed framework, rather than to specify or control network properties. You can start with a minimal configuration file that defines just the micro batch size, optimizer, and logging, like the one below:

{
    "train_micro_batch_size_per_gpu": 1,
    "steps_per_print": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": <add your learning rate>
        }
    }
}

You can progressively add more configuration knobs as you get more familiar with DeepSpeed.

tjruwase avatar Jul 13 '22 11:07 tjruwase

I get the following runtime error about conflicting batch_size values:

ValueError: Expected input batch_size (8) to match target batch_size (1).

Files already downloaded and verified
Files already downloaded and verified
[2022-07-13 13:15:18,174] [INFO] [logging.py:69:log_dist] [Rank -1] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
[2022-07-13 13:15:18,188] [INFO] [distributed.py:37:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
[2022-07-13 13:15:18,635] [INFO] [distributed.py:91:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.28.0.2, master_port=29500
[2022-07-13 13:15:18,635] [INFO] [distributed.py:49:init_distributed] Initializing torch distributed with backend: nccl
[2022-07-13 13:15:18,765] [INFO] [engine.py:279:__init__] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.1 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py37_cu113/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib/python3.7/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 31.784398078918457 seconds
[2022-07-13 13:15:51,799] [INFO] [engine.py:1102:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2022-07-13 13:15:52,015] [INFO] [engine.py:1109:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[2022-07-13 13:15:52,015] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2022-07-13 13:15:52,016] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2022-07-13 13:15:52,016] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-07-13 13:15:52,016] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.05], mom=[(0.9, 0.999)]
[2022-07-13 13:15:52,020] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   amp_enabled .................. False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   amp_params ................... False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": null, 
    "exps_dir": null, 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   bfloat16_enabled ............. False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   checkpoint_tag_validation_enabled  True
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   checkpoint_tag_validation_fail  False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   communication_data_type ...... None
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   curriculum_enabled ........... False
[2022-07-13 13:15:52,021] [INFO] [config.py:1063:print]   curriculum_params ............ False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   dataloader_drop_last ......... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   disable_allgather ............ False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   dump_state ................... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   dynamic_loss_scale_args ...... None
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_enabled ........... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_gas_boundary_resolution  1
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_layer_num ......... 0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_max_iter .......... 100
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_stability ......... 1e-06
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_tol ............... 0.01
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   eigenvalue_verbose ........... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   elasticity_enabled ........... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   fp16_enabled ................. False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   fp16_master_weights_and_gradients  False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   fp16_mixed_quantize .......... False
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   global_rank .................. 0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   gradient_accumulation_steps .. 1
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   gradient_clipping ............ 0.0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   gradient_predivide_factor .... 1.0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   initial_dynamic_scale ........ 4294967296
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   loss_scale ................... 0
[2022-07-13 13:15:52,022] [INFO] [config.py:1063:print]   memory_breakdown ............. False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   optimizer_legacy_fusion ...... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   optimizer_name ............... adamw
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   optimizer_params ............. {'lr': 0.05}
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   pld_enabled .................. False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   pld_params ................... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   prescale_gradients ........... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_change_rate ......... 0.001
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_groups .............. 1
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_offset .............. 1000
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_period .............. 1000
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_rounding ............ 0
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_start_bits .......... 16
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_target_bits ......... 8
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_training_enabled .... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_type ................ 0
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   quantize_verbose ............. False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   scheduler_name ............... None
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   scheduler_params ............. None
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   sparse_attention ............. None
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   sparse_gradients_enabled ..... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   steps_per_print .............. 1
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   tensorboard_enabled .......... False
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   tensorboard_job_name ......... DeepSpeedJobName
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   tensorboard_output_path ...... 
[2022-07-13 13:15:52,023] [INFO] [config.py:1063:print]   train_batch_size ............. 1
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   train_micro_batch_size_per_gpu  1
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   use_quantizer_kernel ......... False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   wall_clock_breakdown ......... False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   world_size ................... 1
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_allow_untested_optimizer  False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_config .................. {
    "stage": 0, 
    "contiguous_gradients": true, 
    "reduce_scatter": true, 
    "reduce_bucket_size": 5.000000e+08, 
    "allgather_partitions": true, 
    "allgather_bucket_size": 5.000000e+08, 
    "overlap_comm": false, 
    "load_from_fp32_weights": true, 
    "elastic_checkpoint": false, 
    "offload_param": null, 
    "offload_optimizer": null, 
    "sub_group_size": 1.000000e+09, 
    "prefetch_bucket_size": 5.000000e+07, 
    "param_persistence_threshold": 1.000000e+05, 
    "max_live_parameters": 1.000000e+09, 
    "max_reuse_distance": 1.000000e+09, 
    "gather_16bit_weights_on_model_save": false, 
    "ignore_unused_parameters": true, 
    "round_robin_gradients": false, 
    "legacy_stage1": false
}
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_enabled ................. False
[2022-07-13 13:15:52,024] [INFO] [config.py:1063:print]   zero_optimization_stage ...... 0
[2022-07-13 13:15:52,024] [INFO] [config.py:1071:print]   json = {
    "train_micro_batch_size_per_gpu": 1, 
    "steps_per_print": 1, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 0.05
        }
    }
}
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py37_cu113/utils...
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -isystem /usr/local/lib/python3.7/dist-packages/torch/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.7/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.7/dist-packages/torch/include/THC -isystem /usr/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.7/dist-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.7/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 16.06555199623108 seconds
run_num =  0
Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 690, in train_NN
    Ltrain = criterion(NN_output, NN_train_labels)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (1).
[85b173f58da1:00656] *** Process received signal ***
[85b173f58da1:00656] Signal: Segmentation fault (11)
[85b173f58da1:00656] Signal code: Address not mapped (1)
[85b173f58da1:00656] Failing at address: 0x7f751665320d
[85b173f58da1:00656] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f75192fd980]
[85b173f58da1:00656] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f7518f3c775]
[85b173f58da1:00656] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f75197a7e44]
[85b173f58da1:00656] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f7518f3d605]
[85b173f58da1:00656] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f75197a5cb3]
[85b173f58da1:00656] *** End of error message ***

buttercutter avatar Jul 13 '22 13:07 buttercutter

Set "train_micro_batch_size_per_gpu" to 8 in the configuration file.

tjruwase avatar Jul 13 '22 14:07 tjruwase

May I ask if retain_graph=True is fully supported now?

buttercutter avatar Jul 13 '22 17:07 buttercutter

It should be, but please report any issues.

tjruwase avatar Jul 13 '22 18:07 tjruwase

model_engine.backward(Ltrain, retain_graph=True) gave the following error:

Traceback (most recent call last):
  File "gdas.py", line 947, in <module>
    ltrain = train_NN(graph=graph_, model_engine=model_engine_, forward_pass_only=0)
  File "gdas.py", line 700, in train_NN
    model_engine.backward(Ltrain, retain_graph=True)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
TypeError: backward() got an unexpected keyword argument 'retain_graph'

buttercutter avatar Jul 14 '22 00:07 buttercutter

@tjruwase May I know why retain_graph still does not work for me?

buttercutter avatar Jul 16 '22 03:07 buttercutter

Sorry, it appears #1149 was never merged. Unfortunately, it has a conflict with master. Can you please try picking that up?

tjruwase avatar Jul 16 '22 21:07 tjruwase

@buttercutter, #1149 is now merged. Please try master.

tjruwase avatar Jul 30 '22 17:07 tjruwase

@tjruwase

Why do I get the Expected type 'Module | None', got 'filter[Parameter]' instead error for model_parameters?


buttercutter avatar Aug 27 '22 01:08 buttercutter

This is a type error. Please see the documentation for deepspeed.initialize().
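
For example, one way to satisfy the static type checker is to materialize the filter into a list before passing it (a sketch only, reusing the call from your earlier comment; at runtime the filter object is also accepted, which is why the same code runs elsewhere):

# list() here is only to quiet the IDE's type warning; DeepSpeed hands
# model_parameters to the optimizer, which accepts any iterable of parameters.
parameters = list(filter(lambda p: p.requires_grad, graph_.parameters()))

model_engine_, optimizer, trainloader, _ = deepspeed.initialize(
    args=args_,
    model=graph_,
    model_parameters=parameters,
    training_data=trainset,
    config_params='./ds_config.json')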

tjruwase avatar Aug 28 '22 02:08 tjruwase

The same code works perfectly fine within the Google Colab GPU cloud environment.

So I guess the type error above is due to a local installation issue.

However, DeepSpeed still gives RuntimeError: CUDA out of memory. Could you advise what could have gone wrong?


buttercutter avatar Aug 28 '22 14:08 buttercutter

The same code works perfectly fine within the Google Colab GPU cloud environment.

So I guess the type error above is due to a local installation issue.

This is quite strange. It would be good to figure out what is different about the local and colab installations. Do you mind printing out the types of every parameter passed to deepspeed.initialize()?

tjruwase avatar Aug 29 '22 11:08 tjruwase

Exception: Installed CUDA version 11.7 does not match the version torch was compiled with 10.2, unable to compile cuda/cpp extensions without a matching cuda version.

The local installation seems to have failed with a CUDA/torch version incompatibility.
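
For my own reference, a quick way to compare the two versions locally is something like the snippet below (a sketch; it assumes nvcc is on the PATH):

import subprocess
import torch

# CUDA version that this torch build was compiled against
print("torch compiled with CUDA:", torch.version.cuda)

# CUDA toolkit version installed on the system (what DeepSpeed's JIT-built ops compile against)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)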

The following is the output from the online Google Colab GPU cloud environment.

print("type(args) = ", type(args_))
print("type(graph_) = ", type(graph_))
print("type(parameters) = ", type(parameters))
print("type(trainset) = ", type(trainset))

type(args) =  <class 'argparse.Namespace'>
type(graph_) =  <class '__main__.Graph'>
type(parameters) =  <class 'filter'>
type(trainset) =  <class 'torchvision.datasets.cifar.CIFAR10'>

buttercutter avatar Sep 03 '22 02:09 buttercutter

@tjruwase I see no issue with the initialization code, at least within the working Google Colab GPU cloud environment.

Shall I open a separate GitHub issue, since this is an entirely different problem?

buttercutter avatar Sep 05 '22 15:09 buttercutter

@buttercutter, yes, please open a new issue. Thanks!

tjruwase avatar Sep 05 '22 22:09 tjruwase