
Encountering raise ValueError("Integer parameters are unsupported") when using FSDP and load_in_8bit=True

Open markhng525 opened this issue 2 years ago • 3 comments

System Info

- `Accelerate` version: 0.18.0
- Platform: Linux-6.1.24-x86_64-with-glibc2.37
- Python version: 3.10.10
- Numpy version: 1.24.3
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: FSDP
        - mixed_precision: fp16
        - use_cpu: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 4
        - main_process_ip: 0.0.0.0
        - main_process_port: 8080
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch_policy': 'BACKWARD_PRE', 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 1, 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_transformer_layer_cls_to_wrap': 'T5Block'}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I hit this error when launching my training job with the accelerate launcher:

accelerate launch train.py

My Python script is as follows:

from accelerate import Accelerator
from transformers import (
    T5ForConditionalGeneration,
)

MODEL_PATH = "google/flan-t5-small"

def train():
    model_name_or_path = MODEL_PATH
    model = T5ForConditionalGeneration.from_pretrained(
        model_name_or_path,
        device_map="auto",
        load_in_8bit=True,
    )
    accelerator = Accelerator()
    model = accelerator.prepare(model)


if __name__ == "__main__":
    train()

The full stacktrace is as follows:

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /nix/store/0781hi5c3vb0v7h0s701adqgg4531qib-cuda-home/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Traceback (most recent call last):
  File "/home/markh/text-fine-tuning-experiments/./finetune/issue.py", line 21, in <module>
    train()
  File "/home/markh/text-fine-tuning-experiments/./finetune/issue.py", line 17, in train
    model = accelerator.prepare(model)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1122, in prepare
    result = tuple(
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1123, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 977, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1227, in prepare_model
    model = FSDP(model, **kwargs)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1036, in __init__
    self._auto_wrap(auto_wrap_kwargs, fsdp_kwargs)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1291, in _auto_wrap
    _recursive_wrap(**auto_wrap_kwargs, **fsdp_kwargs)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 403, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 421, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), num_params
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py", line 350, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1079, in __init__
    self._fsdp_wrapped_module = FlattenParamsWrapper(
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flatten_params_wrapper.py", line 103, in __init__
    self._flat_param_handle = FlatParamHandle(params, module, device, config)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 270, in __init__
    self._init_flat_param(params, module)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/flat_param.py", line 335, in _init_flat_param
    raise ValueError("Integer parameters are unsupported")
ValueError: Integer parameters are unsupported
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 100044) of binary: /home/markh/text-fine-tuning-experiments/.venv/bin/python
Traceback (most recent call last):
  File "/home/markh/text-fine-tuning-experiments/.devenv/state/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 910, in launch_command
    multi_gpu_launcher(args)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/markh/text-fine-tuning-experiments/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./finetune/issue.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-12_00:10:08
  host      : markh-dev-server-gpu-1a.us-central1-a.c.ml-solutions-371721.inte
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 100044)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Expected behavior

accelerator.prepare should simply return the FSDP-wrapped model instead of raising.

markhng525 avatar May 12 '23 00:05 markhng525

cc @younesbelkada

sgugger avatar May 16 '23 14:05 sgugger

Hi @markhng525, this is not supported, since pure int8 training is not supported. You may want to look into training adapters on top of the model instead. I advise you to check the int8 training examples in the peft library here: https://github.com/huggingface/peft/tree/main/examples/int8_training to see precisely how to do that. In other words, you should first wrap your model into a PeftModel and call prepare afterwards. However, I am unsure whether PeftModel + int8 is supported under FSDP - let us know how it goes.

younesbelkada avatar May 16 '23 14:05 younesbelkada

Same error even after I updated my script to use get_peft_model:

from accelerate import Accelerator
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    T5ForConditionalGeneration,
)

MODEL_PATH = "google/flan-t5-small"

def train():
    model_name_or_path = MODEL_PATH
    model = T5ForConditionalGeneration.from_pretrained(
        model_name_or_path,
        device_map="auto",
        load_in_8bit=True,
    )
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)
    accelerator = Accelerator()
    model = accelerator.prepare(model)


if __name__ == "__main__":
    train()

markhng525 avatar May 19 '23 22:05 markhng525

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 13 '23 15:06 github-actions[bot]