
could we use load_checkpoint_and_dispatch in a deepspeed framework?

Open henryxiao1997 opened this issue 2 years ago • 1 comment

System Info

- `Accelerate` version: 0.19.0
- Platform: Linux-5.10.173-154.642.amzn2.x86_64
- Python version: 3.9.16
- Numpy version: 1.24.3
- PyTorch version: 2.0.0+cu117

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I initialize with DeepSpeed on 8 GPUs and use load_checkpoint_and_dispatch to load a model that is stored in several shard files. Because every launched process runs the script, load_checkpoint_and_dispatch is called 8 times, and the load fails. What should I do to load the checkpoint only once for all 8 GPUs? Thanks!

The code looks like this:

import os
import deepspeed
import torch
import torch.distributed as dist
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.models.bloom.modeling_bloom import BloomBlock

world_size = torch.cuda.device_count()
local_rank = int(os.getenv('LOCAL_RANK', '0'))
deepspeed.init_distributed("nccl")
rank = dist.get_rank()

tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
config = AutoConfig.from_pretrained(args.model_name_or_path)

# Instantiate on the meta device, then load the sharded checkpoint onto the GPUs.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
model.tie_weights()
model = load_checkpoint_and_dispatch(model, checkpoint=args.model_name_or_path, device_map='balanced_low_0')

ds_model_engine = deepspeed.init_inference(
    model, mp_size=world_size, dtype=torch.float16, replace_with_kernel_inject=False,
    injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")},
)

......
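The repeated loading follows from how the deepspeed launcher works: it starts one process per GPU and every process executes the whole script, so any unguarded call runs once per rank. A minimal illustration of that behavior, assuming the same 8-GPU launch and a hypothetical script name:

```python
# per_rank_demo.py (hypothetical) -- launch with: deepspeed --num_gpus 8 per_rank_demo.py
import deepspeed
import torch.distributed as dist

deepspeed.init_distributed("nccl")
rank = dist.get_rank()

# Every rank reaches this line, so with 8 GPUs it prints 8 times --
# just as a checkpoint-loading call placed here would run 8 times.
print(f"rank {rank}: would call load_checkpoint_and_dispatch here")
```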

Expected behavior

The checkpoint should be loaded only once for all 8 GPUs.

henryxiao1997 avatar Jun 05 '23 16:06 henryxiao1997

load_checkpoint_and_dispatch is intended for naive model parallelism and is not compatible with DeepSpeed.

sgugger avatar Jun 05 '23 16:06 sgugger
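For reference, the usual DeepSpeed-inference recipe does not go through load_checkpoint_and_dispatch at all: each rank loads the weights with from_pretrained, and deepspeed.init_inference then partitions the layers across the GPUs. A minimal sketch under those assumptions, with a placeholder checkpoint id and script name:

```python
# ds_inference_demo.py (hypothetical) -- launch with: deepspeed --num_gpus 8 ds_inference_demo.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.bloom.modeling_bloom import BloomBlock

model_name_or_path = "bigscience/bloom-560m"  # placeholder checkpoint for illustration

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# Each rank loads the full model (on CPU) in half precision ...
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16)

# ... and init_inference shards the injected layers across the available GPUs.
ds_model_engine = deepspeed.init_inference(
    model,
    mp_size=torch.cuda.device_count(),
    dtype=torch.float16,
    replace_with_kernel_inject=False,
    injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")},
)
```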

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 06 '23 15:07 github-actions[bot]