
Accelerate a non-HF model, like detectron2

cipri-tom opened this issue 1 year ago · 5 comments

System Info

- `Accelerate` version: 0.19.0
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.13.1+cu116 (True)
- System RAM: 12.68 GB
- GPU type: Tesla T4
- `Accelerate` default config:
	Not found

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

There is a model that embeds images and text, and I'd like to use it on 1 or 2 GPUs. Each GPU has less memory than is needed to run one inference, so I wanted to give Accelerate a try.

The model loading is not entirely controlled by me; it comes from the Detectron2 framework. There are many places where the framework calls `.to(device)`, and I wonder whether that may be the source of the problem.

I am trying to run the model instantiation under `with init_empty_weights():`, but this fails with `Cannot copy out of meta tensor; no data!`.

For reproduction, here is a Colab link with 3 lines added relative to the official one (to switch to Accelerate): https://colab.research.google.com/drive/18UI5JTWlWkYCCKCWAlXeZqOXi46sPIFC?usp=sharing
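Roughly, the change amounts to something like this (a minimal sketch; I'm using the standard Mask R-CNN config from the detectron2 model zoo purely for illustration, the actual model in the Colab may differ):

```python
from accelerate import init_empty_weights
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.modeling import build_model

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)

with init_empty_weights():
    # build_model calls model.to(cfg.MODEL.DEVICE) internally,
    # which trips over the meta tensors:
    # "Cannot copy out of meta tensor; no data!"
    model = build_model(cfg)
```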

The first run takes quite a few minutes (10+) due to the install plus the download of the weights. Note that in Colab it may run fine, as their GPUs are slightly bigger (a T4 has 16 GB, whereas I have 2-4 × 11 GB).

Any pointers for how to run this?

Expected behavior

`init_empty_weights()` should intercept all `model.to()` calls so that it works even when we don't have full control of the model initialization.

I would also appreciate a few more pointers for setting the weights of this model when the `torch.load()` is not in our control. But this is extra; I think I can figure it out.
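In the meantime, I'm considering a workaround along these lines (an untested sketch of my own, not an Accelerate API: temporarily monkey-patch `nn.Module.to` into a no-op around model construction):

```python
import contextlib
import torch.nn as nn

@contextlib.contextmanager
def skip_module_to():
    """Temporarily turn nn.Module.to into a no-op, so a framework that
    calls .to(device) during construction doesn't try to materialise the
    meta tensors created by init_empty_weights()."""
    original_to = nn.Module.to
    nn.Module.to = lambda self, *args, **kwargs: self
    try:
        yield
    finally:
        nn.Module.to = original_to

# Usage (sketch):
#   with init_empty_weights(), skip_module_to():
#       model = build_model(cfg)
```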

cipri-tom avatar May 12 '23 13:05 cipri-tom

OK, it turns out that the `model.to(device)` call was indeed the problem. I removed it in the source framework, and the model was initialised with empty weights successfully.

Now, I am trying to compute a device_map:

```python
device_map = infer_auto_device_map(model, max_memory={0: 5000, 1: 5000})
```

This is failing with `AttributeError: 'Parameter' object has no attribute 'named_children'`.


Debugging (with a `%debug` cell below the one that errored), the error comes from here: https://github.com/huggingface/accelerate/blob/ab379793d44be16d8fcac5c098a3ab9b6f5a7ec3/src/accelerate/utils/modeling.py#LL663C60-L663C60

The module is of type `<class 'torch.nn.parameter.Parameter'>`, which has no `named_children()` method.
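Just to make that concrete, a bare `Parameter` fails the same way (a trivial illustration, not the actual Accelerate code path):

```python
import torch

param = torch.nn.Parameter(torch.empty(3))
# Parameter subclasses Tensor, not nn.Module, so the call below raises
# AttributeError: 'Parameter' object has no attribute 'named_children'
param.named_children()
```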

cipri-tom avatar May 12 '23 15:05 cipri-tom

There is also a bug here: https://github.com/huggingface/accelerate/blob/ab379793d44be16d8fcac5c098a3ab9b6f5a7ec3/src/accelerate/utils/modeling.py#LL767C100-L767C116

When `verbose` is true, it can happen that the current device is the disk, in which case `current_max_size` is `None`, which generates an error.


cipri-tom avatar May 12 '23 15:05 cipri-tom

I can see where the last two issues stem from and can fix them. For the first one, the best we can do is ignore all calls to `.to()` under the context manager, to make sure there is no error. I'm not sure whether that could have other side effects, however.

sgugger avatar May 16 '23 14:05 sgugger

The PR linked above should fix the last two issues if you want to give it a try.

sgugger avatar May 16 '23 18:05 sgugger

Hello,

Thank you for the fast action! Indeed, the two issues are now fixed and I can correctly (and verbosely) compute a device map.

It outputs some negative numbers in the first steps if I pass a `max_memory`; I'm not sure why, but this is not a real problem.

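For reference, this is roughly the call I'm making now; if I read the docs right, `max_memory` also accepts human-readable strings, which may be less error-prone than the raw integers (treated as bytes) I passed earlier:

```python
from accelerate import infer_auto_device_map

# String sizes like "5GiB" are parsed by Accelerate; plain ints are taken as bytes.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "5GiB", 1: "5GiB", "cpu": "24GiB"},
    verbose=True,
)
```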

I'm not sure about the side effects of ignoring `.to()` calls in the context manager. But such calls do seem counterproductive to Accelerate's way of working, so Accelerate should take control of them.

cipri-tom avatar Jun 07 '23 14:06 cipri-tom

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 01 '23 15:07 github-actions[bot]