
transformers trainer llama Trying to resize storage that is not resizable

Open lw3259111 opened this issue 1 year ago • 7 comments

System Info

transformers==4.28.0.dev0, pytorch==1.13.1

Who can help?

--

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

--

Expected behavior

[screenshot of the traceback attached] I found that this is a bug with AutoModelForCausalLM, because the code that uses this module is unable to load the checkpoint.
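For context on the message itself: PyTorch raises "Trying to resize storage that is not resizable" whenever `resize_` is called on a tensor whose underlying storage cannot grow, e.g. one sharing memory with a NumPy buffer. A minimal, checkpoint-independent way to trigger the same message (illustrative only, not the reporter's code):

```python
import numpy as np
import torch

# A tensor created via from_numpy shares the NumPy array's buffer,
# so its storage cannot be resized in place.
t = torch.from_numpy(np.zeros(4, dtype=np.float32))

try:
    t.resize_(8)  # needs more storage than the fixed buffer provides
except RuntimeError as e:
    print(e)  # prints the "not resizable" storage error
```

During checkpoint loading, the same error surfaces when torch tries to resize such a non-resizable (or otherwise fixed) storage to fit the saved weights.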

lw3259111 avatar Apr 11 '23 11:04 lw3259111

Hi @lw3259111, thanks for raising this issue.

So that we can best try and help, could you provide some more information about how to reproduce this error? Specifically, the following:

  • The running environment and important dependency versions. This can be found running transformers-cli env in your terminal

  • A minimal code snippet to reproduce the error. If, for anonymity, it's not possible to share a checkpoint name, it's OK to do something like the example below. This is so we know, e.g., how the Trainer class is being called and the possible code path triggering this issue.

from transformers import AutoModelForCausalLM

checkpoint = "checkpoint-name" # Dummy name 
model = AutoModelForCausalLM.from_pretrained(checkpoint)

amyeroberts avatar Apr 11 '23 11:04 amyeroberts

@amyeroberts Here is the environment info:

  • transformers version: 4.28.0.dev0
  • Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.13.2
  • Safetensors version: 0.3.0
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

model_name = "checkpoints-1200"

The error (traceback screenshots attached): Trying to resize storage that is not resizable.

lw3259111 avatar Apr 11 '23 13:04 lw3259111

@lw3259111, great, thanks for the additional details. For the checkpoint that's being loaded, which model architecture does it map to i.e. which XxxForCausalLM model?

amyeroberts avatar Apr 11 '23 16:04 amyeroberts

@amyeroberts I want to load a LlamaForCausalLM model, and the same error has been reported in the following links:

https://github.com/tatsu-lab/stanford_alpaca/issues/61#issuecomment-1504117715

https://github.com/lm-sys/FastChat/issues/351

lw3259111 avatar Apr 12 '23 01:04 lw3259111

@lw3259111 Thanks for the additional information. I'm able to load some checkpoints with both of the following:

model = LlamaForCausalLM.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

without this error occurring. So the issue likely relates to the specific weights being loaded, the model configuration, or something else in the environment.

A few questions, comments and suggestions:

  • Looking at the screenshots shared, in the first one in this comment, I can see there is an error being triggered relating to git-lfs not being installed in the environment. Could you try installing or reinstalling git-lfs? It's worthwhile making sure this works, but I doubt this is the issue.
  • In the linked issues, the version of transformers in your env is different from the one in this issue. I'm assuming a typo, but can you confirm the version? Note: the transformers library needs to be installed from source to use the Llama model.
  • When model = AutoModelForCausalLM.from_pretrained(checkpoint, low_cpu_mem_usage=True, **kwargs) is called, could you share what the kwargs are?
  • Following this issue, is the model being loaded one which was saved out after using the Trainer class?

amyeroberts avatar Apr 12 '23 14:04 amyeroberts

@amyeroberts Thank you for your reply. I will answer your questions one by one:

  • git-lfs has been installed in my compute image

  • my transformers version is 4.28.0.dev0 (https://github.com/tatsu-lab/stanford_alpaca/issues/61#issuecomment-1504459664). I wrote the wrong transformers version in that linked comment and have since corrected it

  • kwargs are {'torch_dtype': torch.float16, 'device_map': 'auto', 'max_memory': {0: '13GiB', 1: '13GiB'}}

  • yes, the checkpoint-1200 was saved out after using the Trainer class
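Putting the details above together, the failing load presumably looked like the sketch below. This is a reconstruction from the kwargs reported in this thread, not the reporter's exact script, and "checkpoints-1200" is the local Trainer output directory:

```python
import torch

# kwargs as reported above; the checkpoint name is a local path placeholder
checkpoint = "checkpoints-1200"
load_kwargs = {
    "low_cpu_mem_usage": True,
    "torch_dtype": torch.float16,
    "device_map": "auto",
    "max_memory": {0: "13GiB", 1: "13GiB"},
}

# The actual call (requires transformers and the checkpoint on disk):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(checkpoint, **load_kwargs)
```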

lw3259111 avatar Apr 12 '23 16:04 lw3259111

https://github.com/lm-sys/FastChat/issues/351#issuecomment-1519060027

This is related to https://github.com/lm-sys/FastChat/issues/256#issue-1658116931

sahalshajim avatar Apr 23 '23 12:04 sahalshajim

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 17 '23 15:05 github-actions[bot]