transformers trainer llama Trying to resize storage that is not resizable
System Info
transformers==4.28.0.dev0, pytorch==1.13.1
Who can help?
--
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
--
Expected behavior
I believe this is a bug with AutoModelForCausalLM, because the code that uses this module is unable to load the checkpoint.
Hi @lw3259111, thanks for raising this issue.
So that we can best try and help, could you provide some more information about how to reproduce this error? Specifically the following:

- The running environment and important dependency versions. These can be found by running `transformers-cli env` in your terminal.
- A minimal code snippet to reproduce the error. If, for anonymity, it's not possible to share a checkpoint name, it's OK to do something like the example below. This is so we know how e.g. the `Trainer` class is being called and the possible code path triggering this issue.

```python
from transformers import AutoModelForCausalLM

checkpoint = "checkpoint-name"  # Dummy name
model = AutoModelForCausalLM.from_pretrained(checkpoint)
```
@amyeroberts

```
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.28.0.dev0
- Platform: Linux-4.15.0-208-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.13.2
- Safetensors version: 0.3.0
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
```
```python
model_name = "checkpoints-1200"
```

The error (traceback shared as a screenshot): `Trying to resize storage that is not resizable`
@lw3259111, great, thanks for the additional details. For the checkpoint that's being loaded, which model architecture does it map to, i.e. which `XxxForCausalLM` model?
@amyeroberts
I want to load a `LlamaForCausalLM` model, and the same error has been found in the following links:

- https://github.com/tatsu-lab/stanford_alpaca/issues/61#issuecomment-1504117715
- https://github.com/lm-sys/FastChat/issues/351
@lw3259111 Thanks for the additional information. I'm able to load some checkpoints with both of the following:

```python
model = LlamaForCausalLM.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
```

without this error occurring. So the issue is likely related to the specific weights being loaded, the model configuration, or something else in the environment.
A few questions, comments and suggestions:

- Looking at the screenshots shared, in the first one in this comment, I can see there is an error being triggered relating to `git-lfs` not being installed in the environment. Could you try installing or reinstalling `git-lfs`? It's worthwhile making sure this works, but I doubt this is the issue.
- In the linked issues, the version of transformers in your env is different from in this issue. I'm assuming a typo, but can you confirm the version? Note: the transformers library needs to be installed from source to use the Llama model.
- When `model = AutoModelForCausalLM.from_pretrained(checkpoint, low_cpu_mem_usage=True, **kwargs)` is called, could you share what the kwargs are?
- Following this issue, is the model being loaded one which was saved out after using the `Trainer` class?
@amyeroberts Thank you for your reply. I will answer your questions one by one:

- `git-lfs` has been installed on my machine.
- My transformers version is 4.28.0.dev0. In https://github.com/tatsu-lab/stanford_alpaca/issues/61#issuecomment-1504459664 I made a mistake writing the corresponding transformers version, and I have since corrected it.
- The kwargs are `{'torch_dtype': torch.float16, 'device_map': 'auto', 'max_memory': {0: '13GiB', 1: '13GiB'}}`.
- Yes, the checkpoint-1200 was saved out after using the `Trainer` class.
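Putting the details from this thread together, the failing load presumably looks something like the sketch below. The helper names are mine, and the kwargs are the ones reported above; note that recent transformers versions also accept the string `"float16"` for `torch_dtype`, which this sketch uses so the kwargs can be assembled without torch installed:

```python
def build_load_kwargs(max_gib_per_gpu=13, n_gpus=2):
    """Assemble the from_pretrained kwargs reported in this thread."""
    return {
        "torch_dtype": "float16",
        "device_map": "auto",
        "max_memory": {i: f"{max_gib_per_gpu}GiB" for i in range(n_gpus)},
        "low_cpu_mem_usage": True,
    }


def load_model(checkpoint="checkpoints-1200"):
    """Reload a Trainer-saved checkpoint the way described in this thread."""
    # Imported lazily so the kwargs above can be inspected without transformers.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(checkpoint, **build_load_kwargs())
```

With `device_map="auto"` and `max_memory`, the weights are sharded across the two 13 GiB GPUs by accelerate rather than loaded onto a single device.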
https://github.com/lm-sys/FastChat/issues/351#issuecomment-1519060027
This is related to https://github.com/lm-sys/FastChat/issues/256#issue-1658116931
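For what it's worth, the `RuntimeError` in the title comes from PyTorch itself, not from transformers: it is raised whenever something tries to grow a tensor whose storage is fixed-size, e.g. a tensor backed by an external buffer or a memory-mapped checkpoint file. A minimal sketch of the same error, independent of any checkpoint (assuming only that torch is installed):

```python
import torch

# A tensor built over an external buffer has non-resizable storage,
# similar to weights memory-mapped from a checkpoint file.
buf = bytearray(16)  # room for 4 float32 values
t = torch.frombuffer(buf, dtype=torch.float32)

try:
    t.resize_(8)  # needs 32 bytes, so the storage would have to grow
except RuntimeError as e:
    print(e)  # e.g. "Trying to resize storage that is not resizable"
```

This is why the error tends to surface only with particular weight files or load paths: the same `resize_`/copy succeeds when the destination tensor owns ordinary resizable storage.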
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.