ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils' (transformers 4.49.0)
Please check that this issue hasn't been reported before.
- [x] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
When merging a DPO QLoRA model, I encountered the following error:
ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'
This issue does not occur with transformers 4.46.3, but it happens with 4.49.0.
Environment
- Python: 3.11.12
- transformers: 4.49.0
- axolotl: 0.7.1
Steps to Reproduce
- Install transformers 4.49.0
- Run the following import (a self-contained snippet is included below):
  from transformers.modeling_utils import shard_checkpoint
- See the ImportError
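For completeness, a self-contained reproduction sketch (the version print is only for context; the failing import is the one listed above):

```python
# Reproduction sketch: the import below works on transformers 4.46.3 but,
# per this report, raises ImportError on 4.49.0.
import transformers

print("transformers version:", transformers.__version__)

from transformers.modeling_utils import shard_checkpoint  # ImportError on 4.49.0
```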
Expected behavior
The function shard_checkpoint should be accessible as it was in transformers 4.46.3.
Additional context
Has shard_checkpoint been deprecated or moved in transformers 4.49.0? If so, what is the recommended alternative?
Current behaviour
ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'
Steps to reproduce
- Run a DPO QLoRA training
- python3 -m axolotl.cli.merge_lora .../...yaml --lora_model_dir="/../../../"
Config yaml
Possible solution
No response
Which Operating Systems are you using?
- [x] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.11
axolotl branch-commit
main
Acknowledgements
- [x] My issue title is concise, descriptive, and in title casing.
- [x] I have searched the existing issues to make sure this bug has not been reported yet.
- [x] I am using the latest version of axolotl.
- [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.
@shing100 , hey, could you provide us with the stack trace for this? I don't see any explicit calls to that function on our end.
transformers also had a deprecation warning for it for a while: https://github.com/huggingface/transformers/blob/c2820c94916e34baf4486accae74760972183a2f/src/transformers/modeling_utils.py#L400-L403
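If I recall correctly, that deprecation notice points to split_torch_state_dict_into_shards from huggingface_hub as the replacement. A minimal sketch of what calling code could migrate to (argument and attribute names are per my reading of huggingface_hub, so worth verifying against the installed version):

```python
# Hedged sketch of the migration path the deprecation notice suggests:
# huggingface_hub.split_torch_state_dict_into_shards instead of the removed
# transformers.modeling_utils.shard_checkpoint.
import torch
from huggingface_hub import split_torch_state_dict_into_shards
from safetensors.torch import save_file

state_dict = {"linear.weight": torch.zeros(1024, 1024)}  # e.g. model.state_dict()

split = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")
for shard_file, tensor_names in split.filename_to_tensors.items():
    # Gather the tensors assigned to this shard and write them out.
    shard = {name: state_dict[name] for name in tensor_names}
    save_file(shard, shard_file)
```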
I installed it from a git pull of the repository. I'll remove it and install it again.
After updating the repository, multi-GPU training uses far too much memory: it takes 343 GB to train an 8B model, which is clearly wrong. Previously, training a 32B model on 16x H100 (2 nodes, DeepSpeed ZeRO-3 + Liger kernel) was possible, but now it is not.
Looking at the stack trace, are you doing AWQ training?
> After updating the repository, multi-GPU training uses far too much memory: it takes 343 GB to train an 8B model, which is clearly wrong.
Do you perhaps have logs showing which commit you were on before and after the update? Or are you able to narrow it down to potential offending commits? That's too much memory usage for 8B.
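For a rough sense of scale, here is a back-of-envelope sketch (my own helper, not axolotl code) of the ZeRO-3 model-state footprint, assuming Adam with bf16 mixed precision; activations, buffers, and fragmentation are not included:

```python
# Back-of-envelope ZeRO-3 model-state estimate (hypothetical helper, not axolotl code).
# Assumes Adam + bf16 mixed precision: ~2 B (bf16 param) + 2 B (bf16 grad)
# + 12 B (fp32 master weight + two Adam moments) = ~16 bytes per parameter,
# sharded evenly across all ranks. Activations are NOT included.

def zero3_model_state_gb_per_gpu(num_params: float, num_gpus: int,
                                 bytes_per_param: float = 16.0) -> float:
    return num_params * bytes_per_param / 1e9 / num_gpus

if __name__ == "__main__":
    for billions in (8, 32):
        per_gpu = zero3_model_state_gb_per_gpu(billions * 1e9, num_gpus=16)
        print(f"{billions}B params on 16 GPUs: ~{per_gpu:.0f} GB/GPU of model states (+ activations)")
```

By this rough estimate, sharded model states for an 8B model on 16 ranks are only around 8 GB per GPU, so most of the reported usage would have to come from activations or from something not being sharded as expected.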
The first screenshot above shows the error from the merge step after DPO training with QLoRA.
When SFT training a 7.8B model on 2 nodes (8x H100 each), it uses a total of 454.08 GiB.
- Liger kernel + DeepSpeed ZeRO-3
- micro_batch_size 1
- sequence_len 8192
Training the 32B model with the same settings results in OOM.
Hey, let's try to tackle the original issue first.
Does the shard_checkpoint issue still exist on merge? Maybe run a training with max_steps: 10 or save_steps: 5 and then try the merge?
Regarding the second issue, do you have a rough timeframe for when you upgraded your codebase, so it's possible to track down this problem?
The first issue has been resolved since the reinstallation, thank you.
I was on v0.5.0 before this update, so it's difficult to narrow down.
> The first issue has been resolved since the reinstallation, thank you.
Thanks for clarifying.
> I was on v0.5.0 before this update, so it's difficult to narrow down.
That'll be quite tricky to track down, as there have been a lot of changes since then, hmm.
Then for now I'll work around it by rolling back to the older version.
Yep, I'd recommend keeping the old transformers / peft versions if those work for you!
https://github.com/huggingface/trl/issues/2864
What base model are you using? MPT? Are you using a pre-quantized base model?
Can you share a full YAML configuration? I've been able to SFT a 70B on 2x8xH100 using ZeRO-3 + Liger as well with minimal problems, at a sequence length of 16k and a batch size of 4.
I'm training llamafied models. The same issue occurs with Qwen2.5.
I'm using the settings below, with zero3.json passed via the --deepspeed option.
Please let me know if there is anything wrong with the settings I am using.
base_model: Qwen/Qwen2.5-32B-Instruct
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true
strict: false
chat_template: tokenizer_default
datasets:
  - path: CarrotAI/Korean-Common
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
#dataset_exact_deduplication: true
default_system_message: "You are a helpful assistant. Please give a long and detailed answer."
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: /output/sft
# 16384, 8192
sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 7e-6
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_batch_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
cache_dir: ~/cache
environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /data/axolotl/deepspeed_configs/zero3.json
  deepspeed_hostfile: /data/axolotl/hosts/hostfile
  deepspeed_multinode_launcher: pdsh
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: [main_ip]
main_process_port: [main_port]
main_training_function: main
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Is there anything odd about my config file? Could you share a config file that works normally for you?
Hey @shing100 , have you tried another run from the latest git version to see if the problem is solved? (Make sure to install the dependencies for that version as well.)
It works fine if I go back to version 0.4.1. It also runs with cpu_offload in the current version, but that takes a very long time.
@shing100 , do you mean 0.8.1? And what do you mean by going back?
Yes, the 0.8.1 version.
@shing100 , sorry, could you remind us what the current issue is now? It sounds like the original issue is fixed. Would it make sense to open a separate issue to keep things organized?