
ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils' (transformers 4.49.0)

Open · shing100 opened this issue 9 months ago · 21 comments

Please check that this issue hasn't been reported before.

  • [x] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

When merging a DPO QLoRA model, I encountered the following error:

ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'

This issue does not occur with transformers 4.46.3, but it happens with 4.49.0.

Environment

  • Python: 3.11.12
  • transformers: 4.49.0
  • axolotl: 0.7.1

Steps to Reproduce

  1. Install transformers 4.49.0
  2. Run the following import (a quick diagnostic check is sketched after these steps):
    from transformers.modeling_utils import shard_checkpoint
    
  3. See the ImportError
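For reference, a minimal diagnostic sketch (not part of axolotl) that prints the installed transformers version and checks whether the symbol is still importable:

    import transformers

    print("transformers version:", transformers.__version__)
    try:
        from transformers.modeling_utils import shard_checkpoint  # noqa: F401
        print("shard_checkpoint is importable")
    except ImportError as exc:
        print("ImportError:", exc)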

Expected behavior
The function shard_checkpoint should be accessible as it was in transformers 4.46.3.

Additional context
Has shard_checkpoint been deprecated or moved in transformers 4.49.0? If so, what is the recommended alternative?
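For context, the deprecation notice in transformers points to split_torch_state_dict_into_shards from huggingface_hub as the replacement. Below is a rough sketch (not axolotl's code) of falling back to it; the exact return type and attribute names should be double-checked against the installed huggingface_hub version:

    import torch

    # Toy state dict standing in for a merged model's weights.
    state_dict = {"linear.weight": torch.zeros(4, 4), "linear.bias": torch.zeros(4)}

    try:
        # Old helper, removed from recent transformers releases.
        from transformers.modeling_utils import shard_checkpoint
        shards, index = shard_checkpoint(state_dict, max_shard_size="5GB")
    except ImportError:
        # Replacement suggested by the deprecation notice.
        from huggingface_hub import split_torch_state_dict_into_shards
        split = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")
        # Group tensors by target shard file (assumes a filename_to_tensors mapping).
        shards = {
            filename: {name: state_dict[name] for name in tensors}
            for filename, tensors in split.filename_to_tensors.items()
        }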

Current behaviour

ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'

Steps to reproduce

  1. Train with DPO + QLoRA
  2. python3 -m axolotl.cli.merge_lora .../...yaml --lora_model_dir:"/../../../"

Config yaml


Possible solution

No response

Which Operating Systems are you using?

  • [x] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • [x] My issue title is concise, descriptive, and in title casing.
  • [x] I have searched the existing issues to make sure this bug has not been reported yet.
  • [x] I am using the latest version of axolotl.
  • [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.

shing100 avatar Mar 06 '25 10:03 shing100

@shing100 , hey, could you provide us with the stack trace for this? I don't see any explicit calls to that function on our end.

NanoCode012 avatar Mar 07 '25 11:03 NanoCode012

They had a deprecation warning for this for a while as well: https://github.com/huggingface/transformers/blob/c2820c94916e34baf4486accae74760972183a2f/src/transformers/modeling_utils.py#L400-L403

NanoCode012 avatar Mar 07 '25 11:03 NanoCode012

[Image: stack trace screenshot]

I installed it by pulling the repo with git. I will delete it and reinstall.

shing100 avatar Mar 09 '25 02:03 shing100

After updating the repository, multi-GPU training is using far too much memory. It takes 343 GB of memory to train an 8B model, which is clearly wrong. Previously, training a 32B model on 16x H100 (2 nodes, DeepSpeed ZeRO-3 + Liger Kernel) was possible, but now it is not.

shing100 avatar Mar 10 '25 04:03 shing100

Looking at the stack trace, are you doing AWQ training?

After updating the repository, multi-GPU training is using far too much memory. It takes 343 GB of memory to train an 8B model, which is clearly wrong.

Do you perhaps have logs showing which commit you were on before and after the update? Or are you able to narrow it down to potential offending commits? That's too much memory usage for an 8B model.

NanoCode012 avatar Mar 10 '25 06:03 NanoCode012

The first picture above shows the error when using the merge method after DPO training with QLoRA.

[Image: memory usage screenshot]

When SFT training a 7.8B model with 2 nodes (8x H100 each), we use a total of 454.08 GiB.

  • Liger Kernel + DeepSpeed ZeRO-3
  • micro batch size 1
  • sequence_len 8192

Training the 32B model with the same settings results in OOM.
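For scale, here is a rough back-of-envelope sketch of ZeRO-3 model-state memory, assuming bf16 params/grads plus fp32 Adam master weights and moments (about 16 bytes per parameter, fully sharded); activations, buffers, and allocator fragmentation are ignored:

    # Rough ZeRO-3 model-state estimate; assumptions noted above, not a measurement.
    def zero3_model_state_gib(num_params: float, num_gpus: int, bytes_per_param: int = 16) -> float:
        """Approximate per-GPU model-state memory (GiB) under full ZeRO-3 sharding."""
        return num_params * bytes_per_param / num_gpus / 1024**3

    per_gpu = zero3_model_state_gib(7.8e9, num_gpus=16)
    print(f"~{per_gpu:.1f} GiB per GPU, ~{per_gpu * 16:.0f} GiB across all ranks")
    # ~7.3 GiB per GPU, ~116 GiB across all ranks

Under those assumptions, model states alone come to roughly 116 GiB total across 16 GPUs, so most of the 454 GiB above would have to be activations or something misconfigured.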

shing100 avatar Mar 10 '25 06:03 shing100

Hey, let's try to tackle the original issue first.

Does the shard_checkpoint issue still exist on merge? Maybe run a short training with max_steps: 10 or save_steps: 5 and then try the merge?
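As a stopgap (this is not the axolotl CLI path), the adapter could also be merged directly with peft. A minimal sketch, with hypothetical paths that would need to be substituted:

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical placeholders; replace with your base model ID and LoRA output dir.
    base_id = "base-model-id"
    lora_dir = "/path/to/lora/output"
    merged_dir = "/path/to/merged"

    # Load the base model in bf16 (not 4-bit) so the merged weights can be saved.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, lora_dir)
    merged = model.merge_and_unload()

    merged.save_pretrained(merged_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_dir)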

Regarding the second issue, do you have a timeframe for when you upgraded your codebase, so we can track down this problem?

NanoCode012 avatar Mar 10 '25 07:03 NanoCode012

The first issue has been resolved since the reinstallation, thank you.

I updated straight from v0.5.0 this time, so tracking it down is difficult.

shing100 avatar Mar 10 '25 09:03 shing100

The first issue has been resolved since the reinstallation, thank you.

Thanks for clarifying.

I updated straight from v0.5.0 this time, so tracking it down is difficult.

That will be quite hard to track down, as there have been a lot of changes since then, hmm.

NanoCode012 avatar Mar 10 '25 09:03 NanoCode012

Then I'll work around it for now by rolling back and using the older version.

shing100 avatar Mar 10 '25 09:03 shing100

Yep, I'd recommend keeping the old transformers / peft versions if those features work for you!

NanoCode012 avatar Mar 10 '25 09:03 NanoCode012

https://github.com/huggingface/trl/issues/2864

shing100 avatar Mar 11 '25 08:03 shing100

What base model are you using? MPT? Are you using a pre-quantized base model?

winglian avatar Mar 11 '25 13:03 winglian

Can you share a full YAML configuration? I've been able to SFT a 70B on 2x8xH100 using ZeRO-3 + liger as well with minimal problems at a sequence length of 16k and a batch size of 4.

winglian avatar Mar 11 '25 13:03 winglian

I'm training llamafied models. The same issue occurs with Qwen2.5.

I'm using the settings below, with zero3.json passed as the --deepspeed option.

Please let me know if there is anything wrong with the settings I am using.

base_model: Qwen/Qwen2.5-32B-Instruct

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

strict: false

chat_template: tokenizer_default
datasets:
  - path: CarrotAI/Korean-Common
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

#dataset_exact_deduplication: true

default_system_message: "You are a helpful assistant. Please give a long and detailed answer."
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: /output/sft

# 16384, 8192
sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name: 
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 7e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_batch_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

shing100 avatar Mar 11 '25 23:03 shing100

Here is the accelerate config I'm using:

cache_dir: ~/cache
environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /data/axolotl/deepspeed_configs/zero3.json
  deepspeed_hostfile: /data/axolotl/hosts/hostfile
  deepspeed_multinode_launcher: pdsh
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: [main_ip]
main_process_port: [main_port]
main_training_function: main
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Is there anything odd about my config? Could you share a config file that works correctly?

shing100 avatar Apr 27 '25 07:04 shing100

Hey @shing100 , have you tried giving the latest git version another run to see if the problem is solved? (Make sure to install the dependencies that go with it.)

NanoCode012 avatar Apr 28 '25 03:04 NanoCode012

It works fine if I go back and run it with the 0.4.1 version. It also works with cpu_offload in the current version, but that takes a very long time.

shing100 avatar Apr 28 '25 05:04 shing100

@shing100 , do you mean 0.8.1? What do you mean by going back?

NanoCode012 avatar Apr 28 '25 05:04 NanoCode012

Yes, the 0.8.1 version.

shing100 avatar Apr 28 '25 05:04 shing100

@shing100 , sorry, could you remind us what the current issue is now? It sounds like the original issue is fixed. Would you mind making a separate issue to keep things organized?

NanoCode012 avatar Apr 28 '25 05:04 NanoCode012