
ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils' (transformers 4.49.0)

Open · shing100 opened this issue 9 months ago · 21 comments

Please check that this issue hasn't been reported before.

  • [x] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

When merging a DPO QLoRA model, I encountered the following error:

ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'

This issue does not occur with transformers 4.46.3, but it happens with 4.49.0.

Environment

  • Python: 3.11.12
  • transformers: 4.49.0
  • axolotl: 0.7.1

Steps to Reproduce

  1. Install transformers 4.49.0
  2. Run the following import (a quick diagnostic check is sketched after these steps):
    from transformers.modeling_utils import shard_checkpoint
    
  3. See the ImportError
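For reference, a minimal diagnostic sketch (not part of axolotl) that prints the installed transformers version and checks whether the symbol is still importable:

    import transformers

    print("transformers version:", transformers.__version__)
    try:
        from transformers.modeling_utils import shard_checkpoint  # noqa: F401
        print("shard_checkpoint is importable")
    except ImportError as exc:
        print("ImportError:", exc)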

Expected behavior
The function shard_checkpoint should be accessible as it was in transformers 4.46.3.

Additional context
Has shard_checkpoint been deprecated or moved in transformers 4.49.0? If so, what is the recommended alternative?
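For context, the deprecation notice in transformers points to split_torch_state_dict_into_shards from huggingface_hub as the replacement. Below is a rough sketch (not axolotl's code) of falling back to it; the exact return type and attribute names should be double-checked against the installed huggingface_hub version:

    import torch

    # Toy state dict standing in for a merged model's weights.
    state_dict = {"linear.weight": torch.zeros(4, 4), "linear.bias": torch.zeros(4)}

    try:
        # Old helper, removed from recent transformers releases.
        from transformers.modeling_utils import shard_checkpoint
        shards, index = shard_checkpoint(state_dict, max_shard_size="5GB")
    except ImportError:
        # Replacement suggested by the deprecation notice.
        from huggingface_hub import split_torch_state_dict_into_shards
        split = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")
        # Group tensors by target shard file (assumes a filename_to_tensors mapping).
        shards = {
            filename: {name: state_dict[name] for name in tensors}
            for filename, tensors in split.filename_to_tensors.items()
        }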

Current behaviour

ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'

Steps to reproduce

  1. Train with DPO + QLoRA
  2. python3 -m axolotl.cli.merge_lora .../...yaml --lora_model_dir:"/../../../"

Config yaml


Possible solution

No response

Which Operating Systems are you using?

  • [x] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • [x] My issue title is concise, descriptive, and in title casing.
  • [x] I have searched the existing issues to make sure this bug has not been reported yet.
  • [x] I am using the latest version of axolotl.
  • [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.

shing100 avatar Mar 06 '25 10:03 shing100

@shing100 , hey, could you provide us with the stack trace for this? I don't see any explicit calls to that function on our end.

NanoCode012 avatar Mar 07 '25 11:03 NanoCode012

They had a deprecation warning for this for a while as well: https://github.com/huggingface/transformers/blob/c2820c94916e34baf4486accae74760972183a2f/src/transformers/modeling_utils.py#L400-L403

NanoCode012 avatar Mar 07 '25 11:03 NanoCode012

[Image: stack trace screenshot]

I installed it by pulling the repo with git. I will delete it and reinstall.

shing100 avatar Mar 09 '25 02:03 shing100

After updating the repository, multi-GPU training is using far too much memory. It takes 343 GB of memory to train an 8B model, which is clearly wrong. Previously, training a 32B model on 16x H100 (2 nodes, DeepSpeed ZeRO-3 + Liger Kernel) was possible, but now it is not.

shing100 avatar Mar 10 '25 04:03 shing100

Looking at the stack trace, are you doing AWQ training?

After updating the repository, multi-GPU training is using far too much memory. It takes 343 GB of memory to train an 8B model, which is clearly wrong.

Do you perhaps have logs showing which commit you were on before and after the update? Or are you able to narrow it down to potential offending commits? That's too much memory usage for an 8B model.

NanoCode012 avatar Mar 10 '25 06:03 NanoCode012

The first picture above shows the error when using the merge method after DPO training with QLoRA.

[Image: memory usage screenshot]

When SFT training a 7.8B model with 2 nodes (8x H100 each), we use a total of 454.08 GiB.

  • Liger Kernel + DeepSpeed ZeRO-3
  • micro batch size 1
  • sequence_len 8192

Training the 32B model with the same settings results in OOM.
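For scale, here is a rough back-of-envelope sketch of ZeRO-3 model-state memory, assuming bf16 params/grads plus fp32 Adam master weights and moments (about 16 bytes per parameter, fully sharded); activations, buffers, and allocator fragmentation are ignored:

    # Rough ZeRO-3 model-state estimate; assumptions noted above, not a measurement.
    def zero3_model_state_gib(num_params: float, num_gpus: int, bytes_per_param: int = 16) -> float:
        """Approximate per-GPU model-state memory (GiB) under full ZeRO-3 sharding."""
        return num_params * bytes_per_param / num_gpus / 1024**3

    per_gpu = zero3_model_state_gib(7.8e9, num_gpus=16)
    print(f"~{per_gpu:.1f} GiB per GPU, ~{per_gpu * 16:.0f} GiB across all ranks")
    # ~7.3 GiB per GPU, ~116 GiB across all ranks

Under those assumptions, model states alone come to roughly 116 GiB total across 16 GPUs, so most of the 454 GiB above would have to be activations or something misconfigured.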

shing100 avatar Mar 10 '25 06:03 shing100

Hey, let's try to tackle the original issue first.

Does the shard_checkpoint issue still exist on merge? Maybe run a short training with max_steps: 10 or save_steps: 5 and then try the merge?
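As a stopgap (this is not the axolotl CLI path), the adapter could also be merged directly with peft. A minimal sketch, with hypothetical paths that would need to be substituted:

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical placeholders; replace with your base model ID and LoRA output dir.
    base_id = "base-model-id"
    lora_dir = "/path/to/lora/output"
    merged_dir = "/path/to/merged"

    # Load the base model in bf16 (not 4-bit) so the merged weights can be saved.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, lora_dir)
    merged = model.merge_and_unload()

    merged.save_pretrained(merged_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_dir)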

Regarding the second issue, do you have a timeframe for when you upgraded your codebase, so we can track down this problem?

NanoCode012 avatar Mar 10 '25 07:03 NanoCode012

The first issue has been resolved since the reinstallation, thank you.

I updated straight from v0.5.0 this time, so tracking it down is difficult.

shing100 avatar Mar 10 '25 09:03 shing100

The first issue has been resolved since the reinstallation, thank you.

Thanks for clarifying.

I updated straight from v0.5.0 this time, so tracking it down is difficult.

That will be quite hard to track down, as there have been a lot of changes since then, hmm.

NanoCode012 avatar Mar 10 '25 09:03 NanoCode012

Then I'll work around it for now by rolling back and using the older version.

shing100 avatar Mar 10 '25 09:03 shing100

Yep, I'd recommend keeping the old transformers / peft versions if those features work for you!

NanoCode012 avatar Mar 10 '25 09:03 NanoCode012

https://github.com/huggingface/trl/issues/2864

shing100 avatar Mar 11 '25 08:03 shing100

What base model are you using? MPT? Are you using a pre-quantized base model?

winglian avatar Mar 11 '25 13:03 winglian

Can you share a full YAML configuration? I've been able to SFT a 70B on 2x8xH100 using ZeRO-3 + liger as well with minimal problems at a sequence length of 16k and a batch size of 4.

winglian avatar Mar 11 '25 13:03 winglian

I'm training llamafied models. The same issue occurs with Qwen2.5.

I'm using the settings below, with zero3.json passed as the --deepspeed option.

Please let me know if there is anything wrong with the settings I am using.

base_model: Qwen/Qwen2.5-32B-Instruct

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

strict: false

chat_template: tokenizer_default
datasets:
  - path: CarrotAI/Korean-Common
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

#dataset_exact_deduplication: true

default_system_message: "You are a helpful assistant. Please give a long and detailed answer."
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: /output/sft

# 16384, 8192
sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name: 
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 7e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 0
eval_table_size:
eval_batch_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:

shing100 avatar Mar 11 '25 23:03 shing100

Here is the accelerate config I'm using:

cache_dir: ~/cache
environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /data/axolotl/deepspeed_configs/zero3.json
  deepspeed_hostfile: /data/axolotl/hosts/hostfile
  deepspeed_multinode_launcher: pdsh
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: [main_ip]
main_process_port: [main_port]
main_training_function: main
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Is there anything odd about my config? Could you share a config file that works correctly?

shing100 avatar Apr 27 '25 07:04 shing100

Hey @shing100 , have you tried giving the latest git version another run to see if the problem is solved? (Make sure to install the dependencies that go with it.)

NanoCode012 avatar Apr 28 '25 03:04 NanoCode012

It works fine if I go back and run it with the 0.4.1 version. It also works with cpu_offload in the current version, but that takes a very long time.

shing100 avatar Apr 28 '25 05:04 shing100

@shing100 , do you mean 0.8.1? What do you mean by going back?

NanoCode012 avatar Apr 28 '25 05:04 NanoCode012

Yes, the 0.8.1 version.

shing100 avatar Apr 28 '25 05:04 shing100

@shing100 , sorry, could you remind us what the current issue is now? It sounds like the original issue is fixed. Would you mind making a separate issue to keep things organized?

NanoCode012 avatar Apr 28 '25 05:04 NanoCode012