
[BUG/Help] Running ds_train_finetune.sh for full-parameter finetuning raises an error

Open w-tz opened this issue 1 year ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/d2022/qs/wtz/ChatGLM-6B/ptuning/main.py:430 in <module>
│
│   427
│   428
│   429 if __name__ == "__main__":
│ ❱ 430 │   main()
│   431
│
│ /data/d2022/qs/wtz/ChatGLM-6B/ptuning/main.py:369 in main
│
│   366 │   │   # checkpoint = last_checkpoint
│   367 │   │   model.gradient_checkpointing_enable()
│   368 │   │   model.enable_input_require_grads()
│ ❱ 369 │   │   train_result = trainer.train(resume_from_checkpoint=checkpoint)
│   370 │   │   # trainer.save_model()  # Saves the tokenizer too for easy upload
│   371 │
│   372 │   │   metrics = train_result.metrics
│
│ /data/d2022/qs/wtz/ChatGLM-6B/ptuning/trainer.py:1635 in train
│
│   1632 │   │   inner_training_loop = find_executable_batch_size(
│   1633 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
│   1634 │   │   )
│ ❱ 1635 │   │   return inner_training_loop(
│   1636 │   │   │   args=args,
│   1637 │   │   │   resume_from_checkpoint=resume_from_checkpoint,
│   1638 │   │   │   trial=trial,
│
│ /data/d2022/qs/wtz/ChatGLM-6B/ptuning/trainer.py:1704 in _inner_training_loop
│
│   1701 │   │   │   or self.fsdp is not None
│   1702 │   │   )
│   1703 │   │   if args.deepspeed:
│ ❱ 1704 │   │   │   deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
│   1705 │   │   │   │   self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_c
│   1706 │   │   │   )
│   1707 │   │   │   self.model = deepspeed_engine.module
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: deepspeed_init() got an unexpected keyword argument 'resume_from_checkpoint'

Running tokenizer on train dataset:   4%|████▎     | 5/115 [00:04<01:34, 1.16ba/s]
[2023-07-18 09:39:05,315] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 59378
[2023-07-18 09:39:05,315] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 59379
[2023-07-18 09:39:07,453] [ERROR] [launch.py:324:sigkill_handler] ['/home/qs/anaconda3/bin/python', '-u', 'main.py', '--local_rank=1', '--deepspeed', 'deepspeed.json', '--do_train', '--train_file', 'AdvertiseGen/train.json', '--test_file', 'AdvertiseGen/dev.json', '--prompt_column', 'content', '--response_column', 'summary', '--overwrite_cache', '--model_name_or_path', 'THUDM/chatglm-6b', '--output_dir', './output/adgen-chatglm-6b-ft-1e-4', '--overwrite_output_dir', '--max_source_length', '64', '--max_target_length', '64', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--predict_with_generate', '--max_steps', '5000', '--logging_steps', '10', '--save_steps', '1000', '--learning_rate', '1e-4', '--fp16'] exits with return code = 1
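
This TypeError is characteristic of a version mismatch: ptuning/trainer.py is a copy of the Hugging Face Trainer from transformers 4.27.x, but at runtime it imports deepspeed_init from whichever transformers is actually installed, and newer releases (roughly 4.30 onward) removed the resume_from_checkpoint parameter from deepspeed_init(). A quick diagnostic, offered here only as a sketch and assuming transformers.deepspeed is still importable in the installed version:

# Sketch of a diagnostic (not from the original report): checks whether the
# installed transformers still exposes the old deepspeed_init() signature
# that the bundled ptuning/trainer.py (copied from transformers 4.27.x) calls.
import inspect

import transformers
from transformers.deepspeed import deepspeed_init

print("transformers version:", transformers.__version__)
params = inspect.signature(deepspeed_init).parameters
print("accepts resume_from_checkpoint:", "resume_from_checkpoint" in params)

# If this prints False, the installed library is newer than the trainer.py
# copy expects; pinning the version the repo targets restores the old
# signature:
#     pip install transformers==4.27.1

If the check prints False, the environment is importing a newer transformers than the 4.27.1 the issue reports, which would explain the crash despite the stated version.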

Expected Behavior

No response

Steps To Reproduce

pip install deepspeed
bash ds_train_finetune.sh

Environment

- OS: Ubuntu 18.04
- Python: 3.9.13
- Transformers: 4.27.1
- PyTorch: 11.7
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): 11.7
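
The PyTorch and CUDA Support fields above both read 11.7, which looks like the CUDA build rather than what the template asks for (a PyTorch release and a True/False answer). A small check, assuming torch and transformers import cleanly, prints the intended values:

# Prints the values the issue template actually asks for; the 11.7 reported
# above is presumably the CUDA build, not the PyTorch release.
import torch
import transformers

print("PyTorch:", torch.__version__)              # PyTorch release, e.g. 1.13.x or 2.x
print("CUDA build:", torch.version.cuda)          # e.g. 11.7
print("CUDA available:", torch.cuda.is_available())
print("Transformers:", transformers.__version__)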

Anything else?

No response

w-tz · Jul 18 '23, 01:07

same issue.

WJ-Fifth · Oct 26 '23, 03:10

same issue

enddlesswm · Nov 06 '23, 09:11