
The model "llava-7b-llama-2-7b-chat" merged by myself had problems during training.

zhangyupeng123 opened this issue · 6 comments

Hello, we merged the model "zhangyupeng/llava-7b-llama-2-7b-chat" ourselves. Training runs on two RTX 3090 GPUs with batch_size=2 and grad_accumulation_steps=40. The following error appears during training. Could it be caused by the model we merged ourselves?

Traceback (most recent call last):
  File "/home/zhangyupeng/anaconda3/envs/lisa/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhangyupeng/anaconda3/envs/lisa/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/mnt/21T/zhangyupeng/code/LISA/utils/dataset.py", line 135, in collate_fn
    assert cur_len == total_len
AssertionError

[2023-09-05 20:58:14,118] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 77018
[2023-09-05 20:58:14,119] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 77019
[2023-09-05 20:58:15,023] [ERROR] [launch.py:321:sigkill_handler] ['/home/zhangyupeng/anaconda3/envs/lisa/bin/python', '-u', 'train_ds.py', '--local_rank=1'] exits with return code = 1
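For context, the failing assertion compares two token counts that are expected to agree. Below is a minimal sketch of the idea, assuming the collate function follows the usual LLaVA-style preprocessing (tokenize the full conversation once to get total_len, then tokenize each round again to know how much of the target to mask, accumulating cur_len). This is an illustration with a placeholder MODEL_PATH and a made-up conversation, not the actual utils/dataset.py code:

# Illustrative sketch only -- not the actual LISA collate_fn.
import transformers

MODEL_PATH = "path/to/llava-7b-llama-2-7b-chat"  # placeholder for the merged model
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=False)

conversation = "USER: What is in the image? ASSISTANT: It is a cat.</s>"

# Pass 1: tokenize the whole conversation once.
total_len = len(tokenizer(conversation).input_ids)

# Pass 2: tokenize round by round to decide how many target tokens to mask.
cur_len = 1  # the single BOS token of the full sequence
for rou in conversation.split("</s>"):
    if not rou:
        continue
    round_ids = tokenizer(rou).input_ids
    cur_len += len(round_ids) - 1  # drop the per-round BOS
    cur_len += 1                   # add back the "</s>" removed by split()

# The real collate_fn asserts cur_len == total_len; if the tokenizer's space
# handling differs between the two passes (e.g. legacy=False vs legacy=True),
# the counts can disagree and the assert trips.
print(cur_len, total_len)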

zhangyupeng123 · Sep 05 '23

I think this is caused by the datasets. Can you check whether the datasets are correctly organized?

X-Lai · Sep 05 '23

Hi @X-Lai, after we downloaded and unzipped the dataset, we renamed the files as you described and uploaded them to the server. Are you saying other changes need to be made?

zhangyupeng123 · Sep 06 '23

I suspect this is a model problem: I can run llama-13b but not the merged llama-7b, and I don't know how to solve it.

dddraxxx · Oct 02 '23

I solved this. Just add legacy=True here:

tokenizer = transformers.AutoTokenizer.from_pretrained(
    args.version,
    cache_dir=None,
    model_max_length=args.model_max_length,
    padding_side="right",
    use_fast=False,
    legacy=True,  # keep the older SentencePiece behavior that the training code expects
)

Refer to link
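If you want to confirm that the legacy flag is what changes tokenization for your merged checkpoint, a small comparison along these lines may help (a sketch, not part of the original comment; MODEL_PATH is a placeholder for the merged model directory):

# Sketch: compare token counts with legacy=True vs legacy=False on text that
# contains a special token, to see whether the two settings tokenize the
# merged checkpoint differently. MODEL_PATH is a placeholder.
import transformers

MODEL_PATH = "path/to/llava-7b-llama-2-7b-chat"
sample = "USER: hello ASSISTANT: hi</s>USER: how are you ASSISTANT:"

for legacy in (True, False):
    tok = transformers.AutoTokenizer.from_pretrained(
        MODEL_PATH, use_fast=False, legacy=legacy
    )
    print(f"legacy={legacy}: {len(tok(sample).input_ids)} tokens")

If the two counts differ, that difference alone is enough to throw off the length bookkeeping in collate_fn.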

dddraxxx · Oct 06 '23

Worked for me, thanks!

AmrinKareem · Oct 26 '23

> Worked for me, thanks!

Dear @AmrinKareem,

I ran into the same issue, and this solution works for me as well. May I ask whether it will affect the results of training the LISA model?

It would be super helpful for me to know.

Best regards and many thanks,

Amazingren · Aug 17 '24