stanford_alpaca: A brief summary of the potential issues during the replication and corresponding solutions

Open puyuanliu opened this issue 2 years ago • 2 comments

1. module transformers has no attribute LLaMATokenizer, or missing key 'llama'.

First install SentencePiece, then install transformers from the Hugging Face git repo, i.e., run pip install sentencepiece followed by pip install git+https://github.com/huggingface/transformers.git. The installation order matters.
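A quick way to confirm the install worked is to probe for both packages and the tokenizer class before launching training. The following is a minimal diagnostic sketch, not part of the original report: the helper name check_llama_support is hypothetical, and it looks for both the LLaMATokenizer spelling from the error message and the LlamaTokenizer spelling used by later transformers releases.

```python
import importlib
import importlib.util


def check_llama_support():
    """Report whether sentencepiece and a LLaMA tokenizer class are visible.

    Hypothetical helper for illustration; it only inspects the current
    Python environment and does not install or modify anything.
    """
    report = {
        "sentencepiece_installed": importlib.util.find_spec("sentencepiece") is not None,
        "transformers_installed": importlib.util.find_spec("transformers") is not None,
        "tokenizer_class": None,
    }
    if report["transformers_installed"]:
        try:
            transformers = importlib.import_module("transformers")
        except Exception:
            # A broken install can fail at import time; treat it as "no class found".
            return report
        # Check the newer spelling first, then the one quoted in the error message.
        for name in ("LlamaTokenizer", "LLaMATokenizer"):
            if hasattr(transformers, name):
                report["tokenizer_class"] = name
                break
    return report


if __name__ == "__main__":
    for key, value in check_llama_support().items():
        print(f"{key}: {value}")
```

If tokenizer_class comes back as None while transformers is installed, the installed transformers build most likely predates LLaMA support, and reinstalling from the git repo as described above is the next thing to try.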

2. CUDA OOM at the beginning of the training.

Use --fp16 True instead of --bf16 True. Lower the per-device batch size and the gradient accumulation steps.
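When changing these settings it helps to track the effective batch size, because peak activation memory depends mainly on the per-device batch size, while gradient accumulation only changes how many micro-batches are summed before each optimizer step. The sketch below shows the arithmetic; the function name and the example numbers are illustrative assumptions, not values taken from this issue.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus):
    """Number of samples contributing to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


# Hypothetical starting point: 4 samples per GPU, 8 accumulation steps, 4 GPUs.
baseline = effective_batch_size(4, 8, 4)

# Hypothetical lower-memory variant: 1 sample per GPU, with more accumulation
# steps so that each optimizer step still sees the same number of samples.
low_memory = effective_batch_size(1, 32, 4)

print(baseline, low_memory)  # prints "128 128"
```

If the effective batch size is allowed to shrink instead, the learning rate may need retuning, so it is worth deciding explicitly which of the two quantities to hold fixed.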

3. CUDA OOM during model saving.

Assuming you are using torch==1.13.0, change python/lib/python3.9/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:2224 from state_dict[fqn] = state_dict[fqn].clone().detach() to state_dict[fqn] = state_dict[fqn].cpu().clone().detach(), so that the copy made during saving is created in CPU memory rather than on the GPU.

This usually happens when using GPUs with less memory (e.g., 40 GB or 24 GB).
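The one-line change above can also be applied as a text substitution instead of a manual edit. The sketch below is only an illustration under stated assumptions: the helper name patch_source is hypothetical, it operates on a string holding the file contents, and reading from or writing back to the actual file in site-packages (after making a backup) is left to the reader.

```python
from typing import Tuple

# The original and replacement statements quoted in the issue.
OLD = "state_dict[fqn] = state_dict[fqn].clone().detach()"
NEW = "state_dict[fqn] = state_dict[fqn].cpu().clone().detach()"


def patch_source(source: str) -> Tuple[str, int]:
    """Return the patched source text and the number of replacements made.

    The function is idempotent: already-patched text contains no
    occurrence of OLD, so a second call changes nothing.
    """
    count = source.count(OLD)
    return source.replace(OLD, NEW), count


if __name__ == "__main__":
    # Stand-in for the relevant line of the FSDP source file.
    sample = "        state_dict[fqn] = state_dict[fqn].clone().detach()\n"
    patched, replaced = patch_source(sample)
    print(replaced)          # number of lines changed
    print(patched, end="")   # the patched line
```

A replacement count of zero on the real file suggests the installed torch version lays out this code differently, in which case the statement has to be located by hand.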

4. How to perform inference?

Refer to https://github.com/tatsu-lab/stanford_alpaca/issues/35#issuecomment-1470985081

5. Generated tokens are not human-readable at inference time.

Assuming your training went well (e.g., training loss < 0.5), your model weights were most likely corrupted during model saving. Make sure no error message appears during saving.

6. Finetuning is slow.

Refer to https://github.com/tatsu-lab/stanford_alpaca/issues/32#issuecomment-1474203699

Mar 17 '23 21:03 puyuanliu

Hello my friend, this issue is like finding treasure. I have a QQ chat group. Are you willing to join and help our Chinese friends? My QQ chat group number is: 397447632

Mar 20 '23 09:03 ZeyuTeng96

Regarding the CUDA OOM during model saving: with Python 3.10, the change should be made in python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py instead.

Apr 03 '23 16:04 datquocnguyen
