xianghuisun comments

Results: 68 comments by xianghuisun

Finetune error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9)

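> I previously hit OOM on V100 32G*4 as well, and after switching to a machine with A100 80G*1 the run completed normally. Now V100 32G*4 reports the error in the title. With two machines, 8 V100 32G in total, launched with scripts/multinode_run.sh, the same error appears. Is this caused by insufficient GPU memory? No other logs are printed.
>
> ```
> [INFO|modeling_utils.py:2263] 2023-05-29 11:08:21,313 >> Offline mode: forcing local_files_only=True
> [INFO|modeling_utils.py:2531] 2023-05-29 11:08:21,313 >> loading weights...
> ```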
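For context on the error in the title: an exit code of -9 reported by the launcher means the worker process was terminated by signal 9 (SIGKILL). A CUDA out-of-memory failure normally surfaces as a Python exception with a traceback, so a silent SIGKILL often points to the host running out of CPU RAM (for example the Linux OOM killer) rather than GPU memory. That diagnosis is an assumption here, since the log above is truncated. A minimal sketch of decoding such a code, where `describe_exit_code` is a hypothetical helper and not part of any library:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a worker exit code as Python's process APIs encode it."""
    if code < 0:
        # A negative code -N means the process was terminated by signal N.
        return f"killed by signal {signal.Signals(-code).name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_exit_code(-9))  # killed by signal SIGKILL
```

If the signal is SIGKILL, checking the kernel log on the affected node (for example with `dmesg`) is a reasonable next step to confirm or rule out the OOM killer.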

OutOfMemoryError: CUDA out of memory when saving weight

Please provide a detailed error log.

Multi-GPU finetune error: binascii.Error: Incorrect padding

> File "/opt/conda/lib/python3.8/site-packages/transformers/deepspeed.py", line 67, in __init__
> super().__init__(config_file_or_dict)
> File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 52, in __init__
> config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")

It looks like reading deepspeed_config_stage3.json failed, but this JSON file should not produce a read error; it has no formatting problems.
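The traceback shows the config argument being passed to `base64.urlsafe_b64decode`, which suggests the string was not treated as an existing file on that process. One possible cause, stated here as an assumption, is that the path does not resolve from the working directory of every rank. A minimal sketch for checking this before launching; `check_config_path` is a hypothetical helper, and the file name in the usage line is taken from the comment above:

```python
import json
import os


def check_config_path(path: str):
    """Return (ok, message) describing whether `path` is a readable JSON file."""
    if not os.path.isfile(path):
        # Relative paths are resolved against the current working directory.
        return False, f"not found from cwd {os.getcwd()}: {path}"
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    return True, "ok"


print(check_config_path("deepspeed_config_stage3.json"))
```

Running the check on each node, from the same directory the training script is launched in, helps separate a path problem from a genuine formatting problem.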

Required library version not found: libbitsandbytes_cuda100_nocublaslt.so.

> How can this be solved? Is it a CUDA version issue?
>
> CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64... CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You...

How much memory does 7B BLOOM need without LoRA? Would three 24 GB 3090s on a single machine be enough?

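> How much memory does 7B BLOOM need without LoRA? Would three 24 GB 3090s on a single machine be enough? Or could it run on two 3090s and two 4090s over a LAN?

Two 3090s should be able to run it with the zero_stage=3 + CPU offload configuration.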
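As a rough back-of-the-envelope check, mixed-precision training with Adam keeps about 16 bytes of model state per parameter: 2 for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 master weights, momentum, and variance. The sketch below applies that rule of thumb; the 7e9 parameter count and 24 GB per card are assumed round figures, activations are excluded, and the config dictionary shows only the ZeRO-related keys rather than the repository's actual file:

```python
def model_state_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough size of weights + gradients + Adam states, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed figures: about 7e9 parameters, 24 GB per RTX 3090.
needed = model_state_gb(7e9)
available = 3 * 24

# ZeRO stage 3 partitions model states across GPUs; CPU offload moves
# optimizer states and parameters into host memory.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    }
}

print(f"model states: {needed:.0f} GB, GPU memory: {available} GB")
# model states: 112 GB, GPU memory: 72 GB
```

Under these assumptions the model states alone exceed the combined GPU memory of three 24 GB cards, which is why offloading to CPU memory is suggested; the host then needs enough RAM to hold the offloaded states.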

How to save the model at every epoch

> When I move the model-saving code below into the epoch loop, if args.output_dir is not None: print_rank_0('saving the final model ...', args.global_rank)#It will overwrite the last epoch model model = convert_lora_to_linear_layer(model)
>
> ```
> if args.global_rank == 0:...
> ```
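The inline comment in the quoted code notes that each save overwrites the previous epoch's model. One common way around that, sketched here under the assumption that the script's save call accepts a target directory, is to derive a separate subdirectory per epoch. `epoch_output_dir` is a hypothetical helper, not a function from the repository:

```python
import os


def epoch_output_dir(output_dir: str, epoch: int) -> str:
    """Create and return a distinct subdirectory for this epoch's checkpoint."""
    path = os.path.join(output_dir, f"epoch_{epoch}")
    os.makedirs(path, exist_ok=True)
    return path
```

Inside the epoch loop, the save call would then receive `epoch_output_dir(args.output_dir, epoch)` in place of `args.output_dir`.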

After finetuning, answers contain a lot of repeated content; direct inference without finetuning does not repeat. The finetune code probably has a problem.

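> https://github.com/LianjiaTech/BELLE/tree/4f84c89372b435bae039b47f1f31078b1c6fc23e/train

Which base model did you finetune? There are several possible causes:

1. A problem with the base model.
2. pad_token_id and eos_token_id are not set correctly. Make sure that pad_token_id=0 and eos_token_id=2 (for LLaMA models, the tokenizer loaded by different transformers versions may yield inconsistent pad_token_id and eos_token_id values).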
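The second cause can be checked before training starts. The sketch below compares a tokenizer's special-token ids against the values given above; `check_special_token_ids` is a hypothetical helper, and the stand-in object only mimics the two attributes a real tokenizer exposes:

```python
from types import SimpleNamespace


def check_special_token_ids(tokenizer, pad_id=0, eos_id=2):
    """Return a list of mismatches between the tokenizer's ids and the expected ones."""
    problems = []
    if tokenizer.pad_token_id != pad_id:
        problems.append(f"pad_token_id={tokenizer.pad_token_id}, expected {pad_id}")
    if tokenizer.eos_token_id != eos_id:
        problems.append(f"eos_token_id={tokenizer.eos_token_id}, expected {eos_id}")
    return problems


# Stand-in object for illustration; a real tokenizer has the same attributes.
fake = SimpleNamespace(pad_token_id=None, eos_token_id=2)
print(check_special_token_ids(fake))  # ['pad_token_id=None, expected 0']
```

An empty list means both ids match. With a tokenizer loaded through transformers, the same function can be called on that object directly.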

After finetuning, answers contain a lot of repeated content; direct inference without finetuning does not repeat. The finetune code probably has a problem.

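> The base model is bloom

All of our experiments with the earlier code were run on A100s. Finetuning the Bloom model on V100s requires changing some parameter settings, so there may be a problem there. We will try to reproduce this issue on V100s with the earlier code.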

Program crashes during the data loading stage

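> Hello, I am training bloom-7b on 8 GPUs with 45G each. I added some Chinese and English monolingual data, about 60 million examples in total, and the run crashed once data loading had reached a little over 300,000 examples. What usually causes this?
>
> `length of train_dataset(after get_train_data): 59719579 100%|██████████| 1/1 [00:00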

The open-source BELLE/train/main.py only supports instruction-following datasets, not ShareGPT-style multi-turn dialogue, so the paper cannot be reproduced.

> Hello, https://github.com/LianjiaTech/BELLE/blob/main/train/reproduce_our_papers/Towards%20Better%20Instruction%20Following%20Language%20Models%20for%20Chinese:%20Investigating%20the%20Impact%20of%20Training%20Data%20and%20Evaluation.md mentions that the main.py described in https://github.com/LianjiaTech/BELLE/blob/main/train/README.md can be used for reproduction.
>
> ![image](https://user-images.githubusercontent.com/233871/233967018-e0873333-2231-4685-8e49-f2393f5a81ac.png)
>
> I looked at BELLE\train\utils\data\raw_datasets.py, and the only dataset handling there is for instruction following. ![image](https://user-images.githubusercontent.com/233871/233967552-c20c8cf3-7740-402f-b7be-45b23fc14b1b.png)
>
> There is no handling for the dialogues shown above. How is multi-turn dialogue data processed?

Yes, we do not currently support processing multi-turn dialogue data. We are very sorry; we will update the multi-turn dialogue processing logic by tomorrow at the latest. Thank you for your interest.