File "/home/ec2-user/SageMaker/LOMO/src/lomo.py", line 114, in func if self.loss_scaler and self.loss_scaler.has_overflow_serial or self.loss_scaler._has_inf_or_nan(p.grad): AttributeError: 'NoneType' object has no attribute '_has_inf_or_nan' , 换成了 llama-alpaca, 报上述错误。
I trained from that commit and modified the eval method, but training still uses too much GPU memory: the 7B model runs out of memory even at max_input_len=1024 (all 4 of the 24G A10 cards in use), and the 33B model runs out of memory even at max_input_len=256 (all 6 of the 40G cards in use).
The OOM happens during training, but a short max_input_len is trainable (33B with max_input_len=64 trains on the 6 40G cards). gradient_checkpointing=True is already set in args_lomo.yaml, yet the memory saving is not noticeable, and I never see anything like the dimension-reduction printout from the binary-classification example.
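One thing worth ruling out is whether the YAML flag actually reaches the model. A quick check using the standard transformers API (the checkpoint path is a placeholder, and whether LOMO wires args_lomo.yaml through to the model this way is my assumption):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/path/to/llama-7b")  # placeholder path
model.gradient_checkpointing_enable()    # force it on, independent of the YAML flag
model.config.use_cache = False           # a live KV cache makes HF disable checkpointing
print(model.is_gradient_checkpointing)   # expect True before training starts
```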
Changes: (1) modified the process part of mydataset.py; the input is one JSON sample per line, roughly:

```python
class MyDataset(Dataset):
    def __init__(self, data_args, tokenizer, split):
        super().__init__()
        self.data_args = data_args
        self.tokenizer = tokenizer
        self.split = split
        self.sample_size = 300000
        # self.sample_size = dataset_info.sample_size
        # self.prompt_type = dataset_info.prompt_type
        ...
```
The dataset has one JSON sample per line; in general, instruction is used as the source and output as the target:

```json
{"instruction": "A conversation takes place between Amy and his or her friend. Kevin responded to his or her friend's questions with everyday, humorous, witty answers. Amy...
```
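For reference, this is roughly how such a file can be read; a minimal sketch with hypothetical names (JsonlInstructionDataset is not the actual mydataset.py class), concatenating instruction and output and truncating to max_input_len:

```python
import json

from torch.utils.data import Dataset


class JsonlInstructionDataset(Dataset):
    """One JSON object per line; `instruction` is the source, `output` the target."""

    def __init__(self, path, tokenizer, max_input_len=1024):
        self.samples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                ex = json.loads(line)
                enc = tokenizer(
                    ex["instruction"] + ex["output"],
                    truncation=True,
                    max_length=max_input_len,
                )
                self.samples.append(enc["input_ids"])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return {"input_ids": self.samples[idx]}
```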
The runs above all used the standard WiC data: 7B model, max_seq_len=2048, per-device train batch size 16, per-device eval batch size 16, all 6 cards in use averaging about 37G of memory per card; 33B model, max_seq_len=1024, per-device train batch size 1, per-device eval batch size 1, ...
So it is strange that with my own dataset, the 33B model at per-device train/eval batch size 1 can only reach max_seq_len=64 before running out of memory.
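A rough back-of-envelope may explain why sequence length dominates here: without effective gradient checkpointing, activation memory grows linearly in seq_len for the linear/MLP parts and quadratically for the attention score matrices, and with LLaMA-33B shapes (60 layers, hidden 6656, 52 heads) even batch size 1 gets large fast. All constants below are order-of-magnitude guesses on my part, not measurements:

```python
# LLaMA-33B shapes; the per-layer constant c is a guess, not a profile.
layers, hidden, heads, bytes_fp16 = 60, 6656, 52, 2

def act_gib(seq, batch=1, c=16):
    # c ~ activations kept per layer in units of batch*seq*hidden without
    # gradient checkpointing; attention scores add a seq**2 term per head.
    linear = c * layers * batch * seq * hidden * bytes_fp16
    attn = layers * batch * heads * seq * seq * bytes_fp16
    return (linear + attn) / 2**30

print(f"seq=64: ~{act_gib(64):.1f} GiB, seq=1024: ~{act_gib(1024):.1f} GiB")
```

Under these guesses, seq=64 stays under a gigabyte of activations while seq=1024 lands in the tens of gigabytes, which would be consistent with only max_seq_len=64 fitting next to the 33B fp16 weight shards.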
I found that some TextGrid files do not contain silence tokens (sil, sp, spn) while other files do. I used the MFA tool with "english_us_arpa english_us_arpa" as the dictionary and acoustic model. Why...
Thanks, that helps a lot.