File "/home/ec2-user/SageMaker/LOMO/src/lomo.py", line 114, in func if self.loss_scaler and self.loss_scaler.has_overflow_serial or self.loss_scaler._has_inf_or_nan(p.grad): AttributeError: 'NoneType' object has no attribute '_has_inf_or_nan' , 换成了 llama-alpaca, 报上述错误。
I trained from that commit and modified the eval method, but training still uses too much GPU memory: the 7B model runs out of memory even at max_input_len=1024 (all 4 of the 24G A10 cards in use), and the 33B model runs out of memory even at max_input_len=256 (all 6 of the 40G cards in use).
The OOM happens during training, but a short max_input_len is trainable (33B with max_input_len=64 trains on the 6 40G cards). gradient_checkpointing=True is already set in args_lomo.yaml, yet the memory saving is not noticeable, and I never see anything like the dimension-reduction printout from the binary-classification example.
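One thing worth ruling out is whether the YAML flag actually reaches the model. A quick check using the standard transformers API (the checkpoint path is a placeholder, and whether LOMO wires args_lomo.yaml through to the model this way is my assumption):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/path/to/llama-7b")  # placeholder path
model.gradient_checkpointing_enable()    # force it on, independent of the YAML flag
model.config.use_cache = False           # a live KV cache makes HF disable checkpointing
print(model.is_gradient_checkpointing)   # expect True before training starts
```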
Changes: (1) modified the process part of mydataset.py; the input is one JSON sample per line, roughly:

```python
class MyDataset(Dataset):
    def __init__(self, data_args, tokenizer, split):
        super().__init__()
        self.data_args = data_args
        self.tokenizer = tokenizer
        self.split = split
        self.sample_size = 300000
        # self.sample_size = dataset_info.sample_size
        # self.prompt_type = dataset_info.prompt_type
        ...
```
The dataset has one JSON sample per line; in general, instruction is used as the source and output as the target:

```json
{"instruction": "A conversation takes place between Amy and his or her friend. Kevin responded to his or her friend's questions with everyday, humorous, witty answers. Amy...
```
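For reference, this is roughly how such a file can be read; a minimal sketch with hypothetical names (JsonlInstructionDataset is not the actual mydataset.py class), concatenating instruction and output and truncating to max_input_len:

```python
import json

from torch.utils.data import Dataset


class JsonlInstructionDataset(Dataset):
    """One JSON object per line; `instruction` is the source, `output` the target."""

    def __init__(self, path, tokenizer, max_input_len=1024):
        self.samples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                ex = json.loads(line)
                enc = tokenizer(
                    ex["instruction"] + ex["output"],
                    truncation=True,
                    max_length=max_input_len,
                )
                self.samples.append(enc["input_ids"])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return {"input_ids": self.samples[idx]}
```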
The runs above all used the standard WiC data: 7B model, max_seq_len=2048, per-device train batch size 16, per-device eval batch size 16, all 6 cards in use averaging about 37G of memory per card; 33B model, max_seq_len=1024, per-device train batch size 1, per-device eval batch size 1, ...
So it is strange that with my own dataset, the 33B model at per-device train/eval batch size 1 can only reach max_seq_len=64 before running out of memory.
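A rough back-of-envelope may explain why sequence length dominates here: without effective gradient checkpointing, activation memory grows linearly in seq_len for the linear/MLP parts and quadratically for the attention score matrices, and with LLaMA-33B shapes (60 layers, hidden 6656, 52 heads) even batch size 1 gets large fast. All constants below are order-of-magnitude guesses on my part, not measurements:

```python
# LLaMA-33B shapes; the per-layer constant c is a guess, not a profile.
layers, hidden, heads, bytes_fp16 = 60, 6656, 52, 2

def act_gib(seq, batch=1, c=16):
    # c ~ activations kept per layer in units of batch*seq*hidden without
    # gradient checkpointing; attention scores add a seq**2 term per head.
    linear = c * layers * batch * seq * hidden * bytes_fp16
    attn = layers * batch * heads * seq * seq * bytes_fp16
    return (linear + attn) / 2**30

print(f"seq=64: ~{act_gib(64):.1f} GiB, seq=1024: ~{act_gib(1024):.1f} GiB")
```

Under these guesses, seq=64 stays under a gigabyte of activations while seq=1024 lands in the tens of gigabytes, which would be consistent with only max_seq_len=64 fitting next to the 33B fp16 weight shards.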
I found that some TextGrid files do not contain silence tokens (sil, sp, spn) while other files do. I used the MFA tool with "english_us_arpa english_us_arpa" as the dictionary and acoustic model. Why...
Thanks, that helps a lot.