shihuai comments

Results: 8 comments of shihuai

Pretrain Hubert on english and chinese speech dataset.

> Can you show the config of your training? I used hubert_base_librispeech.yaml for pretraining and changed only ddp_backend and max_sample_size. ``` common: fp16: true log_format: json log_interval: 200 seed:...

Pretrain Hubert on english and chinese speech dataset.

> We believe that the key to training the HuBERT base model is to look at the performance of the pre-trained model on the main downstream tasks. You can finetune the pre-trained...

Pretrain Hubert on english and chinese speech dataset.

> > > We believe that the key to training the HuBERT base model is to look at the performance of the pre-trained model on the main downstream tasks. You can finetune...

What does the model's training loss curve look like?

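> > Same here: the loss drops quickly from 8.x to 3.x. I have not evaluated the results yet. From the paper, there does not seem to be any special trick. Later I may try changing the generator part to an autoregressive model with CE loss, which might work a bit better? But that makes streaming inference for text + audio hard to solve.
>
> > Found the bug: my target_units setup was wrong. It was padded with 0 by default, but it should be padded with -100.

We currently use the same setting, but convergence is still poor.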
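As a rough illustration of why the padding value for target_units can matter, here is a minimal sketch. It assumes the loss derives each target's valid length by counting entries that differ from an ignore sentinel of -100; that is an assumption about how the training code consumes the targets, not something confirmed in this thread. The helper names `pad_targets` and `valid_lengths` and the toy unit ids are hypothetical.

```python
IGNORE_INDEX = -100  # assumed sentinel; the value mentioned in the thread


def pad_targets(seqs, pad_value):
    """Right-pad variable-length unit sequences to a common length."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]


def valid_lengths(padded, ignore_index=IGNORE_INDEX):
    """Count, per row, the entries that are not the ignore sentinel."""
    return [sum(1 for u in row if u != ignore_index) for row in padded]


# Toy unit ids (hypothetical). Note that 0 is itself a real unit id here.
units = [[5, 0, 7], [3], [0, 2]]

bad = pad_targets(units, pad_value=0)              # padding collides with unit 0
good = pad_targets(units, pad_value=IGNORE_INDEX)  # padding is distinguishable

print(valid_lengths(bad))   # [3, 3, 3]: padded positions are counted as targets
print(valid_lengths(good))  # [3, 1, 2]: only real units are counted
```

With 0 as the pad value, padded positions cannot be told apart from real units, so under this assumption the loss would be computed against targets that are longer than the true sequences.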

What does the model's training loss curve look like?

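> Found another bug. In the source code, the generator's llama uses LlamaDecoderLayer, but (possibly because of my transformers version) the attention mask dimensions do not match. I looked at the llama source: LlamaModel first expands the attention mask to a higher rank. So previously I used LlamaModel directly, passing input_embedding=hidden_states. However, LlamaModel applies a norm to the hidden_states coming out of the LlamaDecoderLayer stack, whereas the llama omni source uses hidden_states that have not been normed. That may be the difference. My loss is now gradually converging from 18 and reached 6.x after half an epoch; it looks like it is still converging. I will share an update once I have a conclusion, hhhh. If you are interested, we can also talk by email: [[email protected]](mailto:[email protected])

We ran into the same issues you describe, and later worked around them by manually expanding the attention mask to 4 dimensions. We have not run inference yet, though, and the results will probably be poor as well.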
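The manual 4-dimensional mask expansion mentioned above can be sketched as follows. This is a plain-Python illustration of the shape logic only, assuming an additive causal mask laid out as [batch][1][query][key] with 0.0 for visible positions and negative infinity for masked ones. The exact convention that LlamaDecoderLayer expects depends on the transformers version, so treat the layout as an assumption; `expand_mask_to_4d` is a hypothetical helper, and real code would build the mask as a tensor.

```python
NEG_INF = float("-inf")


def expand_mask_to_4d(padding_mask):
    """Expand a 2D padding mask [batch][seq] (1 = keep, 0 = pad) into a
    4D additive causal mask [batch][1][query][key]."""
    out = []
    for row in padding_mask:
        seq_len = len(row)
        per_query = []
        for q in range(seq_len):
            # A key position is visible if it is not in the future (k <= q)
            # and it is not a padded position.
            per_query.append(
                [0.0 if (k <= q and row[k] == 1) else NEG_INF
                 for k in range(seq_len)]
            )
        out.append([per_query])  # singleton dimension broadcast over heads
    return out


mask_2d = [[1, 1, 1],
           [1, 1, 0]]  # second sequence has one right-padded position
mask_4d = expand_mask_to_4d(mask_2d)

shape = (len(mask_4d), len(mask_4d[0]), len(mask_4d[0][0]), len(mask_4d[0][0][0]))
print(shape)             # (2, 1, 3, 3)
print(mask_4d[0][0][1])  # [0.0, 0.0, -inf]: query 1 cannot see future key 2
```

The singleton second dimension lets a single mask broadcast across all attention heads.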