Yujia SUN

6 comments by Yujia SUN

Hi @hirofumi0810 I have found that transformer_enc_pe_type 'add' is used with an lc_type of 'reshape' in the mma example 'lc_transformer_mma_hie_subsample8_ma4H_ca4H_w16_from4L_64_64_32.yaml'. With this config, different chunks would have the same position encoding. Don't...

Yes, I have modified prepare.sh and removed everything related to the L and M subsets. And I think the disk space needed is more than just three times. Because even if we work on a subset...

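Sorry, I will use Chinese. I looked at the compute_and_store_features_batch code in lhotse/cut.py. Even when we work on the S subset, the code reads in the full opus data and runs feature extraction on all of it (the full opus data covers S, M, L, and the non-speech segments). Accordingly, the resulting feature matrices are all stored to disk, with no distinction between subsets. Later on, the code additionally generates and stores a manifest file for the working subset. As a result, disk usage is abnormally high even for the S subset.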

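Hmm... I understand all of that; maybe I didn't express myself clearly. My understanding is that since I only want to run experiments on the S set, the features I store should only be those of the cuts belonging to the S set. However, the current code (see the figure above) stores the features of everything (the features of the unsegmented audio) to disk. Look at the features I stored below: the S set is split into 1000 parts, and a single part is already 3.6G, which is clearly wrong. ![image](https://user-images.githubusercontent.com/11547738/176820828-1228c661-0158-42df-8506-341a18f457f6.png) Also, manifest_path seems to be used for writing intermediate results... ![image](https://user-images.githubusercontent.com/11547738/176820922-8fae218b-2218-4305-bfe4-fbae90986cf3.png)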

Hi, @pzelasko. Sorry for using Chinese. The problem is that the generated feats_S_0001.lca file is quite large when I process the wenetspeech S subset. As the figure below shows, the feats...

@luomingshuang That's cool. Thanks for your solution. I will try.