wenetspeech occupies too much disk space even when working on subset S
Hi, I am working on the icefall wenetspeech egs. When processing the wenetspeech dataset, I found that all of the raw audio features are saved to disk, which may take ~10 TB of space. Is there any way to save only the features for the supervision cuts, since I just want to work on subset S?
Thanks
I think you can remove the scripts for processing the M and L subsets and just keep the scripts for processing S. Yes, if you include the L subset feature files, the disk usage will be large. At the same time, we do a time-domain augmentation, so the data will be three times larger.
You can do it in egs/wenetspeech/ASR/prepare.sh.
Yes, I have modified prepare.sh and removed the parts about the L and M subsets. But I think the disk usage is not just three times larger: even when we work on subset S, the features of the raw long audio are saved to disk, see line 4866 of lhotse/cut.py. You can see that the feat_mat being saved there is the feature matrix of the full audio.
Sorry, I will use Chinese. I looked at the compute_and_store_features_batch part of lhotse/cut.py. Even when we are working on the S subset, the code reads in the full opus data and computes features on all of it (the full opus data contains S, M, L and the non-speech segments). Accordingly, that full feature matrix is stored to disk without distinguishing between subsets. Only afterwards does the code generate and store the manifest file corresponding to the working subset. As a result, even for the S subset the disk usage becomes abnormally high.
Em... in the wenetspeech dataset, the audio data itself has no S, M, L labels. Those labels are in WenetSpeech.json, and after processing with lhotse they end up in the supervision.jsonl.gz file.
The command that generates the S features is this part:
if [ $stage -le 8 ] && [ $stop_stage -ge 8 ]; then
  log "Stage 8: Compute features for S"
  python3 ./local/compute_fbank_wenetspeech_splits.py \
    --training-subset S \
    --num-workers 20 \
    --batch-duration 600 \
    --start 0 \
    --num-splits $num_splits
fi
In this block there is a --training-subset S argument, which looks up the manifest files corresponding to S.
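A rough sketch of what that lookup amounts to; the directory layout and file naming below are assumptions pieced together from the paths shown later in this thread (e.g. data/fbank/S_split_1000/...), not the literal icefall code:
# Hypothetical sketch of how --training-subset selects a split manifest to featurize.
from lhotse import CutSet

subset = "S"        # value of --training-subset
num_splits = 1000   # value of --num-splits
idx = 1             # one of the num_splits pieces

cuts_path = f"data/fbank/{subset}_split_{num_splits}/cuts_{subset}_raw.{idx:04d}.jsonl.gz"
cut_set = CutSet.from_file(cuts_path)  # this cut set is then passed to compute_and_store_features_batch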
def compute_and_store_features_batch(
    self,
    extractor: FeatureExtractor,
    storage_path: Pathlike,
    manifest_path: Optional[Pathlike] = None,
    batch_duration: Seconds = 600.0,
    num_workers: int = 4,
    augment_fn: Optional[AugmentFn] = None,
    storage_type: Type[FW] = LilcomChunkyWriter,
    overwrite: bool = False,
) -> "CutSet":
    """
    Extract features for all cuts in batches.
    This method is intended for use with compatible feature extractors that
    implement an accelerated :meth:`~lhotse.FeatureExtractor.extract_batch` method.
    For example, ``kaldifeat`` extractors can be used this way (see, e.g.,
    :class:`~lhotse.KaldifeatFbank` or :class:`~lhotse.KaldifeatMfcc`).
This function has a manifest_path parameter; it is the path of the data, passed to the function, for which the features are to be computed.
Em... I understand all of that; maybe I did not express myself clearly. Since I only want to run experiments on the S subset, I think the stored features should only be the cuts belonging to S. However, the current code (as shown above) stores the features of the full, un-split audio to disk. Look at the features I stored below: the S subset is split into 1000 pieces, and a single piece is already 3.6 GB, which is clearly wrong.
Also, manifest_path seems to be used to write intermediate results...
I assume you guys have got this, but if you need me to chime in and help with sth just ping me (unfortunately I can't read Chinese but it'd be cool to learn it one day).
Hi, @pzelasko. Sorry for using Chinese.
The problem is that the generated feats_S_0001.lca file is quite large when I process the wenetspeech S subset. As shown in the figure below, the saved feats for one S-subset split are around 2 GB. Since we have 1000 splits, the disk occupation will be about 2 TB for the S subset.
So I read the code further, and I found that line 4866 of lhotse/cut.py saves the full feats to disk without cutting, even when we are working on the S subset, which is quite strange, as shown in the figure below.
I am not sure whether you have any idea how to make this feature-saving strategy more efficient, because it occupies too much disk space even when we are working on the S subset. Thanks.
This doesn't look right to me, as the S subset has only 100h of speech. Can you show the output of lhotse cut describe <path-to-cuts.jsonl> for the manifest for which you are computing the features?
$ lhotse cut describe ./data/fbank/cuts_S_raw.jsonl.gz
Cuts count: 130991
Total duration (hours): 61099.6
Speech duration (hours): 302.3 (0.5%)
***
Duration statistics (seconds):
mean 1679.2
std 1247.2
min 12.6
25% 588.0
50% 1475.9
75% 2636.7
99% 5748.9
99.5% 6319.2
99.9% 8416.8
max 15827.8
I think there must be some issues with the preparation of the S subset.
@pkufool
As the wenetspeech recipe was added by you, could you take a look?
I also ran a test on my machine.
>> lhotse cut describe ./data/fbank/cuts_S_raw.jsonl.gz
Cuts count: 130991
Total duration (hours): 61099.6
Speech duration (hours): 302.3 (0.5%)
***
Duration statistics (seconds):
mean 1679.2
std 1247.2
min 12.6
25% 588.0
50% 1475.9
75% 2636.7
99% 5748.9
99.5% 6319.2
99.9% 8416.8
max 15827.8
>> lhotse cut describe ./data/fbank/cuts_S.jsonl.gz
Cuts count: 454779
Total duration (hours): 302.2
Speech duration (hours): 302.2 (100.0%)
***
Duration statistics (seconds):
mean 2.4
std 1.8
min 0.2
25% 1.4
50% 2.0
75% 2.9
99% 8.0
99.5% 8.7
99.9% 11.9
max 405.1
Here, I use vim to look at the file cuts_S_raw.0001.jsonl.gz, and it shows as follows:
As we can see in the above picture, each long monocut includes many small cuts. In fact, these small cuts are the useful ones that are actually used for training.
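To make this concrete without the screenshot, here is a small illustrative snippet; the commented values are only what the statistics above suggest, not measured output:
# Illustrative only: inspect one long raw MonoCut and its supervisions.
from lhotse import CutSet

cut_set = CutSet.from_file("./data/fbank/cuts_S_raw.jsonl.gz")
cut = next(iter(cut_set))
print(cut.duration)           # ~1679 s on average, per the statistics above
print(len(cut.supervisions))  # many short transcribed segments inside one long cut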
I also had a look at the function compute_and_store_features_batch. It extracts the features for the long monocuts and saves them, so the saved feature files become very big.
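A back-of-the-envelope estimate shows why; it assumes an 80-dim fbank at 100 frames per second and roughly 1 byte per value after lilcom compression, so the real numbers will differ somewhat:
# Rough storage estimate based on the `lhotse cut describe` outputs above.
def feat_size_gib(hours, dims=80, frames_per_sec=100, bytes_per_value=1.0):
    return hours * 3600 * frames_per_sec * dims * bytes_per_value / 2**30

print(feat_size_gib(61099.6))  # untrimmed cuts_S_raw: ~1600 GiB, i.e. on the order of 2 TB
print(feat_size_gib(302.2))    # trimmed cuts_S: ~8 GiB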
At the same time, in the file https://github1s.com/k2-fsa/icefall/blob/HEAD/egs/wenetspeech/ASR/local/compute_fbank_wenetspeech_splits.py, cut_set.compute_and_store_features_batch is called before cut_set.trim_to_supervisions, such as:
cut_set = cut_set.compute_and_store_features_batch(
    extractor=extractor,
    storage_path=f"{output_dir}/feats_{subset}_{idx}",
    num_workers=args.num_workers,
    batch_duration=args.batch_duration,
    storage_type=LilcomChunkyWriter,
)
logging.info("About to split cuts into smaller chunks.")
cut_set = cut_set.trim_to_supervisions(
    keep_overlapping=False, min_duration=None
)
So, the solution to this problem is to call cut_set.trim_to_supervisions before cut_set.compute_and_store_features_batch, such as:
logging.info("About to split cuts into smaller chunks.")
cut_set = cut_set.trim_to_supervisions(
    keep_overlapping=False, min_duration=None
)
cut_set = cut_set.compute_and_store_features_batch(
    extractor=extractor,
    storage_path=f"{output_dir}/feats_{subset}_{idx}",
    num_workers=args.num_workers,
    batch_duration=args.batch_duration,
    storage_type=LilcomChunkyWriter,
)
Of course, the two orderings above don't influence our training results, but they do influence the size of the saved feature files. @SoonSYJ, maybe you can give it a try. I will also try it and change the code in icefall.
I have tested the method described above. The results are as follows:
>>ls -lht data/fbank/S_split_1000_ori/feats_S_0001.h5
-rw-r--r-- 1 luomingshuang luomingshuang 2.9G Apr 19 11:05 data/fbank/S_split_1000_ori/feats_S_0001.h5
>> ls -lht data/fbank/S_split_1000/feats_S_0001.h5
-rw-r--r-- 1 luomingshuang luomingshuang 13M Jul  4 11:19 data/fbank/S_split_1000/feats_S_0001.h5
The 2.9G is the size of the saved feature file before changing the code, and the 13M is the size after changing the code. I will modify the code in icefall.
@luomingshuang
I also ran a test on my machine.
Is that normal? The output says the S subset has more than 61k hours of data.
The useful duration is Speech duration (hours): 302.3 (0.5%).
After trim_to_supervisions, it (including time-domain augmentation) shows:
>> lhotse cut describe ./data/fbank/cuts_S.jsonl.gz
Cuts count: 454779
Total duration (hours): 302.2
Speech duration (hours): 302.2 (100.0%)
***
Duration statistics (seconds):
mean 2.4
std 1.8
min 0.2
25% 1.4
50% 2.0
75% 2.9
99% 8.0
99.5% 8.7
99.9% 11.9
max 405.1
So it is normal.
Probably the issue is that the S subset may contain short segments of long files, and if we are saving the entire audio for long files, that is not right. It may even be duplicating audio files if the same file appears, in different segments, in different subsets. (I don't know the details, it may even repeat audio files within the same subset.)
Yes, so here we should call cut_set.trim_to_supervisions to split the long cuts into short cuts before computing and saving features.
@luomingshuang That's cool. Thanks for your solution. I will try.
Yeah, this is right. Apparently WenetSpeech provides lots of very long recordings with partial transcription. If you are not going to use the untranscribed/unlabeled parts of the recordings, it's best to trim to supervisions first.
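For reference, a minimal end-to-end sketch of that recommended order, assuming a kaldifeat fbank extractor and illustrative paths rather than the exact icefall script:
# Minimal sketch, assuming kaldifeat is installed; paths and settings are illustrative.
from lhotse import CutSet, KaldifeatFbank, KaldifeatFbankConfig, LilcomChunkyWriter

cut_set = CutSet.from_file("data/fbank/cuts_S_raw.jsonl.gz")

# 1) Keep only the transcribed regions, so untranscribed audio is never featurized.
cut_set = cut_set.trim_to_supervisions(keep_overlapping=False, min_duration=None)

# 2) Only then compute and store the features.
extractor = KaldifeatFbank(KaldifeatFbankConfig())
cut_set = cut_set.compute_and_store_features_batch(
    extractor=extractor,
    storage_path="data/fbank/feats_S",
    batch_duration=600.0,
    num_workers=4,
    storage_type=LilcomChunkyWriter,
)
cut_set.to_file("data/fbank/cuts_S.jsonl.gz")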