lhotse: How much shared memory and disk space do I need to process the S subset of the WenetSpeech dataset?

Is this insufficient shm or insufficient disk space? Here is my docker info:

Aug 29 '23 06:08 SaltedSlark

I’m guessing this is related to IPC of data loading workers for batch feat computation and could be related to too many workers/too large batches; but judging by the warning about max_duration, did you trim your cut set to supervisions? Can you show the output “lhotse cut describe cuts.jsonl.gz”? I think you might be computing features for very long cuts (and you probably don’t need this).

Aug 29 '23 11:08 pzelasko

Thanks for your reply! I changed num_workers to 0, and this happened:

/bin/bash: /home/zj/anaconda3/envs/vall-e/lib/libtinfo.so.6: no version information available (required by /bin/bash)
2023-08-30 10:26:50 (prepare.sh:59:main) Stage 1: Prepare wenetspeech manifest
2023-08-30 10:26:50 (prepare.sh:71:main) Stage 2: Tokenize/Fbank wenetspeech
2023-08-30 10:27:06,501 INFO [tokenizer.py:160] dataset_parts: ['S'] manifests {'S': {'recordings': RecordingSet(len=43664), 'supervisions': SupervisionSet(len=151600)}}
2023-08-30 10:27:06,507 INFO [tokenizer.py:167] Processing partition: S CUDA: True
Computing features in batches:   0%|                                                      | 0/43664 [00:00<?, ?it/s]/home/zj/workspace/TTS/lhotse/lhotse/dataset/sampling/simple.py:216: UserWarning: The first cut drawn in batch collection violates the max_frames, max_cuts, or max_duration constraints - we'll return it anyway. Consider increasing max_frames/max_cuts/max_duration.
  warnings.warn(
Computing features in batches:   0%|                                                      | 0/43664 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "/home/zj/workspace/TTS/vall-e/egs/wenetspeech/bin/tokenizer.py", line 268, in <module>
    main()
  File "/home/zj/workspace/TTS/vall-e/egs/wenetspeech/bin/tokenizer.py", line 204, in main
    cut_set = cut_set.compute_and_store_features_batch(
  File "/home/zj/workspace/TTS/lhotse/lhotse/cut/set.py", line 2308, in compute_and_store_features_batch
    features = extractor.extract_batch(
  File "/home/zj/workspace/TTS/vall-e/valle/data/tokenizer.py", line 348, in extract_batch
    encoded_frames = self.tokenizer.encode(samples.detach().to(device))
  File "/home/zj/workspace/TTS/vall-e/valle/data/tokenizer.py", line 239, in encode
    return self.codec.encode(wav.to(self.device))
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 144, in encode
    encoded_frames.append(self._encode_frame(frame))
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 161, in _encode_frame
    emb = self.encoder(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 144, in forward
    return self.model(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 63, in forward
    return self.shortcut(x) + self.block(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 204, in forward
    x = pad1d(x, (padding_total, extra_padding), mode=self.pad_mode)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 92, in pad1d
    padded = F.pad(x, paddings, mode, value)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.14 GiB (GPU 0; 23.65 GiB total capacity; 21.73 GiB already allocated; 104.06 MiB free; 21.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does this mean that the RecordingSet or SupervisionSet entries are too long for my GPU (RTX 4090, 24 GB)? What should I do to avoid this?

Aug 30 '23 02:08 SaltedSlark

Try cuts = cuts.trim_to_supervisions() before feature extraction and then you can also use multiple workers again.

Aug 30 '23 02:08 pzelasko

Thanks! Like this? [before and after screenshots not preserved]

Aug 30 '23 02:08 SaltedSlark

Yeah

Aug 30 '23 13:08 pzelasko

Thanks! I ran into another problem when I tried to train my VALL-E model on the S subset. I have no idea what is wrong. Looking forward to your reply, much love!

Sep 01 '23 02:09 SaltedSlark

Looks like not every training example has features extracted. Make sure you passed the path to the right cut set (with features). You can also check ‘lhotse cut describe ’ it will show you some stats about the data.

Sep 01 '23 03:09 pzelasko

Okay, and here is the status of my cut_train.jsonl.gz: [image] https://user-images.githubusercontent.com/32287808/264909381-b549ca50-76fa-4259-bec8-7c886e7a2e73.png It looks like the number of features is much smaller than the cut count. Is something wrong, and why did it happen?

Sep 01 '23 03:09 SaltedSlark

I combined two sets to get the cut_train set, and I found that one of them has 0 features...

Sep 01 '23 03:09 SaltedSlark

Silence is over 90%??

Sep 01 '23 03:09 danpovey

It looks so weird, and I don't know what's wrong.

Sep 01 '23 03:09 SaltedSlark

Look at the jsonl file

Sep 01 '23 04:09 danpovey

Perhaps one of the cut sets you combined did not have features computed. Also, judging by the mean duration of 1600s, you did not call .trim_to_supervisions() on this cutset.
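One quick way to check both points per manifest is to read the JSONL file directly. The following is a minimal sketch, not part of lhotse: it assumes each line of the cuts file is a JSON object that carries a "duration" key and, when features were stored, a non-null "features" key. The function name summarize_cuts is hypothetical.

```python
import gzip
import json


def summarize_cuts(path):
    """Count cuts, cuts that carry features, and the mean cut duration.

    Assumption: one JSON object per line, with an optional "features"
    entry and a "duration" entry given in seconds.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    total = 0
    with_features = 0
    total_duration = 0.0
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            cut = json.loads(line)
            total += 1
            total_duration += float(cut.get("duration", 0.0))
            if cut.get("features") is not None:
                with_features += 1
    mean_duration = total_duration / total if total else 0.0
    return {
        "cuts": total,
        "with_features": with_features,
        "mean_duration": mean_duration,
    }
```

Calling summarize_cuts on each manifest before combining them should show which one has no features, and a mean duration in the hundreds or thousands of seconds would suggest that the cuts were not trimmed to supervisions.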

Sep 02 '23 14:09 pzelasko

Thank you so much, @pzelasko @danpovey! I'll try.

Sep 04 '23 01:09 SaltedSlark

@pzelasko As for the M subset, I am sure that I called .trim_to_supervisions, as I showed. I found that "Supervisions available" does not match "Features available", and this seems to cause a validation error when validate() is called.

Sep 04 '23 06:09 SaltedSlark

The detailed description of this function mentions that keep_overlapping would keep the numbers matched.

Result on S subset:

Sep 04 '23 11:09 Jiang-Stan

You either need to use keep_overlapping=False or filter out the cuts that have overlapping speech (whichever makes sense for your use case).
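To illustrate why the counts diverge, here is a toy model in plain Python. It does not call lhotse and only mimics the idea under stated assumptions: supervisions are (start, end) tuples within one long cut, trimming yields one output cut per supervision, and with keep_overlapping=True each output cut also keeps every supervision that overlaps its span. The names overlaps and toy_trim are hypothetical.

```python
def overlaps(a, b):
    """Return True if two (start, end) spans share any time."""
    return a[0] < b[1] and b[0] < a[1]


def toy_trim(supervisions, keep_overlapping):
    """Toy model of trimming: one output cut per supervision."""
    cuts = []
    for sup in supervisions:
        if keep_overlapping:
            # Keep every supervision that overlaps this cut's span.
            kept = [s for s in supervisions if overlaps(s, sup)]
        else:
            # Keep only the supervision the cut was created from.
            kept = [sup]
        cuts.append({"span": sup, "supervisions": kept})
    return cuts


def count_supervisions(cuts):
    """Total number of supervisions carried by all cuts."""
    return sum(len(c["supervisions"]) for c in cuts)


# Two overlapping segments and one isolated segment.
sups = [(0.0, 5.0), (4.0, 8.0), (10.0, 12.0)]

with_overlap = toy_trim(sups, keep_overlapping=True)
without_overlap = toy_trim(sups, keep_overlapping=False)

# Alternative: keep overlapping supervisions, then drop cuts with overlap.
filtered = [c for c in with_overlap if len(c["supervisions"]) == 1]

print(len(with_overlap), count_supervisions(with_overlap))        # 3 5
print(len(without_overlap), count_supervisions(without_overlap))  # 3 3
print(len(filtered), count_supervisions(filtered))                # 1 1
```

In this model, only the second and third variants end up with exactly one supervision per cut, which corresponds to the two options suggested above.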

Sep 04 '23 12:09 pzelasko

@SaltedSlark Hi, how long did preprocessing the WenetSpeech M set take you? Extracting features took me 50 minutes, but saving to wenetspeech_cuts_M.jsonl.gz has taken over 11 hours and still has not finished.

@pzelasko Is there any parallelization available for this function? I tried to preprocess the WenetSpeech M set last night; it spent over 11 hours in this function without finishing (the progress bar showed 50 minutes before I interrupted it from the keyboard). I have successfully preprocessed the WenetSpeech S set twice with the same num_workers, and the saving time there was negligible, so I guess this is not a lock issue. Using htop, I can see that only one CPU is used for saving.

Sep 05 '23 03:09 Jiang-Stan

For me, it took about 80 hours to process the M subset, and I would also like to know how to speed it up!

Sep 05 '23 03:09 SaltedSlark

I'll try again.

Sep 05 '23 03:09 SaltedSlark

I noticed from here that only one thread is used to save the data. I tried using 32 threads, but the save still could not finish. @pzelasko

By splitting the recordings and supervisions in the manifest into smaller sets, I successfully generated wenetspeech_cuts_M_{i}.jsonl.gz (i = 0 to 9) within an hour. Since the recordings and supervisions are saved in the same sequential order, matching them does not take long. @SaltedSlark

Sep 05 '23 10:09 Jiang-Stan

Thanks! But I don't know how to separate the recordings and supervisions in the manifest. I need your help, bro.

Sep 06 '23 01:09 SaltedSlark

from lhotse.audio import RecordingSet
from lhotse.recipes.utils import read_manifests_if_cached
from lhotse.supervision import SupervisionSet
from tqdm import tqdm

# Fragment: `args` and `dataset_parts` are assumed to be defined by the
# surrounding script.
manifests = read_manifests_if_cached(
    dataset_parts=dataset_parts,
    output_dir=args.src_dir,
    prefix=args.prefix,
    suffix=args.suffix,
    types=["recordings", "supervisions", "cuts"],
)

if args.prefix == "wenetspeech" and ("M" in manifests or "L" in manifests):
    # Split the large subset into 10 (M) or 100 (L) smaller manifests.
    separate_num = 10 if "M" in manifests else 100
    name = "M" if "M" in manifests else "L"
    origin_manifest = manifests.pop(name)
    recordings = list(origin_manifest["recordings"])
    supervisions = list(origin_manifest["supervisions"])
    start_idx = 0
    sup_idx = 0  # index of the next supervision not yet assigned
    for i in tqdm(range(separate_num)):
        subset_name = name + str(i)
        end_idx = len(recordings) * (i + 1) // separate_num
        cur_recordings = recordings[start_idx:end_idx]
        cur_supervisions = []
        for r in cur_recordings:
            # Supervisions are stored in the same order as recordings, so
            # consume them while they belong to the current recording.
            while (
                sup_idx < len(supervisions)
                and supervisions[sup_idx].recording_id == r.id
            ):
                cur_supervisions.append(supervisions[sup_idx])
                sup_idx += 1

        manifests[subset_name] = {
            "recordings": RecordingSet.from_recordings(cur_recordings),
            "supervisions": SupervisionSet.from_segments(cur_supervisions),
        }
        start_idx = end_idx
    # Every supervision must have been assigned to some subset.
    assert sup_idx == len(supervisions)

Sep 06 '23 02:09 Jiang-Stan

Some tips:

Splitting a cut/recording/supervision set into smaller parts can be done with parts = cuts.split(num_parts), e.g.:

In [4]: cuts
Out[4]: CutSet(len=1519) [underlying data type: <class 'lhotse.lazy.LazyManifestIterator'>]

In [8]: cuts.split(2)
Out[8]:
[CutSet(len=760) [underlying data type: <class 'dict'>],
 CutSet(len=759) [underlying data type: <class 'dict'>]]

cuts.compute_and_store_features_batch is bottlenecked by I/O in 99% of the use cases, since feature extraction is usually much quicker than data loading. Try to set the highest possible batch_duration first, and then keep increasing num_workers until you start seeing crashes, freezes, or slowdowns.
If you're computing features on CPUs or have multiple GPUs, it's generally a good idea to split a single large cut set into parts, as suggested earlier, and run multiple scripts processing these parts in parallel. For CPU-based computation, generally prefer compute_and_store_features, though, as it supports built-in parallelization across CPUs (unlike the batch version).
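The splitting idea can be sketched without lhotse. The helper below is a hypothetical stand-in, not the library's implementation: it cuts a list into contiguous, near-equal parts, which reproduces the 760/759 sizes shown in the example above for 1519 items. Each part could then be handed to a separate process or GPU.

```python
def split_evenly(items, num_parts):
    """Split items into num_parts contiguous parts of near-equal size.

    The first (len(items) % num_parts) parts receive one extra item.
    """
    items = list(items)
    base, extra = divmod(len(items), num_parts)
    parts = []
    start = 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


# Stand-in for 1519 cut IDs; real code would use the cut set itself.
cut_ids = [f"cut-{i}" for i in range(1519)]
parts = split_evenly(cut_ids, 2)
print([len(p) for p in parts])  # [760, 759]
```

Keeping the parts contiguous preserves the original order, so the per-part outputs can be concatenated back in sequence afterwards.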

Sep 07 '23 14:09 pzelasko

So is it possible to compute features on the fly with the function compute_and_store_features_batch?

Apr 16 '24 07:04 OswaldoBornemann

I didn’t get your question, please elaborate.

Apr 17 '24 13:04 pzelasko

Sorry for my incomplete question. My question is whether we can compute the features on the fly during training instead of storing them, because in my case I don't have such a large GPU for training.

Apr 18 '24 02:04 OswaldoBornemann

Yes, you can compute the features inside the PyTorch dataset class. See OnTheFlyFeatures or K2SpeechRecognitionDataset for some examples. You can also look up k2-fsa/icefall repo for recipes that support this.

Apr 18 '24 11:04 pzelasko

That's great. I will try to revise it. Thanks a lot.

Apr 19 '24 01:04 OswaldoBornemann
