How much shared memory and disk space do I need to process the S subset of the WenetSpeech dataset?

Open SaltedSlark opened this issue 1 year ago • 29 comments

Insufficient shm: image. Insufficient disk space? image. Here is my Docker info: image

SaltedSlark avatar Aug 29 '23 06:08 SaltedSlark

I’m guessing this is related to IPC of data loading workers for batch feat computation and could be related to too many workers/too large batches; but judging by the warning about max_duration, did you trim your cut set to supervisions? Can you show the output “lhotse cut describe cuts.jsonl.gz”? I think you might be computing features for very long cuts (and you probably don’t need this).

pzelasko avatar Aug 29 '23 11:08 pzelasko

Thanks for your reply! I set num_workers to 0, and this happened:

/bin/bash: /home/zj/anaconda3/envs/vall-e/lib/libtinfo.so.6: no version information available (required by /bin/bash)
2023-08-30 10:26:50 (prepare.sh:59:main) Stage 1: Prepare wenetspeech manifest
2023-08-30 10:26:50 (prepare.sh:71:main) Stage 2: Tokenize/Fbank wenetspeech
2023-08-30 10:27:06,501 INFO [tokenizer.py:160] dataset_parts: ['S'] manifests {'S': {'recordings': RecordingSet(len=43664), 'supervisions': SupervisionSet(len=151600)}}
2023-08-30 10:27:06,507 INFO [tokenizer.py:167] Processing partition: S CUDA: True
Computing features in batches:   0%|                                                      | 0/43664 [00:00<?, ?it/s]/home/zj/workspace/TTS/lhotse/lhotse/dataset/sampling/simple.py:216: UserWarning: The first cut drawn in batch collection violates the max_frames, max_cuts, or max_duration constraints - we'll return it anyway. Consider increasing max_frames/max_cuts/max_duration.
  warnings.warn(
Computing features in batches:   0%|                                                      | 0/43664 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "/home/zj/workspace/TTS/vall-e/egs/wenetspeech/bin/tokenizer.py", line 268, in <module>
    main()
  File "/home/zj/workspace/TTS/vall-e/egs/wenetspeech/bin/tokenizer.py", line 204, in main
    cut_set = cut_set.compute_and_store_features_batch(
  File "/home/zj/workspace/TTS/lhotse/lhotse/cut/set.py", line 2308, in compute_and_store_features_batch
    features = extractor.extract_batch(
  File "/home/zj/workspace/TTS/vall-e/valle/data/tokenizer.py", line 348, in extract_batch
    encoded_frames = self.tokenizer.encode(samples.detach().to(device))
  File "/home/zj/workspace/TTS/vall-e/valle/data/tokenizer.py", line 239, in encode
    return self.codec.encode(wav.to(self.device))
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 144, in encode
    encoded_frames.append(self._encode_frame(frame))
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/model.py", line 161, in _encode_frame
    emb = self.encoder(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 144, in forward
    return self.model(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/seanet.py", line 63, in forward
    return self.shortcut(x) + self.block(x)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 204, in forward
    x = pad1d(x, (padding_total, extra_padding), mode=self.pad_mode)
  File "/home/zj/anaconda3/envs/vall-e/lib/python3.10/site-packages/encodec/modules/conv.py", line 92, in pad1d
    padded = F.pad(x, paddings, mode, value)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.14 GiB (GPU 0; 23.65 GiB total capacity; 21.73 GiB already allocated; 104.06 MiB free; 21.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does this mean that the RecordingSet or SupervisionSet is too long for my GPU (RTX 4090, 24 GB)? What should I do to avoid this?

SaltedSlark avatar Aug 30 '23 02:08 SaltedSlark

Try cuts = cuts.trim_to_supervisions() before feature extraction and then you can also use multiple workers again.
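A minimal sketch of this suggestion, assuming `cuts` is a lhotse `CutSet` built from the recording and supervision manifests. The helper name `trim_and_report` is hypothetical; only `trim_to_supervisions()` and the per-cut `duration` attribute are taken from lhotse. It is written against any object exposing those two members, so it can be tried on a small stand-in before touching the real data.

```python
def trim_and_report(cuts):
    """Trim cuts to their supervisions and print the longest cut before/after.

    `cuts` is expected to be a lhotse CutSet (or any iterable of objects with a
    `.duration` in seconds that also offers `trim_to_supervisions()`).
    """
    # Untrimmed cuts span whole recordings, which is what exhausts GPU memory.
    longest_before = max((c.duration for c in cuts), default=0.0)

    # One cut per supervision segment; recordings themselves are not modified.
    trimmed = cuts.trim_to_supervisions()

    longest_after = max((c.duration for c in trimmed), default=0.0)
    print(f"longest cut: {longest_before:.1f}s -> {longest_after:.1f}s")
    return trimmed
```

The returned cut set is what would then be passed to `compute_and_store_features_batch`. On a large manifest the two `max(...)` passes each scan the data once, so they can be dropped once the durations look reasonable.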

pzelasko avatar Aug 30 '23 02:08 pzelasko

Thanks! Like this? Before: image. After: image

SaltedSlark avatar Aug 30 '23 02:08 SaltedSlark

Yeah

pzelasko avatar Aug 30 '23 13:08 pzelasko

Thanks! I ran into another problem when training my VALL-E model on the S subset: image. I have no idea what is wrong. Looking forward to your reply, much love!

SaltedSlark avatar Sep 01 '23 02:09 SaltedSlark

Looks like not every training example has features extracted. Make sure you passed the path to the right cut set (with features). You can also run ‘lhotse cut describe ’ on it; it will show you some stats about the data.

pzelasko avatar Sep 01 '23 03:09 pzelasko

Okay, here are the stats of my cut_train.jsonl.gz: image (https://user-images.githubusercontent.com/32287808/264909381-b549ca50-76fa-4259-bec8-7c886e7a2e73.png). It looks like the number of features is much smaller than the number of cuts. Is something wrong? Why did this happen?

SaltedSlark avatar Sep 01 '23 03:09 SaltedSlark

I combined two sets to get the cut_train set, and I found that one of them has 0 features: image

SaltedSlark avatar Sep 01 '23 03:09 SaltedSlark

Silence is over 90%??

danpovey avatar Sep 01 '23 03:09 danpovey

... looks so weird ..., and I don't know what's wrong.

SaltedSlark avatar Sep 01 '23 03:09 SaltedSlark

Look at the jsonl file
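One way to do that without any lhotse tooling is to read the gzipped JSON Lines manifest directly. This is a rough sketch: it assumes one JSON object per cut per line, and that cuts with extracted features carry a non-null `features` field and a `duration` in seconds. Those key names are assumptions to check against a real line of your file, which is why the first entries are printed. `summarize_manifest` is a hypothetical helper name.

```python
import gzip
import json


def summarize_manifest(path, preview=1):
    """Count cuts with and without a `features` entry in a .jsonl.gz manifest."""
    total = 0
    with_features = 0
    total_duration = 0.0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            cut = json.loads(line)
            if total < preview:
                # Print the first raw entries so the key names can be verified.
                print(json.dumps(cut, ensure_ascii=False)[:500])
            total += 1
            total_duration += float(cut.get("duration", 0.0))
            if cut.get("features") is not None:
                with_features += 1
    return {
        "cuts": total,
        "with_features": with_features,
        "mean_duration": total_duration / total if total else 0.0,
    }
```

A `with_features` count far below `cuts`, or a mean duration in the hundreds of seconds, points at entries that were never trimmed or never went through feature extraction.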

danpovey avatar Sep 01 '23 04:09 danpovey

It looks like the number of features is much smaller than the number of cuts. Is something wrong? Why did this happen? I combined two sets to get the cut_train set, and I found that one of them has 0 features...

Perhaps one of the cut sets you combined did not have features computed. Also, judging by the mean duration of 1600s, you did not call .trim_to_supervisions() on this cutset.
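Both points can be checked per source set before combining them. The sketch below is an illustration under assumptions: `feature_coverage` is a hypothetical helper, and it relies only on each cut exposing `has_features` and `duration`, as lhotse cuts do.

```python
def feature_coverage(named_cut_sets):
    """Return {name: (cuts_with_features, total_cuts, mean_duration_seconds)}.

    `named_cut_sets` maps a label to an iterable of cuts, e.g. the two cut
    sets that were combined into the training set.
    """
    report = {}
    for name, cuts in named_cut_sets.items():
        total = 0
        with_features = 0
        duration = 0.0
        for cut in cuts:
            total += 1
            duration += cut.duration
            with_features += bool(cut.has_features)
        mean_duration = duration / total if total else 0.0
        report[name] = (with_features, total, mean_duration)
    return report
```

A set that reports zero cuts with features, or a mean duration far above typical utterance length, is the one to re-run through trimming and feature extraction.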

pzelasko avatar Sep 02 '23 14:09 pzelasko

Thank you so much, @pzelasko @danpovey! I'll try.

SaltedSlark avatar Sep 04 '23 01:09 SaltedSlark

@pzelasko As for the M subset, I am sure that I called .trim_to_supervisions as I showed. I found that the number of supervisions available does not match the number of features available: image. This seems to cause a validation error when calling validate(): image

SaltedSlark avatar Sep 04 '23 06:09 SaltedSlark

image The detailed description of this function mentions that keep_overlapping would keep the numbers matched.

Result on S subset: image

Jiang-Stan avatar Sep 04 '23 11:09 Jiang-Stan

You either need to use keep_overlapping=False or filter out the cuts that have overlapping speech (whichever makes sense for your use case).
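The two options can be sketched as follows, assuming `cuts` is a lhotse `CutSet`. The function names are hypothetical; `trim_to_supervisions(keep_overlapping=...)`, `filter(...)` and the per-cut `supervisions` list are the lhotse members being relied on. Treating "exactly one supervision" as "no overlapping speech" is an assumption that holds for cuts produced by trimming.

```python
def trim_without_overlaps(cuts):
    """Option 1: each trimmed cut keeps only the supervision it was cut from."""
    return cuts.trim_to_supervisions(keep_overlapping=False)


def drop_cuts_with_overlaps(trimmed_cuts):
    """Option 2: after a default trim, discard cuts that carry extra
    (overlapping) supervisions and keep the single-supervision ones."""
    return trimmed_cuts.filter(lambda cut: len(cut.supervisions) == 1)
```

Option 1 keeps every segment but ignores that another speaker may be audible in it; option 2 loses data but leaves only segments without overlap.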

pzelasko avatar Sep 04 '23 12:09 pzelasko

@SaltedSlark Hi, how long did preprocessing the WenetSpeech M set take you? Feature extraction took me 50 minutes, but saving to wenetspeech_cuts_M.jsonl.gz has taken over 11 hours and still has not finished.

@pzelasko Is there any parallelization for this function? I tried to preprocess the WenetSpeech M set last night; it spent over 11 hours in this function and still did not finish (the progress bar showed 50 minutes before I interrupted it from the keyboard). I have successfully preprocessed the WenetSpeech S set twice with the same num_workers, and the time for saving was negligible, so I guess this is not a lock issue. image Using htop, I see that only one CPU is used for saving. image

Jiang-Stan avatar Sep 05 '23 03:09 Jiang-Stan

For me, it took about 80 hours to process the M subset... and I also want to know how to speed it up!

SaltedSlark avatar Sep 05 '23 03:09 SaltedSlark

I'll try again.

SaltedSlark avatar Sep 05 '23 03:09 SaltedSlark

I noticed that only one thread is set to save data from here. I tried to use 32 threads but it still cannot finish saving. @pzelasko

By splitting the recordings and annotations in the manifest into small sets, I successfully generated wenetspeech_cuts_M_{i}.jsonl.gz (i=0~9) within an hour. Since recordings and supervisions are saved sequentially, matching them does not take long. @SaltedSlark

Jiang-Stan avatar Sep 05 '23 10:09 Jiang-Stan

Thanks! But I don't know how to separate the recordings and supervisions in the manifest. I need your help, bro.

SaltedSlark avatar Sep 06 '23 01:09 SaltedSlark

# Excerpt from the tokenizer script; `args` and `dataset_parts` come from the surrounding code.
from lhotse import RecordingSet, SupervisionSet
from lhotse.recipes.utils import read_manifests_if_cached
from tqdm import tqdm

manifests = read_manifests_if_cached(
    dataset_parts=dataset_parts,
    output_dir=args.src_dir,
    prefix=args.prefix,
    suffix=args.suffix,
    types=["recordings", "supervisions", "cuts"],
)

if args.prefix == "wenetspeech" and ("M" in manifests or "L" in manifests):
    # Split M into 10 parts (M0..M9), or L into 100 parts (L0..L99).
    name = "M" if "M" in manifests else "L"
    separate_num = 10 if name == "M" else 100
    origin_manifest = manifests.pop(name)
    recordings = list(origin_manifest["recordings"])
    supervisions = list(origin_manifest["supervisions"])

    # Relies on supervisions being stored in the same order as their recordings,
    # so a single forward pass is enough to match them.
    start_idx = 0
    sup_idx = 0  # index pointer instead of list.pop(0), which is O(n) per call
    for i in tqdm(range(separate_num)):
        end_idx = len(recordings) * (i + 1) // separate_num
        cur_recordings = recordings[start_idx:end_idx]
        cur_supervisions = []
        for r in cur_recordings:
            # Collect the consecutive supervisions that belong to this recording.
            while sup_idx < len(supervisions) and supervisions[sup_idx].recording_id == r.id:
                cur_supervisions.append(supervisions[sup_idx])
                sup_idx += 1

        manifests[name + str(i)] = {
            "recordings": RecordingSet.from_recordings(cur_recordings),
            "supervisions": SupervisionSet.from_segments(cur_supervisions),
        }
        start_idx = end_idx
    # Every supervision must have been assigned to some part.
    assert sup_idx == len(supervisions)

Jiang-Stan avatar Sep 06 '23 02:09 Jiang-Stan

Some tips:

  • splitting cut/recording/supervision set into smaller parts can be done with parts = cuts.split(num_parts), e.g.:
In [4]: cuts
Out[4]: CutSet(len=1519) [underlying data type: <class 'lhotse.lazy.LazyManifestIterator'>]

In [8]: cuts.split(2)
Out[8]:
[CutSet(len=760) [underlying data type: <class 'dict'>],
 CutSet(len=759) [underlying data type: <class 'dict'>]]
  • cuts.compute_and_store_features_batch is bottlenecked by I/O in 99% of the use cases since feature extraction is usually much quicker than dataloading. Try to set the highest possible batch_duration first, and then keep increasing num_workers until you start seeing crashes, freezes, or slowdowns.
  • if you're computing features on CPUs or have multiple GPUs, it's generally a good idea to split a single large cut set into parts, as was suggested earlier, and run multiple scripts processing these parts in parallel. For CPU-based computation, though, generally prefer compute_and_store_features, as it supports built-in parallelization across CPUs (unlike the batch version).
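The split-and-process-in-parallel tip might look like the sketch below, assuming `cuts` is a lhotse `CutSet`. `write_parts` and the output naming are hypothetical and only mirror the wenetspeech_cuts_M_{i}.jsonl.gz pattern mentioned earlier in the thread; `split()` and `to_file()` are the lhotse methods used.

```python
from pathlib import Path


def write_parts(cuts, num_parts, out_dir, prefix="cuts"):
    """Split a cut set into `num_parts` pieces and save each as its own manifest.

    Returns the list of written paths, one per part, so that each path can be
    handed to a separate feature-extraction process (e.g. one per GPU).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, part in enumerate(cuts.split(num_parts)):
        path = out_dir / f"{prefix}_{i}.jsonl.gz"
        part.to_file(path)
        paths.append(path)
    return paths
```

Each part then gets its own run of the extraction script with its own storage path, which also keeps every output manifest small enough to save quickly.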

pzelasko avatar Sep 07 '23 14:09 pzelasko

So is it possible to compute features on the fly with the function compute_and_store_features_batch?

OswaldoBornemann avatar Apr 16 '24 07:04 OswaldoBornemann

I didn’t get your question, please elaborate.

pzelasko avatar Apr 17 '24 13:04 pzelasko

Sorry for my incomplete question. My question is whether we can compute the features on the fly during training instead of storing them, because in my case I don't have such a large GPU for training.

OswaldoBornemann avatar Apr 18 '24 02:04 OswaldoBornemann

Yes, you can compute the features inside the PyTorch dataset class. See OnTheFlyFeatures or K2SpeechRecognitionDataset for some examples. You can also look up k2-fsa/icefall repo for recipes that support this.
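A minimal sketch of that setup, under assumptions: `Fbank` is used here only as a stand-in extractor (a codec-token model would need its own extractor and dataset class), and `build_on_the_fly_loader` with its default values is a hypothetical helper. The lhotse and PyTorch imports are placed inside the function so the definition can be loaded without those packages installed.

```python
def build_on_the_fly_loader(cuts, max_duration=200.0, num_workers=2):
    """Build a DataLoader that computes features per batch instead of reading
    precomputed ones. `cuts` is a lhotse CutSet with audio and supervisions."""
    from torch.utils.data import DataLoader
    from lhotse import Fbank
    from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
    from lhotse.dataset.input_strategies import OnTheFlyFeatures

    # Features are extracted from audio when a batch is collated; nothing is stored.
    dataset = K2SpeechRecognitionDataset(input_strategy=OnTheFlyFeatures(Fbank()))

    # The sampler yields whole batches of cuts capped at `max_duration` seconds,
    # which is why the DataLoader itself is created with batch_size=None.
    sampler = DynamicBucketingSampler(cuts, max_duration=max_duration, shuffle=True)
    return DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=num_workers)
```

This trades storage for compute: audio is decoded and features are recomputed every epoch in the dataloader workers, so `num_workers` usually has to be raised to keep the GPU busy.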

pzelasko avatar Apr 18 '24 11:04 pzelasko

That's great. I will try to revise it. Thanks a lot.

OswaldoBornemann avatar Apr 19 '24 01:04 OswaldoBornemann