Introduce wav2vec2 pretraining.

Open kauterry opened this issue 1 year ago • 3 comments

What does this PR do? Please describe:

To run pretraining:

fairseq2 wav2vec2 train /checkpoint/$USER/wav2vec2_train

Run this code snippet for testing the dataloader:

import torch

from fairseq2.datasets.speech import load_speech_dataset
from fairseq2.gang import setup_default_gang

dataset = load_speech_dataset("librispeech_960h")

gang = setup_default_gang()

data_reader = dataset.create_reader(
    "train",
    gang,
    dtype=torch.float16,
    min_audio_len=32000,
    max_audio_len=250000,
    max_num_elements=1400000,
    normalize_audio=False,
    example_shuffle_window=1000,
    batch_shuffle_window=1000,
    num_accumulate=1,
    num_prefetch=4,
    seed=2,
)

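# Pull batches until the reader is exhausted, printing the shape of the
# first item in each batch.
while True:
    try:
        batch = next(data_reader)
        print(batch[0].shape)
    except StopIteration:
        print("End of data reached.")
        break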

Output:

torch.Size([5, 240320])
torch.Size([5, 242000])
torch.Size([5, 245840])
torch.Size([5, 235040])
torch.Size([5, 237360])
torch.Size([6, 200720])
torch.Size([8, 140800])
torch.Size([5, 234640])
torch.Size([6, 218720])
torch.Size([5, 239840])
torch.Size([8, 128240])
torch.Size([24, 56320])
torch.Size([7, 176561])
torch.Size([5, 233680])
torch.Size([8, 128240])
torch.Size([8, 155920])
torch.Size([5, 233360])
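The batch shapes above can be sanity-checked against the reader settings. The sketch below assumes that `max_num_elements` caps the total number of elements per batch (batch size times padded length); this is inferred from the parameter name and the printed shapes, not confirmed by the PR. The shapes are copied from the output above.

```python
# Values taken from the create_reader() call in this PR description.
MAX_NUM_ELEMENTS = 1_400_000
MAX_AUDIO_LEN = 250_000

# (batch_size, padded_length) pairs copied from the printed output.
shapes = [
    (5, 240320), (5, 242000), (5, 245840), (5, 235040), (5, 237360),
    (6, 200720), (8, 140800), (5, 234640), (6, 218720), (5, 239840),
    (8, 128240), (24, 56320), (7, 176561), (5, 233680), (8, 128240),
    (8, 155920), (5, 233360),
]


def num_elements(shape):
    """Total elements in a padded batch: batch size times padded length."""
    batch_size, padded_len = shape
    return batch_size * padded_len


for shape in shapes:
    # Assumption: the reader keeps every batch at or under MAX_NUM_ELEMENTS.
    within_budget = num_elements(shape) <= MAX_NUM_ELEMENTS
    print(shape, num_elements(shape), "ok" if within_budget else "over budget")

largest = max(num_elements(s) for s in shapes)
print("largest batch:", largest)
```

Under that assumption, shorter utterances are packed into larger batches (for example 24 sequences of 56320 samples), while near-maximum-length utterances yield batches of 5.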

Does your PR introduce any breaking changes? If yes, please list them: List of all backwards-incompatible changes.

Check list:

  • [ ] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
  • [ ] Did you read the contributor guideline?
  • [ ] Did you make sure that your PR does only one thing instead of bundling different changes together?
  • [ ] Did you make sure to update the documentation with your changes? (if necessary)
  • [ ] Did you write any new necessary tests?
  • [ ] Did you verify new and existing tests pass locally with your changes?
  • [ ] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

kauterry, Jul 03 '24 00:07

Hi, thanks! General question about the dataloader: This assumes that all files are saved as individual audio files in the file system. Is this approach scalable for future use, especially as pre-training data requirements grow? Are there any plans to support other storage solutions, such as tar or parquet?

orena1, Jul 05 '24 15:07

Hi @orena1, you are totally right. This PR is mainly intended for the parity work with the original fairseq implementation. For better scalability, we internally use two different techniques which we will eventually upstream to this recipe as well. The first technique uses non-compressed zip files to bundle multiple audio files together. We already have support for reading such files using FileMapper (see here). The other (more recent) technique is to leverage Parquet files instead of plain TSV files. We also have preliminary support for reading Parquet files in fairseq2 which we plan to extend in the near future. Unfortunately our documentation right now is almost non-existent, but starting next week we plan to invest a lot more time writing proper documentation.
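To illustrate why non-compressed zip archives allow cheap random access, here is a minimal sketch using only Python's standard `zipfile` module. It is not fairseq2's FileMapper and makes no claim about how fairseq2 reads such files; the helper names `build_archive`, `index_archive`, and `read_member` are hypothetical. Because members are stored uncompressed (`ZIP_STORED`), each audio file occupies a contiguous byte range in the archive, so a one-time index of (offset, size) pairs is enough to fetch any member with a single slice.

```python
import io
import struct
import zipfile


def build_archive(files):
    """Bundle {name: bytes} into an in-memory zip with no compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def index_archive(archive):
    """Map each member name to the (start, size) of its raw bytes."""
    index = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            if info.compress_type != zipfile.ZIP_STORED:
                raise ValueError(f"{info.filename} is compressed")
            # The local file header has 30 fixed bytes; the file-name and
            # extra-field lengths are little-endian uint16 values at byte
            # offsets 26 and 28 of that header.
            name_len, extra_len = struct.unpack_from(
                "<HH", archive, info.header_offset + 26
            )
            start = info.header_offset + 30 + name_len + extra_len
            index[info.filename] = (start, info.file_size)
    return index


def read_member(archive, index, name):
    """Fetch one member by slicing the archive; no decompression needed."""
    start, size = index[name]
    return archive[start:start + size]


# Placeholder payloads standing in for real audio files.
files = {"a.wav": b"\x00\x01" * 100, "b.wav": b"hello audio"}
archive = build_archive(files)
index = index_archive(archive)
print(read_member(archive, index, "b.wav"))
```

With an archive on disk, the same (offset, size) index could be combined with `mmap` or `seek`/`read` so that each example costs one contiguous read rather than one file open per audio file.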

cbalioglu, Jul 05 '24 23:07

Thanks a lot! Happy to see how it progresses, and how exactly you handle random access with zip. Thanks 🙏

orena1, Jul 07 '24 17:07

Let's keep the branch around for further data loading experiments, but I'm closing this PR. Please continue further development on the team repo.

cbalioglu, Aug 27 '24 11:08