fairseq Unable to find steps for creating .tsv file

❓ Questions and Help

[HuBERT] Unable to find steps for creating .tsv file

Code: https://github.com/facebookresearch/fairseq/tree/main/examples/hubert/simple_kmeans

What have you tried?

I am trying to follow the process in the simple_kmeans folder to extract 39-D MFCC+delta+ddelta features for the first iteration of HuBERT training, but I can't find the steps for generating a .tsv file.

I have manually created the .tsv files. Here's what my train.tsv looks like:

audio_files/train/
LJ025-0076.wav
LJ037-0171.wav
LJ001-0001.wav
LJ001-0002.wav
LJ001-0003.wav
LJ001-0004.wav
LJ001-0005.wav

But when I try to run the following command:

python dump_mfcc_feature.py ${tsv_dir} ${split} ${nshard} ${rank} ${feat_dir}

This is the error I get:

File "C:\Users\HP\Desktop\iitdh2\hubert_s2ut\feature_utils.py", line 44, in iterate
    subpath, nsample = line.split("\t")
ValueError: not enough values to unpack (expected 2, got 1)

I'd also like to know where I can find a sample .tsv as well as a .ltr file.

What's your environment?

fairseq Version 2.0:
OS : Windows 10
How you installed fairseq: source
Python version: 3.9.1

Thanks.

Nov 03 '22 14:11 yashrivastava

Could you let me know whether you managed to fix this error? I'm still struggling with it :(

Mar 02 '23 18:03 tarudesu

Have you tried the wav2vec_manifest script? You can find it at fairseq/examples/wav2vec/wav2vec_manifest.py. It worked for me.

Mar 21 '23 08:03 renadnasser1

This happens because the TSV file is missing the nsample column. The instructions are misleading; adding nsample to each line of the TSV file fixes it.
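
For a local folder of WAV files, a manifest with the nsample column could be built roughly as follows. This is only a sketch, assuming uncompressed PCM .wav files that Python's standard wave module can read; write_manifest is a hypothetical helper name and is not part of fairseq (the wav2vec_manifest.py script mentioned above is the route that ships with the repo).

```python
import os
import wave


def write_manifest(root_dir, tsv_path, ext=".wav"):
    """Write '<root>' on line 1, then '<relative path>\t<frame count>' per file."""
    root_dir = os.path.abspath(root_dir)
    rows = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith(ext):
                continue
            full_path = os.path.join(dirpath, name)
            # getnframes() is the number of frames (samples per channel).
            with wave.open(full_path, "rb") as wav:
                nsample = wav.getnframes()
            rows.append((os.path.relpath(full_path, root_dir), nsample))
    rows.sort()  # deterministic order across runs
    with open(tsv_path, "w", encoding="utf-8") as f:
        f.write(root_dir + "\n")
        for subpath, nsample in rows:
            f.write(f"{subpath}\t{nsample}\n")
    return len(rows)
```

Calling write_manifest("audio_files/train", "train.tsv") would then produce one tab-separated line per WAV file under that folder.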

Oct 16 '24 11:10 indiejoseph

The nsample value read from the TSV file is used mainly for logging. It is passed as the ref_len argument to read_audio(): https://github.com/facebookresearch/fairseq/blob/ecbf110e1eb43861214b05fa001eff584954f65a/examples/hubert/simple_kmeans/dump_mfcc_feature.py#L30-L34

A warning is logged if nsample does not match the length of the WAV array. The nsample column in your TSV file is also used by fairseq_cli/hydra_train.py.

If the length check does not matter to you, it is fine to set nsample to None. To do this, change the two lines inside the loop found here: https://github.com/facebookresearch/fairseq/blob/ecbf110e1eb43861214b05fa001eff584954f65a/examples/hubert/simple_kmeans/feature_utils.py#L41-L44 to the following:

def iterate():
    for line in lines:
        subpath = line.strip()
        yield f"{root}/{subpath}", None
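
If you would rather accept both manifest styles than drop the column entirely, the parsing step can be made tolerant. The sketch below is a standalone illustration, not the fairseq code: parse_manifest_line is a hypothetical helper that returns None for nsample when the column is absent.

```python
def parse_manifest_line(line):
    """Split one manifest line into (subpath, nsample or None)."""
    fields = line.rstrip("\r\n").split("\t")
    subpath = fields[0].strip()
    # The second column is optional; keep None when it is missing or empty.
    if len(fields) > 1 and fields[1].strip():
        return subpath, int(fields[1])
    return subpath, None
```

Inside iterate(), the result of this helper could replace the two-value unpacking that raises the ValueError.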

While setting nsample to None will allow dump_mfcc_feature.py to run, you may still encounter issues with fairseq_cli/hydra_train.py. If you plan to use fairseq_cli/hydra_train.py to train a HuBERT model, I recommend that you do not modify feature_utils.py as suggested above. Instead, add a column to your TSV file in the following format:

<root-dir>
<audio-path-1>\t<length-of-audio-array-1>
<audio-path-2>\t<length-of-audio-array-2>
...

To create a TSV file like this, you can use the following Python code:

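from datasets import load_dataset


def func(tsv_path, root_dir):
    # Load the LibriSpeech "clean" test split from the Hugging Face Hub.
    data = load_dataset('openslr/librispeech_asr', name='clean', split='test')
    with open(tsv_path, 'w', encoding='utf-8') as f:
        # First line: the root directory that all audio paths are relative to.
        f.write(root_dir + '\n')
        for sample in data:
            audio = sample['audio']
            # Strip the root prefix (and any leading separator) to get a relative path.
            relative_audio_path = audio['path'].replace(root_dir, '').lstrip('/\\')
            # nsample is the number of samples in the decoded waveform.
            n_sample = len(audio['array'])
            f.write(relative_audio_path + '\t' + str(n_sample) + '\n')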

In this code, tsv_path is the path where you want to save your TSV file, and root_dir is the directory written on the first line of the TSV file, as described in the instructions.
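
Before launching feature extraction, it may be worth checking the manifest for the problem reported at the top of this thread. A minimal sketch, assuming the two-column format above; check_manifest is a hypothetical helper, not part of fairseq:

```python
def check_manifest(tsv_path):
    """Return a list of (line_number, text) for malformed entries."""
    bad = []
    with open(tsv_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # Line 1 is the root directory; audio entries start on line 2.
    for lineno, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        # Expect exactly two fields, the second a non-negative integer.
        if len(fields) != 2 or not fields[1].strip().isdigit():
            bad.append((lineno, line))
    return bad
```

An empty list means every entry has a path and an integer sample count; anything else points at the lines that would trigger the unpacking error.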

I hope this helps!

Jan 01 '25 08:01 jingfanke