
about hdf5

Open shawnthu opened this issue 5 years ago • 3 comments

I have several hundred GB of wav files on my disk (about 1,000 hours of audio). I found that reading the wav files directly is too slow for training, so I am considering lmdb and hdf5 as options. However, I found that hdf5 does not support concurrent reads, i.e. num_workers in DataLoader cannot be more than 1. How do you solve this problem? thx
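One common workaround (a sketch, not an official recipe; `H5Dataset`, the file path, and the dataset name are illustrative) is to open the HDF5 file lazily inside `__getitem__` rather than in `__init__`. Each DataLoader worker process then gets its own h5py file handle on first access, so num_workers > 1 works without sharing a handle across processes:

```python
import numpy as np
import h5py
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    """Open the HDF5 file lazily so each worker gets its own handle."""

    def __init__(self, h5_path, dataset_name):
        self.h5_path = h5_path
        self.dataset_name = dataset_name
        self._file = None  # not opened here; pickled cleanly to workers
        # Read the length once, then close immediately.
        with h5py.File(h5_path, "r") as f:
            self._len = len(f[dataset_name])

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:  # first access in this (worker) process
            self._file = h5py.File(self.h5_path, "r")
        return np.asarray(self._file[self.dataset_name][idx])
```

Because the file is opened after the worker processes fork, no handle is ever shared between processes, which sidesteps the concurrency limitation.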

shawnthu avatar May 13 '19 06:05 shawnthu

https://pytorch.org/audio/datasets.html#yesno torchaudio lists two examples, but the datasets they use are very small, so they can be loaded into memory directly. This approach does not work for large datasets that cannot fit into memory!

shawnthu avatar May 13 '19 07:05 shawnthu


Besides, I found an absurd phenomenon. For example, all my wav files are under the /wav folder. First, I have a read_wav function like this:

from scipy.io import wavfile
from torch.utils.data import Dataset, DataLoader

def read_wav(wav_path):
    rate, data = wavfile.read(wav_path)
    return data

class Dst(Dataset):
    def __init__(self, wav_path_list):
        self.wav_path_list = wav_path_list

    def __len__(self):
        return len(self.wav_path_list)

    def __getitem__(self, idx):
        return read_wav(self.wav_path_list[idx])

dst = Dst(wav_path_list)
loader = DataLoader(dst, batch_size=batch_size, shuffle=True,
                    num_workers=num_workers)

In fact, when I increase num_workers from 0 to 4 (my workstation has 8 CPU cores), the speed does not change! It looks like the read_wav function already occupies all the CPU cores, which makes num_workers ineffective.
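A quick way to check whether workers actually help is to time one full pass over the dataset for several worker counts (a sketch; `time_epoch` is a hypothetical helper, and `dst` is the Dataset defined above). If the times barely change as workers increase, loading is likely bound by disk I/O rather than CPU, and extra workers cannot hide that:

```python
import time
from torch.utils.data import DataLoader

def time_epoch(dataset, num_workers, batch_size=8):
    """Time one full pass over `dataset` with the given worker count."""
    loader = DataLoader(dataset, batch_size=batch_size,
                        shuffle=False, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass  # iterate only; we measure loading, not training
    return time.perf_counter() - start

# Example (names assumed from the snippet above):
# for n in (0, 2, 4):
#     print(n, time_epoch(dst, num_workers=n))
```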

shawnthu avatar May 13 '19 07:05 shawnthu