Does DataLoader(shuffle=True) really shuffle DBpedia dataset correctly?

Open fujidaiti opened this issue 2 years ago • 0 comments

According to the docs, the DBpedia dataset has 14 classes (labels) and 40,000 training texts per class. Hence, if I create batches using DataLoader(shuffle=True) as follows:

import torchtext.datasets as d
from torch.utils.data import DataLoader

# Wrap the DBpedia training split in a DataLoader with shuffling enabled.
train = DataLoader(
    d.DBpedia(split="train", root=".cache"),
    batch_size=10000,
    shuffle=True,
)

the labels should be roughly uniformly distributed within each batch. In practice, however, each batch seems to contain only a few distinct labels.

# Print the number of distinct labels that appear in each batch.
for labels, texts in train:
    print(len(set(labels.tolist())))

The output of the above code is:

How can I fix this? Or is my implementation wrong?
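
For reference, the expectation above can be checked without torchtext. The following is a minimal pure-Python sketch, not a diagnosis: it builds a label stream sorted by class (14 classes, 40,000 labels each, as quoted above) and compares a full shuffle with a bounded-buffer shuffle. Two assumptions are made here and are not confirmed for this dataset: that the data arrives sorted by class, and that shuffling might happen through a fixed-size buffer, as is common for iterable-style datasets. `buffered_shuffle` and its buffer size of 10,000 are hypothetical, chosen only for illustration.

```python
import random

NUM_CLASSES = 14
PER_CLASS = 40_000   # figure quoted in the question
BATCH_SIZE = 10_000  # same batch size as in the question


def distinct_labels_per_batch(labels, batch_size):
    """Return the number of distinct labels in each consecutive batch."""
    return [
        len(set(labels[i:i + batch_size]))
        for i in range(0, len(labels), batch_size)
    ]


def full_shuffle(labels, rng):
    """Shuffle the whole sequence at once (every item can land anywhere)."""
    out = list(labels)
    rng.shuffle(out)
    return out


def buffered_shuffle(labels, buffer_size, rng):
    """Hypothetical streaming shuffle that only mixes items within a bounded buffer."""
    buf, out = [], []
    for x in labels:
        if len(buf) < buffer_size:
            buf.append(x)                 # fill the buffer first
        else:
            i = rng.randrange(buffer_size)
            out.append(buf[i])            # emit a random buffered item
            buf[i] = x                    # and replace it with the new one
    rng.shuffle(buf)                      # flush what is left at the end
    out.extend(buf)
    return out


rng = random.Random(0)

# Assumption: the stream is sorted by class (all of class 0, then class 1, ...).
sorted_labels = [c for c in range(NUM_CLASSES) for _ in range(PER_CLASS)]

full_counts = distinct_labels_per_batch(
    full_shuffle(sorted_labels, rng), BATCH_SIZE
)
buffered_counts = distinct_labels_per_batch(
    buffered_shuffle(sorted_labels, 10_000, rng), BATCH_SIZE
)

print("full shuffle, min distinct labels per batch:", min(full_counts))
print("buffered shuffle, max distinct labels per batch:", max(buffered_counts))
```

With a full shuffle, every batch of 10,000 contains all 14 labels. With the bounded buffer, the first batch can only contain one label, because the buffer has seen nothing but the first class by then. If the real pipeline behaves like the buffered case, that would be consistent with seeing only a few labels per batch, but whether it applies here is untested.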

P.S. Interactive code is available on Google Colab.

Aug 04 '23 10:08 fujidaiti
