Does DataLoader(shuffle=True) really shuffle the DBpedia dataset correctly?

Open fujidaiti opened this issue 1 year ago • 0 comments

According to the docs, the DBpedia dataset has 14 classes (labels) and 40,000 training texts for each class. Hence, if I create batches using DataLoader(shuffle=True) as follows:

import torchtext.datasets as d
from torch.utils.data import DataLoader

train = DataLoader(
    d.DBpedia(split="train", root=".cache"),
    batch_size=10000,
    shuffle=True,
)

the labels should be roughly uniformly distributed within each batch, so every batch of 10,000 should contain all 14 labels. In practice, however, each batch seems to contain only a few distinct labels:

for labels, texts in train:
    print(len(set(labels.tolist())))

The output of the above code (the number of distinct labels in each batch) is:

1
1
1
2
2
2
2
3
3
3
3
4
4
3
3
.
.
.
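For comparison, here is a minimal sketch of what I would expect from a full shuffle. It does not use torchtext at all: the labels are synthetic (14 classes with 40,000 entries each, which is my assumption about the training split), and `distinct_labels_per_batch` is a hypothetical helper written only for this illustration.

```python
import random

NUM_CLASSES = 14      # number of DBpedia classes, per the docs
PER_CLASS = 40_000    # assumed number of training texts per class
BATCH_SIZE = 10_000   # same batch size as in the DataLoader above


def distinct_labels_per_batch(labels, batch_size):
    """Return the number of distinct labels in each consecutive batch."""
    return [
        len(set(labels[start:start + batch_size]))
        for start in range(0, len(labels), batch_size)
    ]


# Synthetic stand-in for the dataset labels (1..14), grouped by class.
labels = [c for c in range(1, NUM_CLASSES + 1) for _ in range(PER_CLASS)]

# A full shuffle: every position can receive any element of the list.
rng = random.Random(0)
rng.shuffle(labels)

counts = distinct_labels_per_batch(labels, BATCH_SIZE)
print(counts[:15])
```

With a shuffle over the whole list, I would expect every entry of `counts` to be 14, which is very different from the 1 to 4 distinct labels per batch shown above.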

How can I fix this? Or is my implementation wrong?

P.S. Interactive code is available on Google Colab.

fujidaiti, Aug 04 '23 10:08