Does DataLoader(shuffle=True) really shuffle the DBpedia dataset correctly?
According to the docs, the DBpedia dataset has 14 classes (labels) and 40,000 texts per class. Hence, if I create batches using DataLoader(shuffle=True) as follows:
import torchtext.datasets as d
from torch.utils.data.dataloader import DataLoader

train = DataLoader(
    d.DBpedia(split="train", root=".cache"),
    batch_size=10000,
    shuffle=True,
)
the labels should be roughly uniformly distributed within each batch. In practice, however, only a few distinct labels appear in each batch:
for labels, texts in train:
    print(len(set(labels.tolist())))
The output of the above code is:
1
1
1
2
2
2
2
3
3
3
3
4
4
3
3
...
How can I fix this? Or is my implementation wrong?
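For comparison, here is a minimal sketch (not using torchtext) of what I expected: with a map-style dataset such as a plain Python list, DataLoader(shuffle=True) draws a global random permutation, so batches mix labels even when the underlying data is sorted by label. The synthetic data below mimics DBpedia's label-sorted layout with 14 classes; the class/batch sizes are arbitrary choices for illustration.

```python
from torch.utils.data import DataLoader

# Synthetic label-sorted dataset: 14 classes, 1,000 samples per class,
# laid out contiguously by label, like the DBpedia file on disk.
data = [(label, f"text-{label}-{i}") for label in range(14) for i in range(1000)]

# A list is a map-style dataset (__getitem__/__len__), so shuffle=True
# permutes all 14,000 indices globally before batching.
loader = DataLoader(
    data,
    batch_size=1000,
    shuffle=True,
    collate_fn=lambda batch: batch,  # keep (label, text) tuples as-is
)

# Count the distinct labels per batch; with a global shuffle,
# essentially every batch of 1,000 contains all 14 labels.
label_counts = [len({label for label, _ in batch}) for batch in loader]
print(label_counts)
```

This is the distribution I expected from the DBpedia loader above, which is why the single-digit counts in the output surprised me.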
P.S. Interactive code is available on Google Colab.