
Shuffle doesn't work

Open Ilyushin opened this issue 3 years ago • 3 comments

Hi all!

Below is an example of the code:

from merlin.loader.torch import Loader
from merlin.io import Dataset


train_ds = Dataset('train.parquet')
train_loader = Loader(train_ds, batch_size=65536, shuffle=True)

for batch in train_loader:
    print(batch)

After running it, I got the following error:

TypeError: sample() got an unexpected keyword argument 'keep_index'

Ilyushin, Dec 15 '22 10:12

@Ilyushin Thanks for reporting the issue. Can you provide more details so we can reproduce the issue on our end?

  • Did you use our merlin containers, e.g., nvcr.io/nvidia/merlin/merlin-pytorch:22.11 or install it with conda or pip?
  • What are the package versions that you see when you run python -c 'import merlin.core; print(merlin.core.__version__)' and python -c 'import merlin.dataloader; print(merlin.dataloader.__version__)'?
  • Is it possible to provide us with the dataset schema, train_ds.schema?
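
If it is easier, the version checks above can be collected in one go. This is only a rough sketch: the helper name report_versions is made up for illustration, and including cudf in the list is an assumption that it is relevant to the error.

```python
import importlib


def report_versions(module_names):
    """Return {module name: version string} for each requested module.

    `report_versions` is a hypothetical helper, not part of Merlin.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Covers a missing package as well as import-time failures
            # (for example, a GPU library imported on a machine without a GPU).
            versions[name] = "not importable"
            continue
        # Not every module defines __version__, so fall back to a placeholder.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # cudf is included on the assumption that it is involved in the error.
    for name, version in report_versions(
        ["merlin.core", "merlin.dataloader", "cudf"]
    ).items():
        print(f"{name}: {version}")
```

Pasting the printed lines into the issue should be enough for us to match the environment.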

edknv, Dec 15 '22 19:12

@edknv Thank you for helping.

  • I have used nvcr.io/nvidia/pytorch:22.06-py3
  • I tried to use 0.0.2 and 0.0.3
  • I downloaded this dataset - https://www.kaggle.com/code/radek1/howto-full-dataset-as-parquet-csv-files

Ilyushin, Dec 23 '22 16:12

This seems to be caused by the version of cudf in the nvcr.io/nvidia/pytorch:22.06-py3 container. In older versions of cudf (prior to 22.04), the keep_index parameter was not available in df.sample().
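
One way to confirm this in a given environment is to inspect the signature of sample() directly. The snippet below is a minimal sketch rather than anything from the dataloader itself: accepts_keyword is a hypothetical helper, and it assumes cudf.DataFrame.sample is the method that raises the TypeError.

```python
import inspect


def accepts_keyword(func, keyword):
    """Return True if `func` can be called with `keyword` as a keyword argument.

    `accepts_keyword` is a hypothetical helper written for this check.
    """
    params = inspect.signature(func).parameters
    param = params.get(keyword)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword at call time, although the
    # function body may still reject it later.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


if __name__ == "__main__":
    try:
        import cudf
    except Exception:
        print("cudf could not be imported in this environment")
    else:
        print("cudf version:", cudf.__version__)
        # Assumption: this is the sample() method the loader ends up calling.
        print(
            "sample() accepts keep_index:",
            accepts_keyword(cudf.DataFrame.sample, "keep_index"),
        )
```

If the last line prints False, the installed cudf does not match what the dataloader expects.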

@Ilyushin Is upgrading your container an option (e.g., to nvcr.io/nvidia/pytorch:22.07-py3, or even the latest 22.12-py3, instead of 22.06)? Please also note that nvcr.io/nvidia/merlin/merlin-pytorch comes with merlin-dataloader pre-installed, so you don't have to install it separately.

edknv, Jan 18 '23 20:01