Shuffle doesn't work
Hi all!
Below is an example of the code:
from merlin.loader.torch import Loader
from merlin.io import Dataset
train_ds = Dataset('train.parquet')
train_loader = Loader(train_ds, batch_size=65536, shuffle=True)
for batch in train_loader:
    print(batch)
After running it, I got the following error:
TypeError: sample() got an unexpected keyword argument 'keep_index'
@Ilyushin Thanks for reporting the issue. Can you provide more details so we can reproduce the issue on our end?
- Did you use our Merlin containers, e.g., nvcr.io/nvidia/merlin/merlin-pytorch:22.11, or install it with conda or pip?
- What are the package versions that you see when you run python -c 'import merlin.core; print(merlin.core.__version__)' and python -c 'import merlin.dataloader; print(merlin.dataloader.__version__)'?
- Is it possible to provide us with the dataset schema (train_ds.schema)?
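If it is easier, the version information can be gathered in one go. The snippet below is only a rough sketch: it assumes the packages are installed under the distribution names merlin-core, merlin-dataloader, and cudf, which may differ depending on how they were installed (conda vs. pip). The helper name collect_versions is made up for this example.

```python
from importlib import metadata


def collect_versions(names):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is missing or registered under a different name.
            versions[name] = "not installed"
    return versions


# Assumed distribution names; adjust them to match your environment.
for name, version in collect_versions(["merlin-core", "merlin-dataloader", "cudf"]).items():
    print(f"{name}: {version}")
```

Pasting the printed lines into the thread should be enough for us to compare environments.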
@edknv Thank you for helping.
- I used the nvcr.io/nvidia/pytorch:22.06-py3 container.
- I tried versions 0.0.2 and 0.0.3.
- I downloaded this dataset: https://www.kaggle.com/code/radek1/howto-full-dataset-as-parquet-csv-files
This seems to be caused by the version of cudf in the nvcr.io/nvidia/pytorch:22.06-py3 container. In older versions of cudf (prior to 22.04), the keep_index parameter was not available in df.sample().
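One way to confirm this kind of mismatch without reading release notes is to inspect the signature of the installed sample() method. The following is a minimal sketch, not part of merlin or cudf: supports_keyword is a hypothetical helper, and OldFrame and NewFrame are stand-in classes that only imitate a sample() signature with and without keep_index.

```python
import inspect


def supports_keyword(func, name):
    """Return True if `func` accepts a keyword argument called `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    # A **kwargs catch-all would also accept the keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for two sample() signatures (not cudf's real code).
class OldFrame:
    def sample(self, n=None, frac=None, replace=False, random_state=None):
        return self


class NewFrame:
    def sample(self, n=None, frac=None, replace=False, random_state=None, keep_index=True):
        return self


print(supports_keyword(OldFrame.sample, "keep_index"))  # False
print(supports_keyword(NewFrame.sample, "keep_index"))  # True

# In the affected environment, the same check would be:
#   import cudf
#   supports_keyword(cudf.DataFrame.sample, "keep_index")
```

If the check returns False for cudf.DataFrame.sample, the installed cudf and the dataloader disagree about that parameter, which matches the TypeError above.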
@Ilyushin Is upgrading your container an option (e.g., to nvcr.io/nvidia/pytorch:22.07-py3, or even the latest, 22.12-py3, instead of 22.06)? Please also note that the nvcr.io/nvidia/merlin/merlin-pytorch container comes with merlin-dataloader pre-installed, so you don't have to install it separately.