models [BUG] Update MovieLens Dataset Defaults to handle memory resources

Bug description

Various examples in the codebase call get_movielens without any parameters. Running this on a low-memory GPU (e.g. an RTX A3000 with 6GB of memory) raises a MemoryError (std::bad_alloc: out_of_memory: CUDA error).

Steps/Code to reproduce bug

Import get_movielens and call it without parameters on a machine with a relatively low-memory GPU (e.g. 6GB).

from merlin.datasets.entertainment import get_movielens
train, valid = get_movielens()

Expected behavior

I would expect the dataset to be batched rather than loaded into GPU memory all at once. If loading everything at once is the only option, perhaps the 25M dataset should not be the default variant.

Environment details

Merlin version: 0.4.0+10.gfd1215e5
Platform: Ubuntu-WSL
Python version: 3.9
Tensorflow version: 2.8.0
GPU: NVIDIA RTX A3000 Laptop GPU

Additional context

May 13 '22 12:05 oliverholworthy

~~We should probably be returning a merlin.io.Dataset object from these functions instead of a dataframe. This would let us partition the data into smaller chunks here.~~

May 13 '22 16:05 benfred

We do, see source. The issue might be that TF is loaded and therefore takes up 50% of GPU memory. It sounded like Oliver has a GPU with 6GB of memory, which might not be enough with the default parameters. Maybe we should use different defaults when we detect that there isn't much memory available?
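
To make the "different defaults" idea concrete, here is a rough sketch of how a default could be picked from the amount of free GPU memory. Everything in it is an assumption for illustration: the helper name `pick_movielens_variant`, the thresholds, and the variant strings are hypothetical and not part of the current merlin code, and the thresholds have not been measured against real memory usage.

```python
# Hypothetical helper, not part of merlin.datasets: illustrates choosing a
# smaller MovieLens variant when little GPU memory is free.

GIB = 1024 ** 3

# (minimum free GPU memory in bytes, variant name), largest requirement first.
# These thresholds are illustrative assumptions, not measured requirements.
VARIANT_THRESHOLDS = [
    (16 * GIB, "ml-25m"),
    (2 * GIB, "ml-1m"),
]

# Variant used when none of the thresholds above is met (also an assumption).
FALLBACK_VARIANT = "ml-100k"


def pick_movielens_variant(free_gpu_bytes):
    """Return the largest variant whose assumed memory requirement is met."""
    for min_free_bytes, variant in VARIANT_THRESHOLDS:
        if free_gpu_bytes >= min_free_bytes:
            return variant
    return FALLBACK_VARIANT


# A 6GB GPU with roughly half of its memory already reserved by TF
# leaves about 3 GiB free, which falls into the smaller variant.
print(pick_movielens_variant(3 * GIB))
```

The returned name could then be used as the default variant inside get_movielens, with the free-memory figure coming from whatever GPU query the library already relies on. Callers who pass a variant explicitly would be unaffected.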

May 13 '22 16:05 marcromeyn

@rnyak to set the default to 1M if it has the same features as the 25M.

Oct 03 '22 15:10 rnyak