NVTabular
[BUG] Dataloader high GPU memory consumption
**Describe the bug**
If I create a single batch with the NVTabular dataloader (PyTorch or TensorFlow), GPU memory consumption is very high. In the case of PyTorch, a single batch of the Criteo dataset requires 24GB of GPU memory on a 32GB V100. I think we need to investigate the default behavior of `nvt.Dataset`.
**Steps/Code to reproduce bug**
Run the Criteo 2_ETL_with_NVTabular notebook to produce the parquet files (one day is enough).
Run:

```python
import glob

import torch
import torch.nn as nn
from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader
import nvtabular as nvt

# Parquet files produced by the ETL notebook (adjust the path to your output directory)
train_files = sorted(glob.glob("./output/train/*.parquet"))

CONTINUOUS_COLUMNS = ["I" + str(x) for x in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(x) for x in range(1, 27)]
LABEL_COLUMNS = ["label"]

train_loader = TorchAsyncItr(
    nvt.Dataset([train_files[0]]),
    batch_size=1024,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
)

batch = next(iter(train_loader))
```
**Expected behavior**
I expect that the default behavior does not require 75% of the GPU memory.
**Additional context**
I can limit the GPU memory consumption by setting `nvt.Dataset(..., part_size=part_size)`. I ran multiple experiments:

| Part size | Peak GPU memory | Time |
|---|---|---|
| 100MB | 2GB | 25s |
| 300MB | 2.2GB | 14s |
| 1000MB | 4GB | 6.85s |
| 3000MB | 10GB | 5.2s |
| 5000MB | 10GB | 4.9s |
| 10000MB | OOM | - |
| No part size | 18GB | 4.62s |

GPU memory is the peak usage; time is the time to iterate over one day of Criteo without training a model.

I think the default behavior of `nvt.Dataset` should strike a better trade-off between peak GPU memory and time (see the sketch below for how `part_size` is set).
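For reference, a minimal sketch of how `part_size` was set in these experiments; the file path and the 1000MB value are just examples, not a recommended default:

```python
import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr

# Cap the partition size so the dataloader only materializes
# ~part_size-sized chunks on the GPU at a time
dataset = nvt.Dataset(
    ["./output/train/part_0.parquet"],  # example path
    engine="parquet",
    part_size="1000MB",
)

train_loader = TorchAsyncItr(
    dataset,
    batch_size=1024,
    cats=["C" + str(x) for x in range(1, 27)],
    conts=["I" + str(x) for x in range(1, 14)],
    labels=["label"],
)
```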
I was able to reproduce the high memory usage locally, but I did not have a chance to compare with an earlier version of NVTabular to see if this is a regression (or if it has always been the case). As far as I can tell, we can expect the minimum possible memory usage to be 2x the partition size (the size of the current and the pre-fetched partition). However, we also know that the PyTorch loader will likely double this memory usage by converting to dlpack. Therefore, if you are asking for 4GB partitions, I would expect the minimum steady-state memory usage to be ~16GB.
@jperez999
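A back-of-envelope sketch of that estimate, assuming the only contributors are the two resident partitions plus a dlpack copy of each (the factors come from the reasoning above, not from measurements):

```python
def estimate_steady_state_memory(part_size_gb, resident_parts=2, dlpack_factor=2):
    """Rough lower bound: `resident_parts` partitions held at once,
    each duplicated once by the dlpack conversion in the PyTorch loader."""
    return part_size_gb * resident_parts * dlpack_factor

print(estimate_steady_state_memory(4))  # ~16 GB for 4GB partitions
```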
I tested the 22.02 PyTorch container after pulling the latest main branch.

- If I initialize all variables (run the code above without `batch = next(iter(train_loader))`), `nvidia-smi` shows 1250MiB.
- If I run `batch = next(iter(train_loader))`, `nvidia-smi` shows 24946MiB / 32510MiB.

The data loader consumes ~75% of the GPU memory.
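To narrow down where that memory is held, it might help to compare `nvidia-smi` with PyTorch's own allocator statistics right after pulling the batch; memory visible to the driver but not to PyTorch would point at cuDF/dlpack buffers rather than torch tensors. A minimal sketch (not something I have run here):

```python
import torch

batch = next(iter(train_loader))

# Memory held by PyTorch's caching allocator (in GiB)
print("torch allocated:", torch.cuda.memory_allocated() / 1024**3)
print("torch reserved: ", torch.cuda.memory_reserved() / 1024**3)

# Total device usage as seen by the driver (what nvidia-smi reports)
free, total = torch.cuda.mem_get_info()
print("device used:    ", (total - free) / 1024**3)
```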
I tried to run the same code with TensorFlow. However, I cannot execute `batch = next(iter(train_dataset_tf))` for an NVTabular TensorFlow dataloader.
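For reference, this is roughly the TensorFlow equivalent being attempted, sketched with `KerasSequenceLoader` (argument names may need adjusting for the installed version):

```python
from nvtabular.loader.tensorflow import KerasSequenceLoader
import nvtabular as nvt

train_dataset_tf = KerasSequenceLoader(
    nvt.Dataset([train_files[0]]),
    batch_size=1024,
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    label_names=LABEL_COLUMNS,
    shuffle=False,
)

batch = next(iter(train_dataset_tf))
```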
@jperez999 @bschifferer What's the next step with this?