
[BUG] Dataloader high GPU memory consumption

Open · bschifferer opened this issue on Jan 27, 2022 · 4 comments

Describe the bug
If I create a single batch with the NVTabular dataloader (PyTorch or TensorFlow), the GPU memory consumption is very high. In the case of PyTorch, a single batch of the Criteo dataset requires 24GB of GPU memory on a 32GB V100. I think we need to investigate the default behavior of nvt.Dataset.

Steps/Code to reproduce bug
Run the Criteo 2_ETL_with_NVTabular notebook to produce parquet files; one day of data is enough.

Run

import torch
import torch.nn as nn

from nvtabular.loader.torch import TorchAsyncItr, DLDataLoader

import nvtabular as nvt

CONTINUOUS_COLUMNS = ["I" + str(x) for x in range(1, 14)]
CATEGORICAL_COLUMNS = ["C" + str(x) for x in range(1, 27)]
LABEL_COLUMNS = ["label"]
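# train_files: list of parquet file paths produced by the ETL notebook above
# (not defined in this snippet)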

train_loader = TorchAsyncItr(
    nvt.Dataset([train_files[0]]), 
    batch_size=1024, 
    cats=CATEGORICAL_COLUMNS, 
    conts=CONTINUOUS_COLUMNS, 
    labels=LABEL_COLUMNS
)

batch = next(iter(train_loader))

Expected behavior
I expect that the default behavior does not require 75% of the GPU memory.

Additional context
I can limit the GPU memory consumption by setting nvt.Dataset(..., part_size=part_size). I ran multiple experiments:

Part size    | GPU memory | Time
100 MB       | 2 GB       | 25 s
300 MB       | 2.2 GB     | 14 s
1000 MB      | 4 GB       | 6.85 s
3000 MB      | 10 GB      | 5.2 s
5000 MB      | 10 GB      | 4.9 s
10000 MB     | OOM        | -
No part size | 18 GB      | 4.62 s

GPU memory is the peak value; time is the time to iterate over one day of Criteo data without training a model.

I think the default behavior of nvt.Dataset should strike a better trade-off between peak GPU memory and time.
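For reference, a minimal sketch of the part_size workaround used in the experiments above, reusing train_files and the column lists from the repro code (the "1000MB" value is just the middle row of the table, not a recommended default):

import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr

# Capping the Dask partition size caps the loader's peak GPU memory.
train_dataset = nvt.Dataset(
    [train_files[0]],        # same parquet file list as in the repro above
    part_size="1000MB",      # size string; an integer byte count should also work
)

train_loader = TorchAsyncItr(
    train_dataset,
    batch_size=1024,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS,
)

batch = next(iter(train_loader))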

bschifferer · Jan 27, 2022

I was able to reproduce the high memory usage locally, but I did not have a chance to compare with an earlier version of NVTabular to see whether this is a regression (or whether it has always been the case). As far as I can tell, we can expect the minimum possible memory usage to be 2x the partition size (the current partition plus the pre-fetched one). However, we also know that the PyTorch loader will likely double this memory usage by converting to dlpack. Therefore, if you are asking for 4GB partitions, I would expect the minimum steady-state memory usage to be ~16GB.
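For concreteness, a rough back-of-the-envelope version of that estimate (illustrative only, assuming the 2x-partition plus dlpack-copy model described above):

# Rough estimate: current + pre-fetched partition, doubled again
# by the dlpack conversion in the PyTorch loader.
def min_steady_state_gb(part_size_gb: float) -> float:
    return part_size_gb * 2 * 2

for part in (0.1, 0.3, 1.0, 3.0, 4.0):
    print(f"{part:>4} GB partitions -> ~{min_steady_state_gb(part):.1f} GB minimum")

Against the table above, the measured peaks for the mid-range part sizes (e.g. ~4 GB at 1000 MB partitions) are roughly in line with this model.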

rjzamora · Feb 1, 2022

@jperez999

I tested the 22.02 PyTorch container after pulling the latest main branch.

  1. If I initialize all variables (run the code above without batch = next(iter(train_loader))), nvidia-smi shows 1250MiB.
  2. If I run batch = next(iter(train_loader)), nvidia-smi shows 24946MiB / 32510MiB.

The data loader consumes 75% of GPU memory.
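A minimal sketch of the same measurement done programmatically (assumes pynvml is installed and train_loader is the loader from the repro code; the nvidia-smi readings above remain the reference numbers):

# Measure device memory on GPU 0 before and after fetching one batch.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib() -> float:
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20

before = used_mib()
batch = next(iter(train_loader))
after = used_mib()
print(f"before: {before:.0f} MiB, after: {after:.0f} MiB, delta: {after - before:.0f} MiB")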

bschifferer · Feb 17, 2022

I tried to run the same code with TensorFlow. However, I cannot execute batch = next(iter(train_dataset_tf)) with an NVTabular TensorFlow dataloader.
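A sketch of the TensorFlow-side setup being referred to, assuming KerasSequenceLoader from nvtabular.loader.tensorflow with the same columns and files as the PyTorch repro (not necessarily the exact code that failed):

# Reconstruction of the TensorFlow-side repro (details may differ).
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

train_dataset_tf = KerasSequenceLoader(
    nvt.Dataset([train_files[0]]),
    batch_size=1024,
    label_names=LABEL_COLUMNS,
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS,
    shuffle=False,
)

batch = next(iter(train_dataset_tf))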

bschifferer · Feb 17, 2022

@jperez999 @bschifferer What's the next step with this?

karlhigley · Mar 22, 2022