
[BUG] Dataloader doesn't release memory, causing GPU memory growth

Open bschifferer opened this issue 3 years ago • 3 comments

Bug description

I run a training script that reinitializes NVTabular dataloaders. After each initialization, the available GPU memory decreases (measured with fmem = pynvml_mem_size(kind="free", index=0)). This is unexpected: the available GPU memory should stay constant. After training for a while, I run out of memory (OOM) for that reason. The use case is training per day, where each day is a separate file.

Results (free GPU memory in bytes, one reading per reinitialization): [41587769344, 41113812992, 40744714240, 40509833216, 40390295552, 39972962304, 39866007552, 39635320832, 39299776512, 39115227136, 38846791680, 38544801792, 38276366336, 38007930880, 37789827072, 37521391616, 37269733376, 37018075136, 36758028288, 36514758656, 36263100416, 36112105472, 35701063680, 35474571264, 35222913024]

Steps/Code to reproduce bug

Data Generation

import cudf

# Build a 100M-row frame with three feature columns and two target
# columns (five int64 columns, roughly 4 GB) and write it to parquet.
df = cudf.DataFrame({
    'col1': list(range(100000000)),
    'col2': list(range(100000000)),
    'col3': list(range(100000000)),
    'target1': list(range(100000000)),
    'target2': list(range(100000000))
})

df.to_parquet('test' + str(0) + '.parquet')

Executing Script

import tensorflow as tf

# Enable memory growth so TensorFlow does not pre-allocate the whole GPU
# and pynvml reports meaningful free-memory numbers.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

import merlin.models.tf.dataset as tf_dataloader
from merlin.schema.tags import Tags
import nvtabular as nvt

from nvtabular.utils import pynvml_mem_size

def map_output(x, y):
    # Concatenate the two target columns into a single (batch, 2) label tensor.
    out = []
    for tg in ['target1', 'target2']:
        out.append(y[tg])
    y = tf.concat(out, axis=1)
    return x, y

import gc
import glob

BATCH_SIZE = 2 * 64 * 1024
fmems = []

for j in range(25):
    print(j)
    # Record free GPU memory before each dataloader (re)initialization.
    fmem = pynvml_mem_size(kind="free", index=0)
    print(fmem)
    fmems.append(fmem)
    gc.collect()
    files = sorted(glob.glob('test' + str(0) + '.parquet'))
    train = nvt.Dataset(files, part_size="100MB")
    train_dl = tf_dataloader.BatchedDataset(
        train,
        batch_size=1024 * 64,
        shuffle=True,
        drop_last=True,
        cat_names=['col1', 'col2', 'col3'],
        label_names=['target1', 'target2']
    ).map(map_output)
    gc.collect()
    # Iterate a handful of batches, then drop the dataloader and repeat.
    for i, (inputs, labels) in enumerate(train_dl):
        if i > 10:
            del train, train_dl
            break
Expected behavior

The available GPU memory should stay constant across the reinitializations.
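
For reference, a rough way to check this, reusing pynvml_mem_size from the script above (the 100 MB tolerance is an arbitrary illustration, not a guarantee from the library):

before = pynvml_mem_size(kind="free", index=0)
# ... build the dataloader, iterate a few batches, tear it down ...
after = pynvml_mem_size(kind="free", index=0)

# Free memory should return close to the baseline once the dataloader is
# gone; a steady per-iteration drop like the numbers above indicates a leak.
assert before - after < 100 * 1024 ** 2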

bschifferer · Sep 26 '22 15:09

A potential solution is to call train_dl.stop() before del train, train_dl.
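
A minimal sketch of the adjusted inner loop, assuming the same reproduction script as above (the added gc.collect() is optional):

    for i, (inputs, labels) in enumerate(train_dl):
        if i > 10:
            # Stop the dataloader's background feeding thread so it can
            # release the GPU buffers it still holds before the Python
            # references are dropped.
            train_dl.stop()
            del train, train_dl
            gc.collect()  # optional: force collection of the dropped objects
            break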

bschifferer · Sep 26 '22 16:09

@benfred, please check whether there is a fix from @oliverholworthy and link it. Thank you.

viswa-nvidia · Sep 26 '22 23:09

@bschifferer did you test that solution, and did it work?

rnyak · Oct 26 '22 13:10