[BUG] Dataloader doesn't release memory, causing GPU memory growth
Bug description
I run a training script that reinitializes the NVTabular dataloader in a loop. After each initialization, the available GPU memory decreases (measured with fmem = pynvml_mem_size(kind="free", index=0)). That is unexpected: the available GPU memory should stay constant. After training for a while, I run out of memory (OOM) for that reason. The use case is training per day, where each day is a separate file.
Free GPU memory in bytes, measured at the start of each of the 25 iterations: [41587769344, 41113812992, 40744714240, 40509833216, 40390295552, 39972962304, 39866007552, 39635320832, 39299776512, 39115227136, 38846791680, 38544801792, 38276366336, 38007930880, 37789827072, 37521391616, 37269733376, 37018075136, 36758028288, 36514758656, 36263100416, 36112105472, 35701063680, 35474571264, 35222913024]
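For reference, a quick back-of-the-envelope calculation over the first and last measurements above (plain Python, no GPU needed) puts the leak at roughly 250 MiB of free GPU memory per re-initialization:

first_free = 41587769344   # free bytes before the first iteration
last_free = 35222913024    # free bytes before the 25th iteration
n_reinits = 24             # re-initializations between the two measurements
avg_drop_mib = (first_free - last_free) / n_reinits / 2**20
print(f"~{avg_drop_mib:.0f} MiB of free GPU memory lost per dataloader re-initialization")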
Steps/Code to reproduce bug
Data Generation
import cudf

df = cudf.DataFrame({
    'col1': list(range(100000000)),
    'col2': list(range(100000000)),
    'col3': list(range(100000000)),
    'target1': list(range(100000000)),
    'target2': list(range(100000000))
})
df.to_parquet('test' + str(0) + '.parquet')
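Since the real use case trains one file per day, a variant of the generation step that writes several smaller per-day files may be closer to the actual workload; the day count, row count, and file names below are only illustrative assumptions:

import cudf

N_ROWS = 10_000_000  # hypothetical per-day row count, smaller than above for speed
for day in range(5):  # hypothetical number of days
    cudf.DataFrame({
        'col1': list(range(N_ROWS)),
        'col2': list(range(N_ROWS)),
        'col3': list(range(N_ROWS)),
        'target1': list(range(N_ROWS)),
        'target2': list(range(N_ROWS)),
    }).to_parquet('test' + str(day) + '.parquet')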
Executing Script
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

import merlin.models.tf.dataset as tf_dataloader
from merlin.schema.tags import Tags
import nvtabular as nvt
from nvtabular.utils import pynvml_mem_size

def map_output(x, y):
    # Concatenate the two target columns into a single label tensor
    out = []
    for tg in ['target1', 'target2']:
        out.append(y[tg])
    y = tf.concat(out, axis=1)
    return x, y

import gc
import glob

BATCH_SIZE = 2 * 64 * 1024
fmems = []
for j in range(25):
    print(j)
    # Record the free GPU memory at the start of every iteration
    fmem = pynvml_mem_size(kind="free", index=0)
    print(fmem)
    fmems.append(fmem)
    gc.collect()
    files = sorted(glob.glob(
        'test' + str(0) + '.parquet'
    ))
    train = nvt.Dataset(files, part_size="100MB")
    # Re-initialize the dataloader on every iteration (one "day" per iteration)
    train_dl = tf_dataloader.BatchedDataset(
        train,
        batch_size=1024 * 64,
        shuffle=True,
        drop_last=True,
        cat_names=['col1', 'col2', 'col3'],
        label_names=['target1', 'target2']
    ).map(map_output)
    gc.collect()
    for i, (inputs, labels) in enumerate(train_dl):
        if i > 10:
            del train, train_dl
            break
Expected behavior
The available GPU memory should stay constant across dataloader re-initializations.
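One way to make that expectation explicit in the reproduction script is to compare the free memory before and after the loop; the tolerance below is an arbitrary assumption to allow for allocator jitter:

from nvtabular.utils import pynvml_mem_size

TOLERANCE_BYTES = 512 * 2**20  # assumed acceptable fluctuation (512 MiB)

free_before = pynvml_mem_size(kind="free", index=0)
# ... run the 25 re-initialization iterations from the script above ...
free_after = pynvml_mem_size(kind="free", index=0)
assert free_before - free_after < TOLERANCE_BYTES, "dataloader leaked GPU memory"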
A potential workaround is to call train_dl.stop() before del train, train_dl.
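A minimal sketch of where that call would go in the inner loop of the script above, assuming the object returned by .map() exposes the stop() method mentioned here:

for i, (inputs, labels) in enumerate(train_dl):
    if i > 10:
        train_dl.stop()  # stop the dataloader's background threads before dropping the references
        del train, train_dl
        break
gc.collect()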
@benfred, please check whether there is a fix from @oliverholworthy and link it here. Thank you!
@bschifferer, did you test that solution, and did it work?