Bug description

Steps/Code to reproduce bug

I have four partitions (train1/part_0.parquet, train1/part_1.parquet, train1/part_2.parquet and train1/part_3.parquet) and four GPUs, but I still get the following message: UserWarning: You have more processes(4) than dataset partitions(1), reduce the number of processes. (In the job output the message is split by the mpirun rank prefix [1,3].)

Expected behavior

Environment details: merlin-pytorch:22.12 image.

Merlin version:
Platform:
Python version: 3.8
PyTorch version (GPU?):
Tensorflow version (GPU?):2.10

Additional context

Jan 26 '23 06:01 ssubbayya

@ssubbayya this looks like a warning message. Is there an error stack after this message, or does training start and finish? @edknv, do you mind sharing your insight here?

Jan 26 '23 17:01 rnyak

@ssubbayya Can you please share more information on how you arrived at that warning? A minimal reproducible example would be great. I'm particularly confused because you are using the merlin-pytorch image, but Merlin Models doesn't have PyTorch support yet.

Jan 26 '23 17:01 edknv

@ssubbayya I assume you were using the merlin-tensorflow:22.12 image? If you want to use Merlin Models, it currently only supports TensorFlow, as @edknv mentioned.

Jan 26 '23 18:01 rnyak

Sorry, I am using nvcr.io/nvidia/merlin/merlin-tensorflow:22.12

Jan 26 '23 20:01 ssubbayya

I am trying to run the following code; I have 4 different .parquet files.

%%writefile './tf_trainer.py'

import os

MPI_SIZE = int(os.getenv("OMPI_COMM_WORLD_SIZE"))
MPI_RANK = int(os.getenv("OMPI_COMM_WORLD_RANK"))

os.environ["CUDA_VISIBLE_DEVICES"] = str(MPI_RANK)

import nvtabular as nvt
from nvtabular.ops import *

from merlin.models.utils.example_utils import workflow_fit_transform
from merlin.schema.tags import Tags

import merlin.models.tf as mm
from merlin.io.dataset import Dataset
import tensorflow as tf

import argparse

parser = argparse.ArgumentParser(
    description='Hyperparameters for model training'
)
parser.add_argument(
    '--batch-size',
    type=str,
    help='Batch-Size per GPU worker'
)
parser.add_argument(
    '--path',
    type=str,
    help='Directory with training and validation data'
)
args = parser.parse_args()

# define train and valid dataset objects
train = Dataset(os.path.join(args.path, "train", "part_" + str(MPI_RANK) + ".parquet"))
valid = Dataset(os.path.join(args.path, "valid", "part_" + str(MPI_RANK) + ".parquet"))

# define schema object
target_column = train.schema.select_by_tag(Tags.TARGET).column_names[0]

train_loader = mm.Loader(
    train,
    schema=train.schema,
    batch_size=int(args.batch_size),
    shuffle=True,
    drop_last=True,
)

valid_loader = mm.Loader(
    valid,
    schema=valid.schema,
    batch_size=int(args.batch_size),
    shuffle=False,
    drop_last=True,
)

print("Number batches: " + str(len(train_loader)))

model = mm.DLRMModel(
    train.schema,
    embedding_dim=16,
    bottom_block=mm.MLPBlock([32, 16]),
    top_block=mm.MLPBlock([32, 16]),
    prediction_tasks=mm.BinaryOutput(target_column),
)

opt = tf.keras.optimizers.Adagrad(learning_rate=0.01)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
losses = model.fit(
    train_loader
)

print(model.evaluate(valid, batch_size=int(args.batch_size), return_dict=True))

The script is launched with:

horovodrun -np 4 python tf_trainer.py --batch-size 16834 --path output
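A side note on the script's setup code: `int(os.getenv(...))` raises a `TypeError` when the script is started outside an MPI/Horovod launcher, and `--batch-size` is parsed as a string and converted later. The following is a minimal sketch of a more defensive version of that setup. The helper names `mpi_env`, `build_parser` and `shard_path` are hypothetical and are not part of the original script or of any Merlin API.

```python
import argparse
import os


def mpi_env(name, default):
    """Read an integer Open MPI variable, falling back to `default` when unset."""
    value = os.getenv(name)
    return default if value is None else int(value)


def build_parser():
    """Same two options as the script above, with the batch size parsed as int."""
    parser = argparse.ArgumentParser(description="Hyperparameters for model training")
    # type=int lets argparse reject non-numeric values before any GPU work starts.
    parser.add_argument("--batch-size", type=int, required=True,
                        help="Batch size per GPU worker")
    parser.add_argument("--path", type=str, required=True,
                        help="Directory with training and validation data")
    return parser


def shard_path(root, split, rank):
    """Path of the parquet file that a given rank reads."""
    return os.path.join(root, split, f"part_{rank}.parquet")


if __name__ == "__main__":
    rank = mpi_env("OMPI_COMM_WORLD_RANK", 0)
    args = build_parser().parse_args()
    print(shard_path(args.path, "train", rank), args.batch_size)
```

With these helpers, a single-process run falls back to rank 0 instead of failing before the first import.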

Jan 26 '23 21:01 ssubbayya

Hello @ssubbayya, thanks for reporting the bug. You are correct. I found a workaround that lets training run.

You need to:

add the parameters global_size=1, global_rank=0 when initializing the dataloaders
overwrite the property global_rank=0 after initialization, because Merlin Models overwrites the value during initialization
use valid_loader instead of valid in model.evaluate

train_loader = mm.Loader(
    train,
    schema=train.schema,
    batch_size=int(args.batch_size),
    shuffle=True,
    drop_last=True,
    global_size=1,
    global_rank=0,
)

valid_loader = mm.Loader(
    valid,
    schema=valid.schema,
    batch_size=int(args.batch_size),
    shuffle=False,
    drop_last=True,
    global_size=1,
    global_rank=0,
)

train_loader.global_rank = 0
valid_loader.global_rank = 0
print("Number batches: " + str(len(train_loader)))

model = mm.DLRMModel(
    train.schema,
    embedding_dim=16,
    bottom_block=mm.MLPBlock([32, 16]),
    top_block=mm.MLPBlock([32, 16]),
    prediction_tasks=mm.BinaryOutput(target_column),
)

opt = tf.keras.optimizers.Adagrad(learning_rate=0.01)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
losses = model.fit(
    train_loader
)

print(model.evaluate(valid_loader, batch_size=int(args.batch_size), return_dict=True))
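One plausible reading of why this workaround silences the warning: each rank opens only its own file, so its Dataset reports a single partition, while the loader sees a world size of 4 from the launcher. The sketch below only mimics that comparison using the wording of the warning in this thread. `check_partitions` is a hypothetical helper, not the Merlin dataloader's actual code.

```python
import warnings


def check_partitions(npartitions, global_size):
    """Warn when there are more workers than partitions to hand out.

    Illustrative only: the message text is copied from this thread,
    the logic is an assumption about what triggers it.
    """
    if global_size > npartitions:
        warnings.warn(
            f"You have more processes({global_size}) than dataset "
            f"partitions({npartitions}), reduce the number of processes."
        )
        return False
    return True


if __name__ == "__main__":
    check_partitions(npartitions=1, global_size=4)  # warns, as in the report
    check_partitions(npartitions=1, global_size=1)  # silent, as in the workaround
```

Under that reading, passing global_size=1 tells each loader that it owns its whole single-file dataset, which matches a setup where the sharding is already done by file name.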

Feb 02 '23 10:02 bschifferer

Hi @bschifferer, thanks very much for your efforts to track this bug. It always runs out of memory, even for small data. My data has 4 parquet files for training totaling 1.3 GB and 4 validation parquet files totaling 1.3 GB. I have 4 GPUs with 32 GB each. I think distributed processing works, but distributed GPU training does not work in TensorFlow.

2023-02-04 01:01:41.279308: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-04 01:01:46.876865: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-04 01:01:47.053076: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-04 01:01:47.107636: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:2023-02-04 01:01:47.138848: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,1]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,2]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,0]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,0]:2023-02-04 01:01:49.537434: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-04 01:01:49.538719: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:2023-02-04 01:01:49.538837: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-04 01:01:49.540965: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-04 01:01:51.383456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3a:00.0, compute capability: 7.0
[1,2]:2023-02-04 01:01:51.408323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
[1,3]:2023-02-04 01:01:51.409386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
[1,1]:2023-02-04 01:01:51.527416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
[1,1]:Number batches: 23
[1,3]:Number batches: 17
[1,2]:Number batches: 23
[1,0]:Number batches: 23
[1,1]:Traceback (most recent call last):
[1,1]:  File "tf_trainer.py", line 77, in <module>
[1,1]:    losses = model.fit(
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,1]:    out = super().fit(**fit_kwargs)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,1]:    raise e.with_traceback(filtered_tb) from None
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 154, in __getitem__
[1,1]:    return LoaderBase.__next__(self)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 282, in __next__
[1,1]:    return self._get_next_batch()
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 349, in _get_next_batch
[1,1]:    self._fetch_chunk()
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 298, in _fetch_chunk
[1,1]:    raise chunks
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 765, in load_chunks
[1,1]:    self.chunk_logic(itr)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]:    result = func(*args, **kwargs)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 747, in chunk_logic
[1,1]:    chunks = self.dataloader.make_tensors(chunks, self.dataloader._use_nnz)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]:    result = func(*args, **kwargs)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 392, in make_tensors
[1,1]:    chunks, names = self._create_tensors(gdf)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]:    result = func(*args, **kwargs)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 569, in _create_tensors
[1,1]:    x = self._to_tensor(gdf_i[scalars])
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 220, in _to_tensor
[1,1]:    dlpack = self._pack(gdf.values.T)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 426, in values
[1,1]:    return self.to_cupy()
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]:    result = func(*args, **kwargs)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 526, in to_cupy
[1,1]:    return self._to_array(
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 484, in _to_array
[1,1]:    matrix = make_empty_matrix(
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/cupy/_creation/basic.py", line 22, in empty
[1,1]:    return cupy.ndarray(shape, dtype, order=order)
[1,1]:  File "cupy/_core/core.pyx", line 171, in cupy._core.core.ndarray.__init__
[1,1]:  File "cupy/cuda/memory.pyx", line 698, in cupy.cuda.memory.alloc
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/rmm/rmm.py", line 232, in rmm_cupy_allocator
[1,1]:    buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
[1,1]:  File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
[1,1]:MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/local/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
[1,0]:Traceback (most recent call last):
[1,0]:  File "tf_trainer.py", line 77, in <module>
[1,0]:    losses = model.fit(
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,0]:    out = super().fit(**fit_kwargs)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,0]:    raise e.with_traceback(filtered_tb) from None
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 154, in __getitem__
[1,0]:    return LoaderBase.__next__(self)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 282, in __next__
[1,0]:    return self._get_next_batch()
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 349, in _get_next_batch
[1,0]:    self._fetch_chunk()
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 298, in _fetch_chunk
[1,0]:    raise chunks
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 765, in load_chunks
[1,0]:    self.chunk_logic(itr)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]:    result = func(*args, **kwargs)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 747, in chunk_logic
[1,0]:    chunks = self.dataloader.make_tensors(chunks, self.dataloader._use_nnz)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]:    result = func(*args, **kwargs)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 392, in make_tensors
[1,0]:    chunks, names = self._create_tensors(gdf)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]:    result = func(*args, **kwargs)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 569, in _create_tensors
[1,0]:    x = self._to_tensor(gdf_i[scalars])
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 220, in _to_tensor
[1,0]:    dlpack = self._pack(gdf.values.T)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 426, in values
[1,0]:    return self.to_cupy()
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]:    result = func(*args, **kwargs)
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 526, in to_cupy
[1,0]:    return self._to_array(
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 484, in _to_array
[1,0]:    matrix = make_empty_matrix(
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/cupy/_creation/basic.py", line 22, in empty
[1,0]:    return cupy.ndarray(shape, dtype, order=order)
[1,0]:  File "cupy/_core/core.pyx", line 171, in cupy._core.core.ndarray.__init__
[1,0]:  File "cupy/cuda/memory.pyx", line 698, in cupy.cuda.memory.alloc
[1,0]:  File "/usr/local/lib/python3.8/dist-packages/rmm/rmm.py", line 232, in rmm_cupy_allocator
[1,0]:    buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
[1,0]:  File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
[1,0]:MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/local/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
[1,2]:Traceback (most recent call last):
[1,2]:  File "tf_trainer.py", line 77, in <module>
[1,2]:    losses = model.fit(
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,2]:    out = super().fit(**fit_kwargs)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,2]:    raise e.with_traceback(filtered_tb) from None
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 154, in __getitem__
[1,2]:    return LoaderBase.__next__(self)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 282, in __next__
[1,2]:    return self._get_next_batch()
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 349, in _get_next_batch
[1,2]:    self._fetch_chunk()
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 298, in _fetch_chunk
[1,2]:    raise chunks
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 765, in load_chunks
[1,2]:    self.chunk_logic(itr)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]:    result = func(*args, **kwargs)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 747, in chunk_logic
[1,2]:    chunks = self.dataloader.make_tensors(chunks, self.dataloader._use_nnz)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]:    result = func(*args, **kwargs)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 392, in make_tensors
[1,2]:    chunks, names = self._create_tensors(gdf)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]:    result = func(*args, **kwargs)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 569, in _create_tensors
[1,2]:    x = self._to_tensor(gdf_i[scalars])
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 220, in _to_tensor
[1,2]:    dlpack = self._pack(gdf.values.T)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 426, in values
[1,2]:    return self.to_cupy()
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]:    result = func(*args, **kwargs)
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 526, in to_cupy
[1,2]:    return self._to_array(
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 484, in _to_array
[1,2]:    matrix = make_empty_matrix(
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/cupy/_creation/basic.py", line 22, in empty
[1,2]:    return cupy.ndarray(shape, dtype, order=order)
[1,2]:  File "cupy/_core/core.pyx", line 171, in cupy._core.core.ndarray.__init__
[1,2]:  File "cupy/cuda/memory.pyx", line 698, in cupy.cuda.memory.alloc
[1,2]:  File "/usr/local/lib/python3.8/dist-packages/rmm/rmm.py", line 232, in rmm_cupy_allocator
[1,2]:    buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
[1,2]:  File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
[1,2]:MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/local/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[19430,1],1] Exit code: 1

Feb 04 '23 01:02 ssubbayya

train = Dataset(os.path.join(args.path, "train", "part_" + str(MPI_RANK) + ".parquet"))
valid = Dataset(os.path.join(args.path, "valid", "part_" + str(MPI_RANK) + ".parquet"))

Can you try adding the part_size parameter to the Dataset calls above?

Dataset(os.path.join(args.path, "train", "part_" + str(MPI_RANK) + ".parquet"), part_size='100MB') (or 300MB or 500MB)?
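To get a feel for what a smaller part_size means, here is a rough back-of-the-envelope sketch. It assumes that "MB" means 10**6 bytes and that the partition count is simply the data size divided by part_size, rounded up. Both are simplifying assumptions, since the in-memory size of a parquet file usually differs from its size on disk. `parse_size` and `estimated_partitions` are hypothetical helpers, not Merlin functions.

```python
import math

# Assumed decimal units; the library may interpret the strings differently.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text):
    """Turn a string such as '100MB' into a number of bytes."""
    number, unit = text[:-2], text[-2:].upper()
    return int(float(number) * UNITS[unit])


def estimated_partitions(data_bytes, part_size):
    """Estimate how many partitions a dataset of `data_bytes` is split into."""
    return max(1, math.ceil(data_bytes / parse_size(part_size)))


if __name__ == "__main__":
    # Roughly one quarter of the 1.3 GB reported in this thread, per rank.
    per_rank_bytes = 325_000_000
    for size in ("100MB", "300MB", "500MB"):
        print(size, estimated_partitions(per_rank_bytes, size))
```

Smaller partitions mean that each chunk moved to the GPU is smaller, at the cost of more chunks per epoch.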

Feb 06 '23 16:02 bschifferer

@ssubbayya please also add these lines at the very beginning of your code/notebook:

import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
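For a script launched through horovodrun, a simple way to follow this advice is to set the variable at the top of tf_trainer.py itself, ahead of the TensorFlow and Merlin imports. The sketch below shows only that ordering. The use of `setdefault` (which keeps a value already exported by the launcher) and the `allocator_configured` helper are illustrative choices, not something prescribed in this thread.

```python
import os

# Set the allocator before TensorFlow is imported, so the setting is in
# place when the GPU is initialized. setdefault keeps any value that was
# already exported in the environment.
os.environ.setdefault("TF_GPU_ALLOCATOR", "cuda_malloc_async")


def allocator_configured():
    """Return True when the async CUDA allocator is the configured one."""
    return os.environ.get("TF_GPU_ALLOCATOR") == "cuda_malloc_async"


# import tensorflow as tf        # imports that touch the GPU go after this point
# import merlin.models.tf as mm

if __name__ == "__main__":
    print("TF_GPU_ALLOCATOR =", os.environ["TF_GPU_ALLOCATOR"])
```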

Feb 06 '23 16:02 rnyak

@rnyak Thanks! It worked. Now the out of memory error is gone. Now, I get the following error.

2023-02-06 19:25:39.699632: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:2023-02-06 19:25:45.142455: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-06 19:25:45.297067: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-06 19:25:45.297315: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-06 19:25:45.354235: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,0]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,3]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,1]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,2]:2023-02-06 19:25:47.727431: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-06 19:25:47.727842: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-06 19:25:47.729502: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-06 19:25:47.729960: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-06 19:25:49.552369: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,3]:2023-02-06 19:25:49.553542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0
[1,1]:2023-02-06 19:25:49.574705: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,1]:2023-02-06 19:25:49.575211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:16:00.0, compute capability: 7.0
[1,2]:2023-02-06 19:25:49.590874: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,2]:2023-02-06 19:25:49.591319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3a:00.0, compute capability: 7.0
[1,0]:2023-02-06 19:25:49.607551: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,0]:2023-02-06 19:25:49.607823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:15:00.0, compute capability: 7.0
[1,1]:Number batches: 53
[1,0]:Number batches: 53
[1,3]:Number batches: 47
[1,2]:Number batches: 53
[1,3]:Traceback (most recent call last):
[1,3]:  File "tf_trainer.py", line 77, in <module>
[1,3]:    losses = model.fit(
[1,3]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,3]:    out = super().fit(**fit_kwargs)
[1,3]:  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,3]:    raise e.with_traceback(filtered_tb) from None
[1,3]:  File "/tmp/autograph_generated_filerhvin01v.py", line 15, in tf__train_function
[1,3]:    retval = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
[1,3]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,3]:    outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,3]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,3]:    self.adjust_predictions_and_targets(predictions, targets)
[1,3]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,3]:    targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,3]:ValueError: in user code:
[1,3]:
[1,3]:    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1160, in train_function *
[1,3]:        return step_function(self, iterator)
[1,3]:    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1146, in step_function
[1,3]:        outputs = model.distribute_strategy.run(run_step, args=(data,))
[1,3]:    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1135, in run_step
[1,3]:        outputs = model.train_step(data)
[1,3]:    File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,3]:        outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,3]:    File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,3]:        self.adjust_predictions_and_targets(predictions, targets)
[1,3]:    File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,3]:        targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,3]:
[1,3]:    ValueError: None values not supported.
[1,3]:

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

[1,1]:Traceback (most recent call last):
[1,1]:  File "tf_trainer.py", line 77, in <module>
[1,1]:    losses = model.fit(
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,1]:    out = super().fit(**fit_kwargs)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,1]:    raise e.with_traceback(filtered_tb) from None
[1,1]:  File "/tmp/autograph_generated_filedtp7mhmf.py", line 15, in tf__train_function
[1,1]:    retval = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,1]:    outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,1]:    self.adjust_predictions_and_targets(predictions, targets)
[1,1]:  File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,1]:    targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,1]:ValueError: in user code:
[1,1]:
[1,1]:    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1160, in train_function *
[1,1]:        return step_function(self, iterator)
[1,1]:    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1146, in step_function
[1,1]:        outputs = model.distribute_strategy.run(run_step, args=(data,))
[1,1]:    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1135, in run_step
[1,1]:        outputs = model.train_step(data)
[1,1]:    File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,1]:        outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,1]:    File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,1]:        self.adjust_predictions_and_targets(predictions, targets)
[1,1]:    File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,1]:        targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,1]:
[1,1]:    ValueError: None values not supported.
[1,1]:

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[59958,1],3] Exit code: 1

Feb 06 '23 20:02 ssubbayya

@ssubbayya

"ValueError: None values not supported." suggests that the dataset contains NaN / None values. Is that correct?

You should be able to test it like this: Dataset().to_ddf().isna().sum().compute()

Can you fill the NaN values?

It also seems that the dataset is not balanced:

[1,1]:Number batches: 53
[1,0]:Number batches: 53
[1,3]:Number batches: 47
[1,2]:Number batches: 53

Worker 3 has only 47 batches while the others have 53. I think that will be another problem once the NaN values are resolved.
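One common way to deal with unequal batch counts in synchronous multi-worker training is to cap every worker at the smallest count, for example through the `steps_per_epoch` argument of Keras `model.fit`. The sketch below only shows the arithmetic. The row counts are hypothetical, and in a real job the per-worker counts would first have to be exchanged between ranks with whatever collective the job uses, which is not shown here.

```python
def batches_per_worker(rows, batch_size, drop_last=True):
    """Number of batches a loader yields for `rows` rows."""
    full, remainder = divmod(rows, batch_size)
    # With drop_last=True the final partial batch is discarded.
    return full if drop_last or remainder == 0 else full + 1


def common_steps(batch_counts):
    """Largest number of steps that every worker can complete."""
    return min(batch_counts)


if __name__ == "__main__":
    # Batch counts reported by the four ranks in this thread.
    counts = [53, 53, 47, 53]
    steps = common_steps(counts)
    print("steps_per_epoch =", steps)
    # Possible use (not run here): model.fit(train_loader, steps_per_epoch=steps)
```

Capping the steps drops a few batches on the larger shards each epoch. Repartitioning the data into equally sized files avoids that loss.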

Feb 07 '23 13:02 bschifferer

@rnyak The same dataset works well with PyTorch. I checked, and it does not have NaN values. I think some function is returning a None value.
I do not know how to fix the unequal number of batches. That is the way NVTabular partitions the data.

Feb 07 '23 19:02 ssubbayya

@rnyak

print(train.to_ddf().isna().sum().compute().sum())
print(train1.to_ddf().isna().sum().compute().sum())
print(train2.to_ddf().isna().sum().compute().sum())
print(train3.to_ddf().isna().sum().compute().sum())

All returned zero. It looks like some of the TensorFlow/NVIDIA functions might have indentation issues. A function with an indentation issue might return None.

Feb 07 '23 19:02 ssubbayya

[BUG] UserWarning: You have more processes(4) than dataset [1,1]<stderr>: partitions(1), reduce the number of processes.
