[BUG] UserWarning: You have more processes(4) than dataset partitions(1), reduce the number of processes.
Bug description
Steps/Code to reproduce bug
- I have four partitions, in four files:
train1/part_0.parquet
train1/part_1.parquet
train1/part_2.parquet
train1/part_3.parquet
and four GPUs.
I still get the error message:
[1,3]:UserWarning: You have more processes(4) than dataset partitions(1), reduce the number of processes.
Expected behavior
Environment details: merlin-pytorch:22.12 image.
- Merlin version:
- Platform:
- Python version: 3.8
- PyTorch version (GPU?):
- Tensorflow version (GPU?):2.10
Additional context
@ssubbayya this looks like a warning message. Is there an error stack after it, or does training start and finish? @edknv, would you mind sharing your insight here?
@ssubbayya Can you please share more information on how you arrived at that warning? A minimal reproducible example would be great. I'm particularly confused because you are using the merlin-pytorch image, but Merlin Models doesn't have PyTorch support yet.
@ssubbayya I assume you were using the merlin-tensorflow:22.12 image? If you want to use Merlin Models, it currently only supports TensorFlow, as @edknv mentioned.
Sorry, I am using nvcr.io/nvidia/merlin/merlin-tensorflow:22.12
I am trying to run the following code; I have 4 different .parquet files.
%%writefile './tf_trainer.py'
import os
MPI_SIZE = int(os.getenv("OMPI_COMM_WORLD_SIZE"))
MPI_RANK = int(os.getenv("OMPI_COMM_WORLD_RANK"))
os.environ["CUDA_VISIBLE_DEVICES"] = str(MPI_RANK)
import nvtabular as nvt
from nvtabular.ops import *
from merlin.models.utils.example_utils import workflow_fit_transform
from merlin.schema.tags import Tags
import merlin.models.tf as mm
from merlin.io.dataset import Dataset
import tensorflow as tf
import argparse
parser = argparse.ArgumentParser(
    description='Hyperparameters for model training'
)
parser.add_argument(
    '--batch-size',
    type=str,
    help='Batch-Size per GPU worker'
)
parser.add_argument(
    '--path',
    type=str,
    help='Directory with training and validation data'
)
args = parser.parse_args()
# define train and valid dataset objects
train = Dataset(os.path.join(args.path, "train", "part_" + str(MPI_RANK) + ".parquet"))
valid = Dataset(os.path.join(args.path, "valid", "part_" + str(MPI_RANK) + ".parquet"))
# define schema object
target_column = train.schema.select_by_tag(Tags.TARGET).column_names[0]
train_loader = mm.Loader(
    train,
    schema=train.schema,
    batch_size=int(args.batch_size),
    shuffle=True,
    drop_last=True,
)
valid_loader = mm.Loader(
    valid,
    schema=valid.schema,
    batch_size=int(args.batch_size),
    shuffle=False,
    drop_last=True,
)
print("Number batches: " + str(len(train_loader)))
model = mm.DLRMModel(
    train.schema,
    embedding_dim=16,
    bottom_block=mm.MLPBlock([32, 16]),
    top_block=mm.MLPBlock([32, 16]),
    prediction_tasks=mm.BinaryOutput(target_column),
)
opt = tf.keras.optimizers.Adagrad(learning_rate=0.01)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
losses = model.fit(
    train_loader
)
print(model.evaluate(valid, batch_size=int(args.batch_size), return_dict=True))
horovodrun -np 4 python tf_trainer.py --batch-size 16834 --path output
Hello @ssubbayya, thanks for reporting the bug. You are correct. I found a workaround that lets training run.
You need to:
- add the parameters global_size=1, global_rank=0 when initializing the dataloaders
- overwrite the property global_rank=0 again after initialization, because Merlin Models overwrites the value during initialization
- use valid_loader instead of valid in model.evaluate
train_loader = mm.Loader(
train,
schema=train.schema,
batch_size=int(args.batch_size),
shuffle=True,
drop_last=True,
global_size=1,
global_rank=0,
)
valid_loader = mm.Loader(
valid,
schema=valid.schema,
batch_size=int(args.batch_size),
shuffle=False,
drop_last=True,
global_size=1,
global_rank=0,
)
train_loader.global_rank = 0
valid_loader.global_rank = 0
print("Number batches: " + str(len(train_loader)))
model = mm.DLRMModel(
train.schema,
embedding_dim=16,
bottom_block=mm.MLPBlock([32, 16]),
top_block=mm.MLPBlock([32, 16]),
prediction_tasks=mm.BinaryOutput(target_column),
)
opt = tf.keras.optimizers.Adagrad(learning_rate=0.01)
model.compile(optimizer=opt, run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
losses = model.fit(
train_loader
)
print(model.evaluate(valid_loader, batch_size=int(args.batch_size), return_dict=True))
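One possible reading of why this workaround silences the warning: under horovodrun each of the four workers opens only its own part_<rank>.parquet, so every loader sees a dataset with a single partition while still assuming that four processes share it. The sketch below mimics that check in plain Python. `check_partitions` is a hypothetical helper written for illustration; it is not the Merlin dataloader's actual code.

```python
import warnings


def check_partitions(global_size: int, npartitions: int) -> bool:
    """Hypothetical stand-in for the loader's sanity check.

    Returns True when the dataset has enough partitions to give every
    process at least one, and emits a warning otherwise.
    """
    if global_size > npartitions:
        warnings.warn(
            f"You have more processes({global_size}) than dataset "
            f"partitions({npartitions}), reduce the number of processes."
        )
        return False
    return True


# Each rank loads exactly one parquet file, so npartitions is 1 per loader.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Default case: the loader assumes the Horovod world size of 4.
    default_ok = check_partitions(global_size=4, npartitions=1)
    # Workaround case: global_size is overridden to 1.
    workaround_ok = check_partitions(global_size=1, npartitions=1)

print(default_ok, workaround_ok, len(caught))
```

With global_size=1 each loader treats its single file as the whole dataset, which matches how the script already splits the data by rank.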
Hi @bschifferer
Thanks very much for your efforts to track this bug. It now always runs out of memory, even for small data. My data has 4 parquet files for training totaling 1.3 GB and 4 validation parquet files totaling 1.3 GB. I have 4 GPUs with 32 GB each. I think distributed data processing works, but distributed GPU training does not work in TensorFlow.
2023-02-04 01:01:41.279308: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-04 01:01:46.876865: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-04 01:01:47.053076: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-04 01:01:47.107636: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:2023-02-04 01:01:47.138848: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,1]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,2]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,0]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,0]:2023-02-04 01:01:49.537434: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-04 01:01:49.538719: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:2023-02-04 01:01:49.538837: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-04 01:01:49.540965: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-04 01:01:51.383456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3a:00.0, compute capability: 7.0
[1,2]:2023-02-04 01:01:51.408323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
[1,3]:2023-02-04 01:01:51.409386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
[1,1]:2023-02-04 01:01:51.527416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0
[1,1]:Number batches: 23
[1,3]:Number batches: 17
[1,2]:Number batches: 23
[1,0]:Number batches: 23
[1,1]:Traceback (most recent call last):
[1,1]: File "tf_trainer.py", line 77, in <module>
[1,1]: losses = model.fit(
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,1]: out = super().fit(**fit_kwargs)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,1]: raise e.with_traceback(filtered_tb) from None
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 154, in __getitem__
[1,1]: return LoaderBase.__next__(self)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 282, in __next__
[1,1]: return self._get_next_batch()
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 349, in _get_next_batch
[1,1]: self._fetch_chunk()
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 298, in _fetch_chunk
[1,1]: raise chunks
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 765, in load_chunks
[1,1]: self.chunk_logic(itr)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]: result = func(*args, **kwargs)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 747, in chunk_logic
[1,1]: chunks = self.dataloader.make_tensors(chunks, self.dataloader._use_nnz)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]: result = func(*args, **kwargs)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 392, in make_tensors
[1,1]: chunks, names = self._create_tensors(gdf)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]: result = func(*args, **kwargs)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 569, in _create_tensors
[1,1]: x = self._to_tensor(gdf_i[scalars])
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 220, in _to_tensor
[1,1]: dlpack = self._pack(gdf.values.T)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 426, in values
[1,1]: return self.to_cupy()
[1,1]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,1]: result = func(*args, **kwargs)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 526, in to_cupy
[1,1]: return self._to_array(
[1,1]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 484, in _to_array
[1,1]: matrix = make_empty_matrix(
[1,1]: File "/usr/local/lib/python3.8/dist-packages/cupy/_creation/basic.py", line 22, in empty
[1,1]: return cupy.ndarray(shape, dtype, order=order)
[1,1]: File "cupy/_core/core.pyx", line 171, in cupy._core.core.ndarray.__init__
[1,1]: File "cupy/cuda/memory.pyx", line 698, in cupy.cuda.memory.alloc
[1,1]: File "/usr/local/lib/python3.8/dist-packages/rmm/rmm.py", line 232, in rmm_cupy_allocator
[1,1]: buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
[1,1]: File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
[1,1]:MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/local/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
[1,0]:Traceback (most recent call last):
[1,0]: File "tf_trainer.py", line 77, in <module>
[1,0]: losses = model.fit(
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,0]: out = super().fit(**fit_kwargs)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,0]: raise e.with_traceback(filtered_tb) from None
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 154, in __getitem__
[1,0]: return LoaderBase.__next__(self)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 282, in __next__
[1,0]: return self._get_next_batch()
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 349, in _get_next_batch
[1,0]: self._fetch_chunk()
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 298, in _fetch_chunk
[1,0]: raise chunks
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 765, in load_chunks
[1,0]: self.chunk_logic(itr)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]: result = func(*args, **kwargs)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 747, in chunk_logic
[1,0]: chunks = self.dataloader.make_tensors(chunks, self.dataloader._use_nnz)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]: result = func(*args, **kwargs)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 392, in make_tensors
[1,0]: chunks, names = self._create_tensors(gdf)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]: result = func(*args, **kwargs)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 569, in _create_tensors
[1,0]: x = self._to_tensor(gdf_i[scalars])
[1,0]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 220, in _to_tensor
[1,0]: dlpack = self._pack(gdf.values.T)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 426, in values
[1,0]: return self.to_cupy()
[1,0]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,0]: result = func(*args, **kwargs)
[1,0]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 526, in to_cupy
[1,0]: return self._to_array(
[1,0]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 484, in _to_array
[1,0]: matrix = make_empty_matrix(
[1,0]: File "/usr/local/lib/python3.8/dist-packages/cupy/_creation/basic.py", line 22, in empty
[1,0]: return cupy.ndarray(shape, dtype, order=order)
[1,0]: File "cupy/_core/core.pyx", line 171, in cupy._core.core.ndarray.__init__
[1,0]: File "cupy/cuda/memory.pyx", line 698, in cupy.cuda.memory.alloc
[1,0]: File "/usr/local/lib/python3.8/dist-packages/rmm/rmm.py", line 232, in rmm_cupy_allocator
[1,0]: buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
[1,0]: File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
[1,0]:MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/local/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
[1,2]:Traceback (most recent call last):
[1,2]: File "tf_trainer.py", line 77, in <module>
[1,2]: losses = model.fit(
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,2]: out = super().fit(**fit_kwargs)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,2]: raise e.with_traceback(filtered_tb) from None
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 154, in __getitem__
[1,2]: return LoaderBase.__next__(self)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 282, in __next__
[1,2]: return self._get_next_batch()
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 349, in _get_next_batch
[1,2]: self._fetch_chunk()
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 298, in _fetch_chunk
[1,2]: raise chunks
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 765, in load_chunks
[1,2]: self.chunk_logic(itr)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]: result = func(*args, **kwargs)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 747, in chunk_logic
[1,2]: chunks = self.dataloader.make_tensors(chunks, self.dataloader._use_nnz)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]: result = func(*args, **kwargs)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 392, in make_tensors
[1,2]: chunks, names = self._create_tensors(gdf)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]: result = func(*args, **kwargs)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py", line 569, in _create_tensors
[1,2]: x = self._to_tensor(gdf_i[scalars])
[1,2]: File "/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py", line 220, in _to_tensor
[1,2]: dlpack = self._pack(gdf.values.T)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 426, in values
[1,2]: return self.to_cupy()
[1,2]: File "/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py", line 101, in inner
[1,2]: result = func(*args, **kwargs)
[1,2]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 526, in to_cupy
[1,2]: return self._to_array(
[1,2]: File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 484, in _to_array
[1,2]: matrix = make_empty_matrix(
[1,2]: File "/usr/local/lib/python3.8/dist-packages/cupy/_creation/basic.py", line 22, in empty
[1,2]: return cupy.ndarray(shape, dtype, order=order)
[1,2]: File "cupy/_core/core.pyx", line 171, in cupy._core.core.ndarray.__init__
[1,2]: File "cupy/cuda/memory.pyx", line 698, in cupy.cuda.memory.alloc
[1,2]: File "/usr/local/lib/python3.8/dist-packages/rmm/rmm.py", line 232, in rmm_cupy_allocator
[1,2]: buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
[1,2]: File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
[1,2]:MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/local/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[19430,1],1] Exit code: 1
train = Dataset(os.path.join(args.path, "train", "part_" + str(MPI_RANK) + ".parquet"))
valid = Dataset(os.path.join(args.path, "valid", "part_" + str(MPI_RANK) + ".parquet"))
Can you try adding the part_size parameter to the Dataset calls above? For example:
Dataset(os.path.join(args.path, "train", "part_" + str(MPI_RANK) + ".parquet"), part_size='100MB') (or '300MB' or '500MB')
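As a rough illustration of what part_size changes: a smaller partition size means each chunk handed to the GPU is smaller, at the cost of more partitions per file. The helpers below are hypothetical back-of-envelope arithmetic that assumes a file is simply divided by its on-disk size. The real Dataset works on parquet row groups and in-memory sizes, so actual partition counts will differ.

```python
import math

# Decimal units, used here only for a rough estimate.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text: str) -> int:
    """Convert a string such as '100MB' into a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # plain byte count without a unit


def estimate_partitions(file_bytes: int, part_size: str) -> int:
    """Estimate how many partitions a file of file_bytes would be cut into."""
    return max(1, math.ceil(file_bytes / parse_size(part_size)))


# 1.3 GB of training data over 4 files is roughly 325 MB per file.
per_file = int(1.3e9 / 4)
estimates = {size: estimate_partitions(per_file, size)
             for size in ("100MB", "300MB", "500MB")}
print(estimates)
```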
@ssubbayya please also add these lines at the very beginning of your code/notebook:
import os
os.environ["TF_GPU_ALLOCATOR"]="cuda_malloc_async"
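The reason for putting these lines at the very beginning is ordering: the variable should be in the environment before TensorFlow sets up the GPU, and setting it before the import is the safe way to guarantee that. A minimal sketch of a guard for this, where `set_tf_allocator` is a hypothetical helper and not part of TensorFlow or Merlin:

```python
import os
import sys


def set_tf_allocator(name: str = "cuda_malloc_async") -> bool:
    """Set TF_GPU_ALLOCATOR and report whether it was set early enough.

    Returns False if tensorflow has already been imported, in which case
    the setting may come too late to take effect.
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_GPU_ALLOCATOR"] = name
    return not already_imported


in_time = set_tf_allocator()
print(os.environ["TF_GPU_ALLOCATOR"], in_time)
```

In tf_trainer.py this would mean setting the variable next to the CUDA_VISIBLE_DEVICES assignment, above the nvtabular, merlin, and tensorflow imports.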
@rnyak Thanks! It worked. The out-of-memory error is gone, but now I get the following error:
2023-02-06 19:25:39.699632: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:2023-02-06 19:25:45.142455: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-06 19:25:45.297067: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-06 19:25:45.297315: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-06 19:25:45.354235: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,2]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,0]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,3]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,1]:WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.
[1,2]:2023-02-06 19:25:47.727431: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,2]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]:2023-02-06 19:25:47.727842: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,1]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-06 19:25:47.729502: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,3]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]:2023-02-06 19:25:47.729960: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
[1,0]:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,3]:2023-02-06 19:25:49.552369: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,3]:2023-02-06 19:25:49.553542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b3:00.0, compute capability: 7.0
[1,1]:2023-02-06 19:25:49.574705: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,1]:2023-02-06 19:25:49.575211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:16:00.0, compute capability: 7.0
[1,2]:2023-02-06 19:25:49.590874: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,2]:2023-02-06 19:25:49.591319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3a:00.0, compute capability: 7.0
[1,0]:2023-02-06 19:25:49.607551: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
[1,0]:2023-02-06 19:25:49.607823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 16384 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:15:00.0, compute capability: 7.0
[1,1]:Number batches: 53
[1,0]:Number batches: 53
[1,3]:Number batches: 47
[1,2]:Number batches: 53
[1,3]:Traceback (most recent call last):
[1,3]: File "tf_trainer.py", line 77, in <module>
[1,3]: losses = model.fit(
[1,3]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,3]: out = super().fit(**fit_kwargs)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,3]: raise e.with_traceback(filtered_tb) from None
[1,3]: File "/tmp/autograph_generated_filerhvin01v.py", line 15, in tf__train_function
[1,3]: retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,3]: outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,3]: self.adjust_predictions_and_targets(predictions, targets)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,3]: targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,3]:ValueError: in user code:
[1,3]:
[1,3]: File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1160, in train_function *
[1,3]: return step_function(self, iterator)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1146, in step_function **
[1,3]: outputs = model.distribute_strategy.run(run_step, args=(data,))
[1,3]: File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1135, in run_step **
[1,3]: outputs = model.train_step(data)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,3]: outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,3]: self.adjust_predictions_and_targets(predictions, targets)
[1,3]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,3]: targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,3]:
[1,3]: ValueError: None values not supported.
[1,3]:
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
[1,1]:Traceback (most recent call last):
[1,1]: File "tf_trainer.py", line 77, in <module>
[1,1]: losses = model.fit(
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 969, in fit
[1,1]: out = super().fit(**fit_kwargs)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,1]: raise e.with_traceback(filtered_tb) from None
[1,1]: File "/tmp/autograph_generated_filedtp7mhmf.py", line 15, in tf__train_function
[1,1]: retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,1]: outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,1]: self.adjust_predictions_and_targets(predictions, targets)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,1]: targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,1]:ValueError: in user code:
[1,1]:
[1,1]: File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1160, in train_function *
[1,1]: return step_function(self, iterator)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1146, in step_function **
[1,1]: outputs = model.distribute_strategy.run(run_step, args=(data,))
[1,1]: File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1135, in run_step **
[1,1]: outputs = model.train_step(data)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 767, in train_step
[1,1]: outputs = self.call_train_test(x, y, sample_weight=sample_weight, training=True)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 674, in call_train_test
[1,1]: self.adjust_predictions_and_targets(predictions, targets)
[1,1]: File "/usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py", line 718, in adjust_predictions_and_targets
[1,1]: targets[k] = tf.cast(targets[k], predictions[k].dtype)
[1,1]:
[1,1]: ValueError: None values not supported.
[1,1]:
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[59958,1],3] Exit code: 1
@ssubbayya
"ValueError: None values not supported." suggests that the dataset contains NaN/None values. Is that correct?
You should be able to test it like this: Dataset().to_ddf().isna().sum().compute()
Can you fill the NaN values?
It also seems that the dataset is not balanced:
[1,1]:Number batches: 53
[1,0]:Number batches: 53
[1,3]:Number batches: 47
[1,2]:Number batches: 53
Worker 3 has only 47 batches while the others have 53. I think that will be another problem once the NaN values are resolved.
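One common way to deal with uneven batch counts in data-parallel training is to cap every worker at the smallest count, so that no worker waits on a step the others never reach. The sketch below only does the arithmetic on the batch counts quoted above. `common_steps` is a hypothetical helper; in a real run the minimum would have to be agreed across ranks (for example with an allreduce) and then passed to training, possibly as Keras's steps_per_epoch, which is an assumption and has not been verified against Merlin Models here.

```python
def common_steps(batches_per_worker: dict) -> int:
    """Return the number of steps that every worker can complete."""
    return min(batches_per_worker.values())


# Batch counts reported by the four ranks.
batches = {0: 53, 1: 53, 2: 53, 3: 47}

steps = common_steps(batches)
# Number of batches each rank would leave unused under this cap.
dropped = {rank: n - steps for rank, n in batches.items()}
print(steps, dropped)
```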
@rnyak
The same dataset works well with PyTorch. I checked, and it does not contain NaN values. I think some function is returning a None value.
I do not know how to fix the unequal number of batches. That is the way NVTabular partitions the data.
@rnyak
print(train.to_ddf().isna().sum().compute().sum())
print(train1.to_ddf().isna().sum().compute().sum())
print(train2.to_ddf().isna().sum().compute().sum())
print(train3.to_ddf().isna().sum().compute().sum())
All returned zero. It looks like some of the TensorFlow/NVIDIA functions might have indentation issues. If there are indentation issues, a function might return None.