DeepRec multi-machine, multi-gpu sok core dump

System information

OS Platform and Distribution (e.g., Linux Ubuntu 20.04):
DeepRec version or commit id: 3204 release
Python version: 3.8
Bazel version (if compiling from source): 5.3.1
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 11.6.2

Describe the current behavior

[1,2]:[n193-019-222:14623] [ 1] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0xb6)[0x7fb6a01cf826]
[1,2]:[n193-019-222:14623] [ 2] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(+0x921e13)[0x7fb6a01c5e13]
[1,2]:[n193-019-222:14623] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb789fe2090]
[1,2]:[n193-019-222:14623] [ 4] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZNSt8__detail9_Map_baseIN4core6DeviceESt4pairIKS2_St10shared_ptrINS1_12IStorageImplEEESaIS8_ENS_10_Select1stESt8equal_toIS2_ESt4hashIS2_ENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS4_+0x173)[0x7fb5a5025e43]
[1,2]:[n193-019-222:14623] [ 5] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZN4core10BufferImpl7reserveERKNS_5ShapeENS_6DeviceENS_8DataTypeEm+0x313)[0x7fb5a5025143]
[1,2]:[n193-019-222:14623] [ 6] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libembedding.so(_ZN9embedding33UniformModelParallelEmbeddingMetaC1ESt10shared_ptrIN4core19CoreResourceManagerEERKNS_24EmbeddingCollectionParamEm+0x2559)[0x7fb5a3627879]
[1,2]:[n193-019-222:14623] [ 7] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow23EmbeddingCollectionBaseIxxfE11update_metaESt10shared_ptrIN4core19CoreResourceManagerEEiRSt6vectorIiSaIiEE+0x131)[0x7fb5a30162e1]
[1,2]:[n193-019-222:14623] [ 8] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow30LookupForwardEmbeddingVarGPUOpIxxfE7ComputeEPNS_15OpKernelContextE+0x891)[0x7fb5a303d9f1]
[1,2]:[n193-019-222:14623] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xdc)[0x7fb6a1fa3bbc]
[1,2]:[n193-019-222:14623] [10]
[n193-019-222:14623] [ 0]
[1,4]:[n193-019-222:14625] *** Process received signal ***

Describe the expected behavior

Code to reproduce the issue

The model is modelzoo/deepfm, with no code modifications.
We launch it with MPI using the following command:

mpirun -np 16 --map-by ppr:4:socket -bind-to socket --hostfile ./hostfile \
  --allow-run-as-root --tag-output --report-bindings \
  --mca pml ob1 --mca btl ^openib --mca btl_tcp_if_exclude lo,docker0,bond0 \
  --wdir /home/tiger/deeprec \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5 -x NCCL_DEBUG=INFO \
  -x NCCL_IB_TIMEOUT=25 -x NCCL_IB_RETRY_CNT=7 -x NCCL_SOCKET_IFNAME=eth0 \
  -x HOROVOD_MPI_THREADS_DISABLE=0 -x TF_GPU_CUPTI_FORCE_CONCURRENT_KERNEL=1 \
  -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES -x NV_LIBCUBLAS_DEV_PACKAGE_NAME -x HTTPS_PROXY \
  -x TOTAL_ORACLES -x NV_LIBCUBLAS_PACKAGE -x GLOG_log_dir -x NV_LIBNCCL_DEV_PACKAGE_VERSION \
  -x YARN_APP_ID -x NM_LABEL -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_POD \
  -x OOM_LISTEN_MODE -x SEC_TOKEN_PATH -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_PORT \
  -x NVIDIA_PRODUCT_NAME -x PRIMUS_AM_RPC_PORT -x NV_LIBCUSPARSE_DEV_VERSION \
  -x NUM_OF_PRIMUS_worker -x YARN_CONTAINER_RUNTIME_DOCKER_IMAGE -x NV_CUDNN_VERSION \
  -x NV_LIBNPP_DEV_VERSION -x CUDA_VERSION -x PATH -x HTTP_PROXY -x NV_LIBNPP_DEV_PACKAGE \
  -x API_SERVER_PORT -x NV_CUDNN_PACKAGE_NAME -x PRIMUS_ROLE_CATEGORY -x YARN_CLASS_ID \
  -x LIBHDFS_OPTS -x ENV_DOCKER_CONTAINER_SECURITY_OPTION -x NV_LIBNCCL_DEV_PACKAGE_NAME \
  -x ENABLE_OOM_LISTENER -x NM_PORT -x API_SERVER_HOST -x NCCL_VERSION -x NM_HTTP_PORT \
  -x NV_LIBNCCL_PACKAGE_VERSION -x YARN_APP_PRIORITY -x YARN_APP_TYPE -x START_STATISTIC_STEP \
  -x NVIDIA_DRIVER_CAPABILITIES -x TZ -x SHUFFLE_DISK_MANAGER_PORT \
  -x YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS -x NM_AUX_SERVICE_mapreduce_shuffle -x SEC_KV_AUTH \
  -x TF_SCRIPT -x CLASSPATH -x LOCAL_DIRS -x HADOOP_YARN_HOME -x NV_LIBCUBLAS_DEV_VERSION \
  -x HADOOP_CONF_DIR -x NO_PROXY -x LIBRARY_PATH -x NV_LIBNPP_PACKAGE \
  -x PRIMUS_EXECUTOR_UNIQUE_ID -x PRIMUS_AM_RPC_HOST -x NV_NVPROF_DEV_PACKAGE \
  -x NV_NVML_DEV_VERSION -x YARN_CONTAINER_RESOURCE_PREFIX_MEMORY_MB \
  -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_BASE -x NV_CUDA_LIB_VERSION \
  -x RUNTIME_IDC_NAME -x TF_CONFIG -x YARN_APP_TAGS -x NV_LIBCUBLAS_DEV_PACKAGE -x LC_CTYPE \
  -x NVARCH -x NV_CUDA_CUDART_DEV_VERSION -x NLSPATH -x ENV_DOCKER_CONTAINER_SHM_SIZE \
  -x SHLVL -x TF_WORKSPACE -x JEMALLOC_PATH -x XFILESEARCHPATH \
  -x SPARK_3_SHUFFLE_SERVICE_PORT -x NV_LIBCUBLAS_PACKAGE_NAME -x NM_HOST \
  -x PRIMUS_SUBMIT_TIMESTAMP -x STOP_STATISTIC_STEP -x PYTHONPATH -x NV_LIBNCCL_PACKAGE_NAME \
  -x YARN_QUEUE_ID -x ENV_DOCKER_CONTAINER_DEVICE -x ROLES_LIST -x YARN_USER \
  -x LOAD_SERVICE_PSM -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_GPU -x PRIMUS_EXECUTOR_UNIQID \
  -x NV_NVPROF_VERSION -x JAVA_HOME -x NVIDIA_REQUIRE_CUDA -x YARN_CONTAINER_RUNTIME_TYPE \
  -x SPARK_SHUFFLE_SERVICE_PORT -x ENV_DOCKER_CONTAINER_CAP_ADD -x MALLOC_ARENA_MAX \
  -x SSD_MANAGER_PORT -x YARN_QUEUE_NAME -x NV_NVTX_VERSION -x YODEL_MODE \
  -x NV_CUDA_CUDART_VERSION -x BYTED_HOST_IPV6 -x NV_CUDA_COMPAT_PACKAGE -x LD_LIBRARY_PATH \
  -x HADOOP_TOKEN_FILE_LOCATION -x LOG_DIRS -x APPLICATION_ID -x HOME \
  -x NV_LIBCUSPARSE_VERSION -x HADOOP_COMMON_HOME -x HADOOP_HDFS_HOME -x OLDPWD \
  -x NV_LIBNCCL_PACKAGE -x MEM_USAGE_STRATEGY -x PWD -x NV_LIBCUBLAS_VERSION \
  -x ENV_DOCKER_CONTAINER_ULIMIT -x LOGNAME -x NV_CUDNN_PACKAGE -x PRIMUS_STAGING_DIR \
  -x NV_LIBNCCL_DEV_PACKAGE -x NVIDIA_VISIBLE_DEVICES -x NV_LIBNPP_VERSION \
  -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES_MILLI -x HADOOP_HOME -x CORE_DUMP_PROC_NAME \
  -x NV_CUDNN_PACKAGE_DEV -x USER \
  python3 train.py \
  --output_dir=hdfs://harunava/user/xxx/deeprec_v10 \
  --data_location=hdfs://harunava/user/xxx/criteo_small \
  --protocol=grpc --smartstaged=false --batch_size=2048 --steps=30000 \
  --ev=true --ev_elimination=l2 --ev_filter=counter --op_fusion=true \
  --input_layer_partitioner=0 --dense_layer_partitioner=16 \
  --group_embedding=collective --workqueue=true --parquet_dataset=false
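Most of the command is a long run of -x flags, which tell Open MPI's mpirun to export an environment variable to the launched ranks, either by name (-x NAME) or with an explicit value (-x NAME=VALUE). When trimming the command down for a smaller reproduction, it can be easier to generate that list than to edit it by hand. The following is only a sketch: the helper name build_mpirun_command and the shortened variable lists are illustrative assumptions, and the mapping, binding, and --mca options from the original command are deliberately left out.

```python
import shlex


def build_mpirun_command(num_procs, hostfile, assigned_env, forwarded_env, program):
    """Return an mpirun argv list (hypothetical helper, not part of DeepRec or SOK).

    assigned_env:  mapping emitted as `-x NAME=VALUE`.
    forwarded_env: names emitted as `-x NAME`, so the value is taken from the
                   environment of the shell that runs mpirun.
    program:       the command each rank should execute.
    """
    argv = ["mpirun", "-np", str(num_procs), "--hostfile", hostfile]
    for name, value in assigned_env.items():
        argv += ["-x", f"{name}={value}"]
    for name in forwarded_env:
        argv += ["-x", name]
    argv += list(program)
    return argv


# A reduced subset of the variables used in this report, for illustration only.
cmd = build_mpirun_command(
    16,
    "./hostfile",
    assigned_env={"NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "eth0"},
    forwarded_env=["PATH", "LD_LIBRARY_PATH", "PYTHONPATH"],
    program=["python3", "train.py", "--group_embedding=collective"],
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the forwarded names in a list makes it straightforward to bisect which variables are actually needed to trigger the crash.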

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
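Because mpirun was run with --tag-output, frames from different ranks can interleave in the log. As a diagnostic aid, the sketch below groups backtrace frames by rank and process ID. It assumes the line layout seen in the trace above, with an optional "[job,rank]:" tag followed by "[host:pid] [ n] module(symbol+offset)[address]"; the function name parse_frames and the regular expressions are illustrative, not part of any tool mentioned in this report.

```python
import re
from collections import defaultdict

# Example of a line this matches:
#   [1,2]:[host:14623] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb789fe2090]
# The leading "[job,rank]:" tag is treated as optional.
FRAME_RE = re.compile(
    r"^(?:\[(?P<job>\d+),(?P<rank>\d+)\][^:]*:)?"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\s*(?P<index>\d+)\]\s*"
    r"(?P<location>.*)$"
)
LOCATION_RE = re.compile(
    r"^(?P<module>[^()]+)\((?P<symbol>[^()+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(text):
    """Group complete backtrace frames by (rank, pid); other lines are skipped."""
    frames = defaultdict(list)
    for line in text.splitlines():
        head = FRAME_RE.match(line.strip())
        if head is None:
            continue
        loc = LOCATION_RE.match(head.group("location").strip())
        if loc is None:
            continue  # truncated frame, e.g. "[10]" with nothing after it
        rank = head.group("rank")
        key = (int(rank) if rank is not None else None, int(head.group("pid")))
        frames[key].append(
            {
                "index": int(head.group("index")),
                "module": loc.group("module"),
                "symbol": loc.group("symbol"),  # empty when only an offset is given
                "offset": loc.group("offset"),
                "address": loc.group("address"),
            }
        )
    return dict(frames)


SAMPLE = """\
[1,2]:[n193-019-222:14623] [ 1] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0xb6)[0x7fb6a01cf826]
[1,2]:[n193-019-222:14623] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb789fe2090]
[1,4]:[n193-019-222:14625] *** Process received signal ***
"""

frames = parse_frames(SAMPLE)
for (rank, pid), items in frames.items():
    for frame in items:
        print(rank, pid, frame["index"], frame["module"], frame["symbol"] or "<no symbol>")
```

The mangled C++ names in the symbol field can then be passed to a demangler such as c++filt to make the libcore.so and libembedding.so frames easier to read.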

Apr 27 '23 07:04 wangcaihua

@Mesilenceki @shijieliu

May 26 '23 14:05 liutongxuan

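This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible once my vacation ends.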

May 26 '23 14:05 wangcaihua
