
Multi-machine, multi-GPU SOK core dump

Open wangcaihua opened this issue 1 year ago • 2 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04):
  • DeepRec version or commit id: 3204 release
  • Python version: 3.8
  • Bazel version (if compiling from source): 5.3.1
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.6.2

Describe the current behavior

[1,2]:[n193-019-222:14623] [ 1] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0xb6)[0x7fb6a01cf826]
[1,2]:[n193-019-222:14623] [ 2] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(+0x921e13)[0x7fb6a01c5e13]
[1,2]:[n193-019-222:14623] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb789fe2090]
[1,2]:[n193-019-222:14623] [ 4] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZNSt8__detail9_Map_baseIN4core6DeviceESt4pairIKS2_St10shared_ptrINS1_12IStorageImplEEESaIS8_ENS_10_Select1stESt8equal_toIS2_ESt4hashIS2_ENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS4_+0x173)[0x7fb5a5025e43]
[1,2]:[n193-019-222:14623] [ 5] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZN4core10BufferImpl7reserveERKNS_5ShapeENS_6DeviceENS_8DataTypeEm+0x313)[0x7fb5a5025143]
[1,2]:[n193-019-222:14623] [ 6] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libembedding.so(_ZN9embedding33UniformModelParallelEmbeddingMetaC1ESt10shared_ptrIN4core19CoreResourceManagerEERKNS_24EmbeddingCollectionParamEm+0x2559)[0x7fb5a3627879]
[1,2]:[n193-019-222:14623] [ 7] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow23EmbeddingCollectionBaseIxxfE11update_metaESt10shared_ptrIN4core19CoreResourceManagerEEiRSt6vectorIiSaIiEE+0x131)[0x7fb5a30162e1]
[1,2]:[n193-019-222:14623] [ 8] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow30LookupForwardEmbeddingVarGPUOpIxxfE7ComputeEPNS_15OpKernelContextE+0x891)[0x7fb5a303d9f1]
[1,2]:[n193-019-222:14623] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xdc)[0x7fb6a1fa3bbc]
[1,2]:[n193-019-222:14623] [10] [n193-019-222:14623] [ 0]
[1,4]:[n193-019-222:14625] *** Process received signal ***

Describe the expected behavior

Code to reproduce the issue

  1. The model we use is modelzoo/deepfm, with no code modifications.
  2. We launch it with MPI, using the following command:

     mpirun -np 16 --map-by ppr:4:socket -bind-to socket --hostfile ./hostfile --allow-run-as-root --tag-output --report-bindings --mca pml ob1 --mca btl ^openib --mca btl_tcp_if_exclude lo,docker0,bond0 --wdir /home/tiger/deeprec \
       -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5 -x NCCL_DEBUG=INFO -x NCCL_IB_TIMEOUT=25 -x NCCL_IB_RETRY_CNT=7 -x NCCL_SOCKET_IFNAME=eth0 -x HOROVOD_MPI_THREADS_DISABLE=0 -x TF_GPU_CUPTI_FORCE_CONCURRENT_KERNEL=1 \
       -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES -x NV_LIBCUBLAS_DEV_PACKAGE_NAME -x HTTPS_PROXY -x TOTAL_ORACLES -x NV_LIBCUBLAS_PACKAGE -x GLOG_log_dir -x NV_LIBNCCL_DEV_PACKAGE_VERSION -x YARN_APP_ID -x NM_LABEL -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_POD -x OOM_LISTEN_MODE -x SEC_TOKEN_PATH -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_PORT -x NVIDIA_PRODUCT_NAME -x PRIMUS_AM_RPC_PORT -x NV_LIBCUSPARSE_DEV_VERSION -x NUM_OF_PRIMUS_worker -x YARN_CONTAINER_RUNTIME_DOCKER_IMAGE -x NV_CUDNN_VERSION -x NV_LIBNPP_DEV_VERSION -x CUDA_VERSION -x PATH -x HTTP_PROXY -x NV_LIBNPP_DEV_PACKAGE -x API_SERVER_PORT -x NV_CUDNN_PACKAGE_NAME -x PRIMUS_ROLE_CATEGORY -x YARN_CLASS_ID -x LIBHDFS_OPTS -x ENV_DOCKER_CONTAINER_SECURITY_OPTION -x NV_LIBNCCL_DEV_PACKAGE_NAME -x ENABLE_OOM_LISTENER -x NM_PORT -x API_SERVER_HOST -x NCCL_VERSION -x NM_HTTP_PORT -x NV_LIBNCCL_PACKAGE_VERSION -x YARN_APP_PRIORITY -x YARN_APP_TYPE -x START_STATISTIC_STEP -x NVIDIA_DRIVER_CAPABILITIES -x TZ -x SHUFFLE_DISK_MANAGER_PORT -x YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS -x NM_AUX_SERVICE_mapreduce_shuffle -x SEC_KV_AUTH -x TF_SCRIPT -x CLASSPATH -x LOCAL_DIRS -x HADOOP_YARN_HOME -x NV_LIBCUBLAS_DEV_VERSION -x HADOOP_CONF_DIR -x NO_PROXY -x LIBRARY_PATH -x NV_LIBNPP_PACKAGE -x PRIMUS_EXECUTOR_UNIQUE_ID -x PRIMUS_AM_RPC_HOST -x NV_NVPROF_DEV_PACKAGE -x NV_NVML_DEV_VERSION -x YARN_CONTAINER_RESOURCE_PREFIX_MEMORY_MB -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_BASE -x NV_CUDA_LIB_VERSION -x RUNTIME_IDC_NAME -x TF_CONFIG -x YARN_APP_TAGS -x NV_LIBCUBLAS_DEV_PACKAGE -x LC_CTYPE -x NVARCH -x NV_CUDA_CUDART_DEV_VERSION -x NLSPATH -x ENV_DOCKER_CONTAINER_SHM_SIZE -x SHLVL -x TF_WORKSPACE -x JEMALLOC_PATH -x XFILESEARCHPATH -x SPARK_3_SHUFFLE_SERVICE_PORT -x NV_LIBCUBLAS_PACKAGE_NAME -x NM_HOST -x PRIMUS_SUBMIT_TIMESTAMP -x STOP_STATISTIC_STEP -x PYTHONPATH -x NV_LIBNCCL_PACKAGE_NAME -x YARN_QUEUE_ID -x ENV_DOCKER_CONTAINER_DEVICE -x ROLES_LIST -x YARN_USER -x LOAD_SERVICE_PSM -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_GPU -x PRIMUS_EXECUTOR_UNIQID -x NV_NVPROF_VERSION -x JAVA_HOME -x NVIDIA_REQUIRE_CUDA -x YARN_CONTAINER_RUNTIME_TYPE -x SPARK_SHUFFLE_SERVICE_PORT -x ENV_DOCKER_CONTAINER_CAP_ADD -x MALLOC_ARENA_MAX -x SSD_MANAGER_PORT -x YARN_QUEUE_NAME -x NV_NVTX_VERSION -x YODEL_MODE -x NV_CUDA_CUDART_VERSION -x BYTED_HOST_IPV6 -x NV_CUDA_COMPAT_PACKAGE -x LD_LIBRARY_PATH -x HADOOP_TOKEN_FILE_LOCATION -x LOG_DIRS -x APPLICATION_ID -x HOME -x NV_LIBCUSPARSE_VERSION -x HADOOP_COMMON_HOME -x HADOOP_HDFS_HOME -x OLDPWD -x NV_LIBNCCL_PACKAGE -x MEM_USAGE_STRATEGY -x PWD -x NV_LIBCUBLAS_VERSION -x ENV_DOCKER_CONTAINER_ULIMIT -x LOGNAME -x NV_CUDNN_PACKAGE -x PRIMUS_STAGING_DIR -x NV_LIBNCCL_DEV_PACKAGE -x NVIDIA_VISIBLE_DEVICES -x NV_LIBNPP_VERSION -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES_MILLI -x HADOOP_HOME -x CORE_DUMP_PROC_NAME -x NV_CUDNN_PACKAGE_DEV -x USER \
       python3 train.py --output_dir=hdfs://harunava/user/xxx/deeprec_v10 --data_location=hdfs://harunava/user/xxx/criteo_small --protocol=grpc --smartstaged=false --batch_size=2048 --steps=30000 --ev=true --ev_elimination=l2 --ev_filter=counter --op_fusion=true --input_layer_partitioner=0 --dense_layer_partitioner=16 --group_embedding=collective --workqueue=true --parquet_dataset=false

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
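
For anyone triaging the trace under "Describe the current behavior", it may help to split each frame into rank, frame index, library, and mangled symbol before demangling. The following is only a minimal sketch, not part of DeepRec or SOK: the helper name `parse_frames` and the regular expression are hypothetical, and they assume the frame layout seen in this report (an Open MPI `--tag-output` prefix, optionally with a `<stderr>` tag that the page may have dropped).

```python
import re

# Assumed frame layout, taken from the trace in this report:
#   [1,2]:[host:pid] [ 5] /path/libcore.so(symbol+0x313)[0x7fb5a5025143]
FRAME_RE = re.compile(
    r"\[(?P<job>\d+),(?P<rank>\d+)\](?:<\w+>)?:"        # --tag-output prefix
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"            # host name and pid
    r"\[\s*(?P<index>\d+)\]\s+"                         # frame index
    r"(?P<lib>/[^\s(]+)"                                # shared library path
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"  # symbol (may be empty) + offset
)


def parse_frames(text):
    """Return one dict per stack frame found in `text` (hypothetical helper)."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append({
            "rank": int(m.group("rank")),
            "index": int(m.group("index")),
            "lib": m.group("lib").rsplit("/", 1)[-1],  # keep the file name only
            "symbol": m.group("symbol"),
            "offset": m.group("offset"),
        })
    return frames


# Two frames copied from the trace above.
SAMPLE = (
    "[1,2]:[n193-019-222:14623] [ 5] /usr/lib/python3.8/site-packages/"
    "merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so"
    "(_ZN4core10BufferImpl7reserveERKNS_5ShapeENS_6DeviceENS_8DataTypeEm+0x313)"
    "[0x7fb5a5025143] "
    "[1,2]:[n193-019-222:14623] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6"
    "(+0x43090)[0x7fb789fe2090]"
)

frames = parse_frames(SAMPLE)
for frame in frames:
    print(frame["rank"], frame["index"], frame["lib"], frame["symbol"] or "<no symbol>")
```

The extracted `symbol` values can then be passed to a demangler such as `c++filt` from GNU binutils to get readable C++ names.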

wangcaihua, Apr 27 '23 07:04

@Mesilenceki @shijieliu

liutongxuan, May 26 '23 14:05

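This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible once my vacation ends.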

wangcaihua, May 26 '23 14:05