
IntelPyTorch_TorchCCL_Multinode_Training testing DLRM with aikit docker pytorch env saw Segment Fault

Open xuechendi opened this issue 3 years ago • 2 comments

Summary

I hit a segmentation fault when running DLRM training in the AIKit image docker.io/intel/oneapi-aikit:latest. Can anyone give some insight? Is there anyone we can approach internally?

Version

DLRM training runs successfully with:

  • g++ 8.4
  • torch 1.5.0a0+b58f89b tags/v1.5.0-rc3
  • intel-extension-for-pytorch 0.1 checkout tags/v0.2
  • oneCCL 2021.1-beta07-1
  • torch_ccl 1.0 2021.1-beta07-1

DLRM training fails with a segmentation fault in the docker.io/intel/oneapi-aikit:latest pytorch env, whose versions are:

  • g++ 7.5
  • torch 1.8.0a0+37c1f4a
  • intel-extension-for-pytorch 1.8.0
  • oneccl 2021.4
  • torch_ccl 1.1.0+064d9eb
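
To compare the two environments side by side, a small script along these lines could record the installed versions in each one. This is only a sketch and was not part of the original report; the distribution names in `PACKAGES` are assumptions and may differ depending on how the packages were installed (for example, built from source), so check them against `pip list` first.

```python
from importlib import metadata

# Distribution names are assumptions; adjust them to match `pip list` in your env.
PACKAGES = ["torch", "intel_extension_for_pytorch", "torch_ccl"]


def collect_versions(names):
    """Return {name: version string or 'not installed'} for each distribution."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name}: {version}")
```

Running it once in the working environment and once inside the AIKit container gives two listings that can be diffed directly.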

Environment

docker.io/intel/oneapi-aikit:latest pytorch env

Reproduce

https://github.com/mlperf/training_results_v0.7/tree/master/Intel/benchmarks/dlrm/1-node-4s-cpx-pytorch

Observed behavior

image

Core dump file

core_dump.zip
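
Alongside the core dump, a Python-level traceback at the moment of the crash can help narrow down which call triggers the fault. A minimal sketch using the standard-library `faulthandler` module is below; the helper name `enable_crash_traceback` is hypothetical, and where to call it in the DLRM training script is an assumption, since the original report does not show the launch code.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Dump the Python traceback of all threads on fatal signals such as SIGSEGV."""
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_traceback()
    # Hypothetical: import and start the training entry point after this line.
```

The same effect is available without code changes by launching the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable.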

xuechendi, Nov 10 '21 09:11

@jingxu10 Could you address it?

louie-tsai, Jul 23 '22 00:07

This is out of date. Please try DLRM in the Intel AI Model Zoo at https://github.com/IntelAI/models/tree/pytorch-r1.10-models/quickstart/recommendation/pytorch/dlrm/training/cpu

jingxu10, Jul 23 '22 12:07

@xuechendi Please follow the feedback from Jing and try the link below: https://github.com/IntelAI/models/tree/master/quickstart/recommendation/pytorch/dlrm/training/cpu

In the meantime, you might need to set up the environment by following the README at the link below instead. MLPerf might require some additional packages or specific package versions. https://github.com/mlperf/training_results_v0.7/tree/master/Intel/benchmarks/dlrm/1-node-4s-cpx-pytorch

aice-support, Jun 06 '23 16:06

Solution provided.

aice-support, Jun 06 '23 16:06