oneAPI-samples
IntelPyTorch_TorchCCL_Multinode_Training: segmentation fault when testing DLRM with the AIKit docker pytorch env
Summary
I hit a segmentation fault when running DLRM training in the AIKit image docker.io/intel/oneapi-aikit:latest. Can anyone give some insight? Is there anyone we can approach internally?
Version
DLRM training runs successfully with:
- g++ 8.4
- torch 1.5.0a0+b58f89b tags/v1.5.0-rc3
- intel-extension-for-pytorch 0.1 checkout tags/v0.2
- oneCCL 2021.1-beta07-1
- torch_ccl 1.0 2021.1-beta07-1
DLRM training fails with a segmentation fault in the docker.io/intel/oneapi-aikit:latest pytorch env, which has these versions:
- g++ 7.5
- torch 1.8.0a0+37c1f4a
- intel-extension-for-pytorch 1.8.0
- oneccl 2021.4
- torch_ccl 1.1.0+064d9eb
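To compare the two environments side by side, the installed versions can be dumped with a small script like the one below. This is only a sketch: the distribution names in `PACKAGES` are my assumptions about how these packages are registered with pip and may differ in a given env, and compiler or oneCCL versions that are not pip packages will not show up here.

```python
from importlib import metadata

# Assumed pip distribution names; adjust to whatever `pip list` shows in your env.
PACKAGES = ["torch", "intel_extension_for_pytorch", "torch_ccl"]


def collect_versions(packages):
    """Return a dict mapping each package name to its installed version string."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep missing packages in the report so the two envs can be diffed.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name}: {version}")
```

Running it once in the working env and once inside the AIKit container gives two listings that can be diffed directly.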
Environment
docker.io/intel/oneapi-aikit:latest pytorch env
Reproduce
https://github.com/mlperf/training_results_v0.7/tree/master/Intel/benchmarks/dlrm/1-node-4s-cpx-pytorch
Observed behavior
Core dump file
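In addition to the core dump, it may help to capture the Python-level stack at the moment of the crash. A minimal sketch using the standard library `faulthandler` module is below; the helper name `enable_crash_traceback` and the log path are hypothetical, and the call would need to be added near the top of the training script before the failing code runs.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks of all threads to log_path on SIGSEGV and similar fatal signals.

    Returns the open file object; it must stay open for as long as
    faulthandler is enabled.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical log location; pick any writable path.
    handle = enable_crash_traceback("dlrm_crash_traceback.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

This only shows which Python call was active when the fault happened; the native frames still have to come from the core dump.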
@jingxu10 Could you address it?
This is out of date. Please try DLRM in Intel AI model zoo at https://github.com/IntelAI/models/tree/pytorch-r1.10-models/quickstart/recommendation/pytorch/dlrm/training/cpu
@xuechendi Could you follow the feedback from Jing and try the link below? https://github.com/IntelAI/models/tree/master/quickstart/recommendation/pytorch/dlrm/training/cpu
In the meantime, if you want to keep using the MLPerf code, you might need to set up the env by following the README at the link below instead. MLPerf might require some additional packages or specific package versions. https://github.com/mlperf/training_results_v0.7/tree/master/Intel/benchmarks/dlrm/1-node-4s-cpx-pytorch
Solution provided.