DeepRec

Segmentation fault when using EV during training

Open A-Wanderer opened this issue 2 years ago • 0 comments

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.5 LTS
  • Docker Images: alideeprec/deeprec-tianchi-bazel-cache:deeprec-cpu-py36-ubuntu18.04
  • TensorFlow installed from (source or binary): Blade source
  • TensorFlow version (use command below): Current latest version (8.16, 20d12f0, [OneDNN] Update oneDNN......)
  • Python version: 3.6.12
  • Bazel version (if compiling from source): Build label: 0.26.1
  • GCC/Compiler version (if compiling from source): gcc version 7.5.0

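Describe the expected behavior

After building from the current latest source [20d12f0], testing the modelzoo code with the EV feature enabled ends in a segmentation fault. The command is [ python /DeepRec/modelzoo/$model/train.py --step=401 --ev=true --emb_fusion=false ]. This happens for several of the model examples, for instance DIN, DLRM, and DSSM.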
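For convenience, here is a minimal sketch of a script that runs the command above for each of the three models named in this report and flags the runs that die with SIGSEGV. The helper names (`build_command`, `is_segfault`) are made up for illustration, and the script assumes the modelzoo lives under /DeepRec/modelzoo as in the command above; it is not part of DeepRec.

```python
import signal
import subprocess

# Models named in this report; other modelzoo entries may be affected too.
MODELS = ["DIN", "DLRM", "DSSM"]


def build_command(model, root="/DeepRec/modelzoo"):
    """Return the training command from this report for one model."""
    return [
        "python",
        f"{root}/{model}/train.py",
        "--step=401",
        "--ev=true",
        "--emb_fusion=false",
    ]


def is_segfault(returncode):
    """On POSIX, subprocess reports death by signal N as return code -N."""
    return returncode == -signal.SIGSEGV


if __name__ == "__main__":
    for model in MODELS:
        result = subprocess.run(build_command(model))
        status = "SEGFAULT" if is_segfault(result.returncode) else f"exit {result.returncode}"
        print(f"{model}: {status}")
```

Each model is run in its own process, so a crash in one does not prevent the others from being tested.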

gdb shows:

#0  _mm_store_ps (__A=..., __P=0x7fbbec18a288)
    at /usr/lib/gcc/x86_64-linux-gnu/7/include/xmmintrin.h:976
#1  Eigen::internal::pstore<float, float __vector(4)>(float*, float __vector(4) const&) (
    from=..., to=0x7fbbec18a288)
    at external/eigen_archive/Eigen/src/Core/arch/SSE/PacketMath.h:538
#2  Eigen::internal::pstoret<float, float __vector(4), 16>(float*, float __vector(4) const&) (from=..., to=0x7fbbec18a288)
    at external/eigen_archive/Eigen/src/Core/GenericPacketMath.h:659
#3  Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::DefaultDevice>::writePacket<16>(long, float __vector(4) const&) (
    this=0x7fbbd3ffd8b0, x=..., index=0)
    at external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorEvaluator.h:125
#4  Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op<float, float> >, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer> const> const, Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op<float const, float const> >, Eigen::TensorChippingOp<0l, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const> const> const> const> const, Eigen::DefaultDevice>::evalPacket (i=0, this=0x7fbbd3ffd8b0)
    at external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorAssign.h:178
#5  Eigen::internal::TensorExecutor<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op<float, float> >, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer> const> const, Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op<float const, float const> >, Eigen::TensorChippingOp<0l, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const> const> const> const> const, Eigen::DefaultDevice, true, (Eigen::internal::TiledEvaluation)0>::run (expr=..., device=...)
    at external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h:144
#6  0x00007fbce441b4cf in Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>::operator=<Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op<float, float> >, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer> const> const, Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_product_op<float const, float const> >, Eigen::TensorChippingOp<0l, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const> const> const> >
    (other=..., this=0x7fbbd3ffd9f0)
    at external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorMap.h:332
#7  tensorflow::KvSparseApplyAdamAsyncOp<Eigen::ThreadPoolDevice, float, long long, long long>::Compute(tensorflow::OpKernelContext*)::{lambda(long long, long long)#2}::operator()(long long, long long) const (__closure=0x7fbb94e20630, start_i=<optimized out>, 
    limit_i=40) at tensorflow/core/kernels/training_ali_ops.cc:2131
#8  0x00007fbcdc69534e in std::_Function_handler<void (long, long), tensorflow::thread::ThreadPool::ParallelFor(long long, long long, std::function<void (long long, long long)>)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&)
    ()
   from /home/pai/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#9  0x00007fbcdc695df9 in std::_Function_handler<void (long, long), Eigen::ThreadPoolDevice::parallelFor(long, Eigen::TensorOpCost const&, std::function<long (long)>, std::function<void (long, long)>) const::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&) ()
   from /home/pai/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
........
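
One detail in the trace above may be worth checking. Frame #0 is `_mm_store_ps`, the aligned SSE store, which requires its destination to be 16-byte aligned, and the `TensorMap` types in the trace are declared with alignment 16. The destination pointer in frames #0 to #2 is 0x7fbbec18a288. The sketch below only does the arithmetic on that address; it is consistent with a misaligned store being the cause, but that is an assumption and has not been confirmed against the DeepRec source.

```python
ALIGNMENT = 16  # bytes required by the aligned SSE store _mm_store_ps


def is_aligned(address, alignment=ALIGNMENT):
    """True if `address` is a multiple of `alignment`."""
    return address % alignment == 0


# Destination pointer (`__P` / `to`) copied from frames #0-#2 of the trace.
store_target = 0x7fbbec18a288

print(hex(store_target),
      "aligned:", is_aligned(store_target),
      "offset:", store_target % ALIGNMENT)
```

The address sits 8 bytes past a 16-byte boundary, so it is not 16-byte aligned.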

A-Wanderer, Aug 16 '22 16:08