Paddle icon indicating copy to clipboard operation
Paddle copied to clipboard

[mkldnn] Fix elementwise_sub sign reverse for mkldnn

Open zh794390558 opened this issue 2 years ago • 6 comments

PR types

Bug fixes

PR changes

OPs

Describe

  • Fixed elementwise_sub mkldnn kernel BUG while specifying x.shape as [180, 1] and y.shape as [1, 256] . In this case, the sign of some numbers is wrong.

  • But we found that it raised dnnl::error, could not create a primitive descriptor for a reorder primitive error after we added unittest test_check_grad_ignore_x and test_check_grad_normal in TestMKLDNNElementwiseSubOp40. The log information with GLOG_v=6 as follows:

    I0914 12:02:47.651978 11870 operator.cc:277] Place(cpu) Op(elementwise_sub_grad), inputs:{Out@GRAD[Out@GRAD:float[180, 256]({})(Place(cpu))], X[X:unknown_dtype[180, 1]({})(Place(cpu))], Y[Y:unknown_dtype[1, 256]({})(Place(cpu))]}, outputs:{X@GRAD[], Y@GRAD[Y@GRAD:float[1, 256]({})(Place(cpu))]}.
    I0914 12:02:47.652005 11870 operator.cc:210] Place(cpu) Op(fetch), inputs:{X[Y@GRAD:float[1, 256]({})(Place(cpu))]}, outputs:{Out[fetch:[-1]({{}})()]}.
    I0914 12:02:47.652037 11870 fetch_op.cc:109] Fetch variable Y@GRAD to variable fetch's 0 column.
    I0914 12:02:47.652055 11870 fetch_op.cc:36] innerTransDataLayoutFromMKLDNN
    I0914 12:02:47.652062 11870 device_context.cc:80] DeviceContextPool Get: Place(cpu)
    I0914 12:02:47.652086 11870 allocator_facade.cc:343] GetAllocator Place(cpu) 1024
    W0914 12:02:47.652269 11870 operator.cc:284] fetch raises an exception dnnl::error, could not create a primitive descriptor for a reorder primitive
    I0914 12:02:47.652379 11870 executor.cc:69] destroy ExecutorPrepareContext
    
  • Moreover, if we reversed the shape of x and y, such as [1, 180] [256, 1] , it still failed and raised error,such as element_sub_grad reshape error

How to reproduce ?

  • step 1: git fetch this PR
  • step 2: cd Paddle & mkdir build & cd build
  • step 3: build paddle
#!/bin/bash

cmake .. -DWITH_GPU=ON \  
        -DPY_VERSION=3.8 \
        -DWITH_DISTRIBUTE=OFF  \
        -DCMAKE_BUILD_TYPE=Release \
        -DWITH_MKL=ON \
        -DWITH_MKLDNN=ON

#make -j$(nproc)
make -j24

python -m pip uninstall paddlepaddle-gpu -y
python -m pip install python/dist/paddlepaddle_gpu-0.0.0-cp38-cp38-linux_x86_64.whl
  • step 4: run this unittest
cd Paddle/python/paddle/fluid/tests/unittests

python test_elementwise_sub_op.py TestMKLDNNElementwiseSubOp40

# or run single test case
GLOG_v=6 python test_elementwise_sub_op.py TestMKLDNNElementwiseSubOp40.test_check_grad_ignore_x > err.log 2>&1

zh794390558 avatar Sep 14 '22 12:09 zh794390558

@piotrekobi Please review

jczaja avatar Sep 15 '22 08:09 jczaja

@zh794390558 We are looking at this PR and testing it against our workloads. We should have results by the end of this week. We will write here if all is fine or any problems found.

jczaja avatar Sep 15 '22 08:09 jczaja

Fix as comment, please check it.

zh794390558 avatar Sep 15 '22 12:09 zh794390558

updating on behalf of @piotrekobi - we found big accuracy drop for a model caused by this PR in our test... currently we are re-testing to ensure it is not a false alert. Stay tuned.

yaomichael avatar Sep 20 '22 03:09 yaomichael

updating on behalf of @piotrekobi - we found big accuracy drop for a model caused by this PR in our test... currently we are re-testing to ensure it is not a false alert. Stay tuned.

can i merge this PR? I need this feature with other to do other work, thx.

zh794390558 avatar Sep 21 '22 02:09 zh794390558

Yes. The drop turned out to be caused by some different change. I'm sorry for the trouble.

piotrekobi avatar Sep 21 '22 07:09 piotrekobi