pytorch icon indicating copy to clipboard operation
pytorch copied to clipboard

Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Interpolations

Open leimao opened this issue 1 year ago • 3 comments

The 2D GridSample rounds index to the nearest even using std::nearbyint whereas the 3D GridSample rounds index away from zero using std::round. This discrepancy needs to be resolved.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

leimao avatar Mar 17 '23 04:03 leimao

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/97000

Note: Links to docs will display an error until the docs builds have been completed.

:heavy_exclamation_mark: 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

:white_check_mark: No Failures

As of commit e80b3108630970b14574b72a5a4a05d38bb8a612: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Mar 17 '23 04:03 pytorch-bot[bot]

Can anyone please tell me what's causing the only failed test case? Is it due to the change or testing infrastructure instability? Does this specific test passed on other CUDA platforms?

leimao avatar Mar 17 '23 15:03 leimao


2023-03-17T07:16:23.4046858Z [ RUN      ] NVFuserTest.FusionBinaryOps_CUDA
2023-03-17T07:16:28.6430919Z unknown file: Failure
2023-03-17T07:16:28.6431496Z C++ exception with description "aten_output_tensor.allclose( fusion_output_tensor.to(aten_output_tensor.dtype()), tolerance_values.second, tolerance_values.first) INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/third_party/nvfuser/test/test_gpu_validator.h":419, please report a bug to PyTorch. 
2023-03-17T07:16:28.6433824Z Operation remainder
2023-03-17T07:16:28.6435225Z Validation error in output 0 on line 3848 in file /var/lib/jenkins/workspace/third_party/nvfuser/test/test_gpu1.cpp.
2023-03-17T07:16:28.6436117Z   Detected abs error of: nan
2023-03-17T07:16:28.6436912Z     absolute tolerance was set to 1.68222e-06
2023-03-17T07:16:28.6437738Z     and relative tolerance set to 2.23704e-06
2023-03-17T07:16:28.6438680Z Exception raised from testValidate at /var/lib/jenkins/workspace/third_party/nvfuser/test/test_gpu_validator.h:419 (most recent call first):
2023-03-17T07:16:28.6439916Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f777b595c1b in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/libc10.so)
2023-03-17T07:16:28.6440789Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbf (0x7f777b590b4f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/libc10.so)
2023-03-17T07:16:28.6442061Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7f777b59174e in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/libc10.so)
2023-03-17T07:16:28.6442789Z frame #3: <unknown function> + 0x1a1407 (0x55c44a32f407 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6443718Z frame #4: torch::jit::NVFuserTest_FusionBinaryOps_CUDA_Test::TestBody() + 0x1813 (0x55c44a3361d3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6444738Z frame #5: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x4a (0x55c44a59d2aa in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6445390Z frame #6: <unknown function> + 0x3fb43f (0x55c44a58943f in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6445910Z frame #7: <unknown function> + 0x3fb722 (0x55c44a589722 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6446405Z frame #8: <unknown function> + 0x3fbf95 (0x55c44a589f95 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6446965Z frame #9: testing::internal::UnitTestImpl::RunAllTests() + 0xe2f (0x55c44a5952bf in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6447516Z frame #10: testing::UnitTest::Run() + 0x98 (0x55c44a5956b8 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6448010Z frame #11: main + 0xfb (0x55c44a2d0adb in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6448444Z frame #12: __libc_start_main + 0xe7 (0x7f777abb6c87 in /lib/x86_64-linux-gnu/libc.so.6)
2023-03-17T07:16:28.6448891Z frame #13: _start + 0x2a (0x55c44a3026ba in /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin/nvfuser_tests)
2023-03-17T07:16:28.6449184Z " thrown in the test body.
2023-03-17T07:16:28.6449507Z [  FAILED  ] NVFuserTest.FusionBinaryOps_CUDA (5238 ms)

leimao avatar Mar 17 '23 15:03 leimao

@mikaylagawarecki @albanD @jbschlosser Could you please give us some feedback on this? We have another ONNX PR that expands the ONNX GridSample to N dimensional space. If we want to hide the rounding mode detail from the user, I think we should make the rounding mode consistent between different ND interpolations and different frameworks. Thanks.

leimao avatar Mar 27 '23 20:03 leimao

@mikaylagawarecki This is indeed a backward-compatibility breaking change.

Although I can triple-check, I think I have checked the rounding places you pointed out previously and they all ultimately use the Aten round implementation which is the std::nearbyint I mentioned earlier.

In the ONNX PR I submitted, there are CPU implementations for GridSample using different rounding modes. I guess we can use it as the reference for checking the PyTorch implementations.

Now my major concern is, what's causing the CI failure? Could you please take a look? Thank you.

leimao avatar Mar 28 '23 19:03 leimao

The CI failure looks unrelated to me, I just reran it.

Could you add some tests to make sure the rounding behavior across gridsampling ops for 0.5 is consistent please

mikaylagawarecki avatar Mar 28 '23 20:03 mikaylagawarecki

@mikaylagawarecki Could you please point me to the location where the tests should be?

leimao avatar Mar 28 '23 21:03 leimao

Of course, there are existing tests for grid_sample in test/test_nn.py, you can add a test there Let me know if you have any questions!

Edit: @leimao before you do this, heads up that we might need to have a more formal deprecation process for this change since it silently breaks BC (see here). Apologies for the churn, I will get back to you by EOW on this

mikaylagawarecki avatar Mar 29 '23 14:03 mikaylagawarecki

Apologize. I did not operate Git correctly and there was a patch that submitted some PyTorch old commits and automatically tagged lots of people :(

leimao avatar Apr 02 '23 17:04 leimao

@mikaylagawarecki I created one unit test and realized that some are some limitation of testing the rounding mode. Please check the new file changes and the updated pull request message.

leimao avatar Apr 02 '23 18:04 leimao

@mikaylagawarecki We also need to think about which rounding mode should be used for index rounding. The current PR uses std::nearbyint instead of std::round because the this minimizes the number of changes we have to make. However, this might not be consistent with the CV community conventions (someone has to double-check).

The unit test added, however, is agnostic to the rounding mode. As long as all the implementations use the same rounding mode, it will pass.

leimao avatar Apr 02 '23 18:04 leimao

@mikaylagawarecki I also need some feedback from you and your colleagues on this: https://github.com/pytorch/pytorch/pull/97000#issuecomment-1493410808 since we only want to make this change once.

CC my colleague @yuanyao-nv for feedback as well as he's mainly responsible for managing the ONNX contributions.

leimao avatar Apr 03 '23 22:04 leimao

@mikaylagawarecki I also need some feedback from you and your colleagues on this: #97000 (comment) since we only want to make this change once.

CC my colleague @yuanyao-nv for feedback as well as he's mainly responsible for managing the ONNX contributions.

I checked the PyTorch interpolate function, the OpenCV resize function, and thought about it twice. Those two functions created a mapping between unnormalized (floating point) pixel coordinates whereas the PyTorch grid_sample function created a mapping from normalized pixel coordinates to unnormalized (floating point) pixel coordinates. That is to say, there is no reference for how to round the unnormalized (floating point) pixel coordinates in our case.

Frankly speaking, in my own opinion, rounding the indices to the nearest even, which is used in the PyTorch TOT grid sample 2D implementation does seem weird to me, because, for example, the index rounding behavior is inconsistent for 1.5 (rounding to 2) and 2.5 (rounding to 2).

leimao avatar Apr 04 '23 05:04 leimao

I checked the PyTorch interpolate function, the OpenCV resize function, and thought about it twice. Those two functions created a mapping between unnormalized (floating point) pixel coordinates whereas the PyTorch grid_sample function created a mapping from normalized pixel coordinates to unnormalized (floating point) pixel coordinates. That is to say, there is no reference for how to round the unnormalized (floating point) pixel coordinates in our case.

Frankly speaking, in my own opinion, rounding the indices to the nearest even, which is used in the PyTorch TOT grid sample 2D implementation does seem weird to me, because, for example, the index rounding behavior is inconsistent for 1.5 (rounding to 2) and 2.5 (rounding to 2).

Got it. Since PyTorch's implementation of round has this round to even behavior, I'm inclined to think that std::nearbyint would be consistent here.

cc @jbschlosser and @albanD for further thoughts

mikaylagawarecki avatar Apr 04 '23 16:04 mikaylagawarecki

@mikaylagawarecki Thanks for the approval. Could you please re-trigger the CI? @jbschlosser and @albanD Do you have any opinions? Once this PR is merged, the ONNX reference implementation will use the new GridSample implementation as reference. Thanks.

leimao avatar Apr 05 '23 17:04 leimao

Yes that looks good to me!

albanD avatar Apr 05 '23 17:04 albanD

@pytorchbot merge

mikaylagawarecki avatar Apr 05 '23 18:04 mikaylagawarecki

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging Check the merge workflow status here

pytorchmergebot avatar Apr 05 '23 18:04 pytorchmergebot