fix: torch backend - gather fix
PR Description
The previous implementation of gather() in the torch backend was incorrect but did not get caught by pytest. I have corrected the vast majority of issues. Pytest might still show some errors for edge cases due to the very limited implementation of torch.gather() and the various workarounds needed to recreate tensorflow's ground-truth behaviour; they'll likely be related to dimensionality, and I'm still removing them (it's all in the conditions for reshaping and expanding). Side note: tensorflow's official documentation for gather contains errors in the expression for calculating the dimensions of the output.
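For context, here is a minimal sketch (illustrative only, not the PR's code) of the semantic gap being worked around: torch.gather() requires the index tensor to have the same rank as the input and gathers element-wise along a single dim, whereas tf.gather() selects whole slices, so params and indices have to be expanded/reshaped before the native call.

```python
import torch

params = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# tf.gather(params, [1, 0], axis=0) would simply return rows 1 and 0.
# torch.gather needs an index tensor of the same rank as params, so the
# row indices must first be expanded across the remaining dimension:
indices = torch.tensor([1, 0]).unsqueeze(-1).expand(-1, params.shape[1])
rows = torch.gather(params, 0, indices)
print(rows)  # tensor([[3., 4.], [1., 2.]])
```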
Related Issue
Closes https://github.com/unifyai/ivy/issues/26498
Checklist
- [x] Did you add a function?
- [ ] Did you add the tests?
- [ ] Did you run your tests and are your tests passing?
- [x] Did pre-commit not fail on any check?
- [x] Did you follow the steps we provided?
Socials
Reformatting Task Checklist
IMPORTANT NOTICE 🚨:
The Ivy Docs represent the ground truth for the task descriptions and this checklist should only be used as a supplementary item to aid with the review process.
LEGEND 🗺:
- ❌ : Check item is not completed.
- ✅ : Check item is ready for review.
- 🆘 : Stuck/Doubting implementation (PR author should add comments explaining why).
- ⏩ : Check is not applicable to function (skip).
- 🆗 : Check item is already implemented and does not require any edits.
CHECKS 📑:
- [ ] ✅: Remove all lambda and direct bindings for the backend functions in:
- [ ] 🆗: Implement the following if they don't exist:
  - [ ] 🆗: The `ivy.Array` instance method in ivy/data_classes/array/general.py.
  - [ ] 🆗: The `ivy.Array` special method in ivy/data_classes/array/array.py.
  - [ ] ⏩: The `ivy.Array` reverse special method in ivy/data_classes/array/array.py.
  - [ ] 🆗: The `ivy.Container` static method in ivy/data_classes/container/general.py.
  - [ ] 🆗: The `ivy.Container` instance method in ivy/data_classes/container/general.py.
  - [ ] 🆗: The `ivy.Container` special method in ivy/data_classes/container/container.py.
  - [ ] ⏩: The `ivy.Container` reverse special method in ivy/data_classes/container/container.py.
- [ ] 🆗: Make sure that the aforementioned methods are added into the correct category-specific parent class, such as `ivy.ArrayWithElementwise`, `ivy.ContainerWithManipulation` etc.
- [ ] 🆗: Correct all of the Function Arguments and the type hints for every function and its relevant methods, including those you did not implement yourself.
- [ ] 🆗: Add the correct Docstrings to every function and its relevant methods, including those you did not implement yourself. The following should be added:
  - [ ] 🆗: The function's Array API standard description in ivy/functional/general.py. If the function is not part of the Array API standard then a description of similar style should be added to the same file. The following modifications should be made to the description:
    - [ ] 🆗: Remove type definitions in the `Parameters` and `Returns` sections.
    - [ ] 🆗: Add `out` to the `Parameters` section if the function accepts an `out` argument.
    - [ ] 🆗: Replace `out` with `ret` in the `Returns` section.
  - [ ] 🆗: Reference to the docstring for ivy.function_name (5.a) for the function description and modified `Parameters` and `Returns` sections as described in the docs in:
    - [ ] 🆗: ivy/array/general.py.
    - [ ] 🆗: ivy/container/general.py (in the static and instance method versions).
    - [ ] ⏩: ivy/array/array.py if the function has a special method (like `__function_name__`).
    - [ ] ⏩: ivy/array/array.py if the function has a reverse special method (like `__rfunction_name__`).
    - [ ] ⏩: ivy/container/container.py if the function has a special method (like `__function_name__`).
    - [ ] ⏩: ivy/container/container.py if the function has a reverse special method (like `__rfunction_name__`).
- [ ] 🆗: Add thorough Docstring Examples for every function and its relevant methods and ensure they pass the docstring tests.

  Functional Examples in ivy/functional/general.py.
  - [ ] 🆗: Cover all possible variants for each of the arguments independently (not combinatorily).
  - [ ] 🆗: Vary the values and input shapes considerably between examples.
  - [ ] 🆗: Start out simple and get more complex with each example.
  - [ ] 🆗: Show an example with:
    - [ ] 🆗: `out` unused.
    - [ ] 🆗: `out` used to update a new array y.
    - [ ] 🆗: `out` used to inplace update the input array x (if x has the same dtype and shape as the return).
  - [ ] 🆗: If broadcasting is relevant for the function, then show examples which highlight this.

  Nestable Function Examples in ivy/functional/general.py. Only if the function supports nestable operations.
  - [ ] 🆗: Add an example that passes in an `ivy.Container` instance in place of one of the arguments.
  - [ ] 🆗: Add an example that passes in `ivy.Container` instances for multiple arguments.

  Container Static Method Examples in ivy/container/general.py.
  - [ ] 🆗: The example from point (6.f) should be replicated, but added to the `ivy.Container` static method docstring, with `ivy.<func_name>` replaced with `ivy.Container.static_<func_name>` in the example.
  - [ ] 🆗: The example from point (6.g) should be replicated, but added to the `ivy.Container` static method docstring, with `ivy.<func_name>` replaced with `ivy.Container.static_<func_name>` in the example.

  Array Instance Method Example in ivy/array/general.py.
  - [ ] 🆗: Call this instance method of the `ivy.Array` class.

  Container Instance Method Example in ivy/container/general.py.
  - [ ] 🆗: Call this instance method of the `ivy.Container` class.

  Array Operator Examples in ivy/array/array.py.
  - [ ] ⏩: Call the operator on two `ivy.Array` instances.
  - [ ] ⏩: Call the operator with an `ivy.Array` instance on the left and `ivy.Container` on the right.

  Array Reverse Operator Example in ivy/array/array.py.
  - [ ] ⏩: Call the operator with a `Number` on the left and an `ivy.Array` instance on the right.

  Container Operator Examples in ivy/container/container.py.
  - [ ] ⏩: Call the operator on two `ivy.Container` instances containing Number instances at the leaves.
  - [ ] ⏩: Call the operator on two `ivy.Container` instances containing `ivy.Array` instances at the leaves.
  - [ ] ⏩: Call the operator with an `ivy.Container` instance on the left and `ivy.Array` on the right.

  Container Reverse Operator Example in ivy/container/container.py.
  - [ ] ⏩: Following the example in the `ivy.Container.__radd__` docstring, with the operator called with a `Number` on the left and an `ivy.Container` instance on the right.

Tests
- [ ] ✅: Docstring examples tests passing.
- [ ] ✅: Lint checks passing.
@AnnaTz this is the pull request I was asked to tag you in, for torch gather. Pytest might still throw some issues for edge cases (mostly related to dimensionality). I'm still fixing them, but it seems I've reached the point where it's all just a matter of finding the right conditions for reshaping and expanding params and indices before feeding them into the native torch.gather(). Some of those issues are related to errors I found in tensorflow's official documentation, and tensorflow's outputs are used as the ground truth.
@AnnaTz ready to review! Probably even to merge (no errors), but due to my previous experiences with pytest failing to detect serious issues for some reason, I'd greatly appreciate it if you double-checked. The function could also probably use some polishing to make it simpler; there might still be some redundant conditions, and that could use a look as well. On the plus side, ivy.gather() on the torch backend finally uses torch methods instead of borrowing from other libraries. The results of the command `pytest -s -v --full-trace --debug=pytestdebug.log ivy_tests/test_ivy/test_functional/test_core/test_general.py::test_gather[cpu-torch-False-False]` are attached in the pytestdebug.log.
pytestdebug.log
UPDATE: I ran more tests with pytest and there was indeed an oversight: adjusting the dimensions to meet torch.gather() requirements also requires adjusting the axis (if params.shape() goes from torch.Size([2]) to torch.Size([1, 2]), then axis has to compensate for the insertion of that extra dimension). I've already fixed that but not committed yet. There is also some small bug somewhere in the conditions which prevents torch.Tensor.expand() from being applied; I'm looking into that.
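A tiny sketch of that axis compensation (my illustration, not the committed code):

```python
import torch

# Unsqueezing params to satisfy torch.gather()'s same-rank requirement
# shifts every existing dimension one position to the right, so the
# gather axis has to shift with it.
params = torch.tensor([10.0, 20.0])        # torch.Size([2])
indices = torch.tensor([[1, 0]])           # torch.Size([1, 2])
axis = 0

params = params.unsqueeze(0)               # torch.Size([1, 2])
axis += 1                                  # old axis 0 now lives at position 1

out = torch.gather(params, axis, indices)
print(out)                                 # tensor([[20., 10.]])
```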
Hey @saeedashrraf, could you please take a look at this PR, given that it's about gather? Thanks 😄
I'm still fixing some issues in the code, but I'd appreciate any advice :) There's also still a possibility that it's not worth doing it the way I'm trying, and it might be better to revert gather to the version from some older commit. However, I'd still like to get this piece of code to pass all the tests.
UPDATE: OK, I think it can actually be significantly simplified in a way that won't result in weird behaviour. I'm on it; it's a bit of rewriting, but the code should end up much simpler.
UPDATE: Seems to be fixed :) But I'd appreciate a double-check of the pytest run, because bugs have already slipped past it twice for me and I'm not sure why (it's probably related to the generation of Hypothesis examples when rerunning pytest). I'm attaching my latest pytestdebug.log for you to check and compare against if you run your own. I ran pytest just for the torch backend here: `pytest -s -v --full-trace --debug=pytestdebug.log ivy_tests/test_ivy/test_functional/test_core/test_general.py::test_gather[cpu-torch-False-False]`
pytestdebug.log
@saeedashrraf the PR is ready for review.
Hey @saeedashrraf, could you please take a look at this PR, given that it's about gather? Thanks 😄
Hey @vedpatwardhan - any chance of a review :) ? It seems that the person assigned to review this PR has not been active for the last month, and the PR is probably ready to be merged (though a check through the code and a separate pytest run would definitely not hurt).
Hey @Kacper-W-Kozdon, really sorry for the delay in reviewing. @Ishticode could you please take a quick look at the changes made if we can merge it? Thanks 😄
`@reproduce_failure('6.78.1', b'AXicY2NkAAMoBaEZGRgZGZgYkABYGAACCAAQ')`
Initially, I don't see anything that would cause a problem; I checked in a Jupyter notebook (https://github.com/Kacper-W-Kozdon/notebook-testing-ivy/blob/main/Testing_Python.ipynb), under the evaluated input In[3]. I'll check with the decorator and in the ivy_dev venv, but at the very least it's nothing obvious, and the notebook doesn't throw any errors.
Hi @Kacper-W-Kozdon, hi @vedpatwardhan. Thank you so much for all the effort. Will keep your PR on priority. It fixes several issues, but I ran `test_gather` with 200 examples on our torch backend from your branch. It seems to fail with the following value mismatch:

```
E   AssertionError: the results from backend torch and ground truth framework tensorflow do not match
E   [1.]!=[0.9995117]
E
E   The mismatching elements are at `False` indices:
E
E   [False]
E
E   Falsifying example: test_gather(
E       backend_fw='torch',
E       on_device='cpu',
E       params_indices_others=(['float16', 'int32'],
E           array([-1.], dtype=float16),
E           array([[0, 0, 0],
E                  [0, 0, 0]], dtype=int32),
E           0,
E           0),
E       fn_name='gather',
E       test_flags=FunctionTestFlags(
E           ground_truth_backend='tensorflow',
E           num_positional_args=2,
E           with_out=False,
E           with_copy=False,
E           instance_method=False,
E           test_gradients=True,
E           test_trace=False,
E           transpile=False,
E           as_variable=[False],
E           native_arrays=[False],
E           container=[False],
E           precision_mode=False,
E           test_cython_wrapper=False,
E       ),
E   )
E
E   You can reproduce this example by temporarily adding
E   @reproduce_failure('6.78.1', b'AXicY2NkAAMoBaEZGRgZGZgYkABYGAACCAAQ')
E   as a decorator on your test case
```
Do you have any idea where this is emerging from, and are you able to fix it? Thanks for the effort, and let me know if something is unclear.
I'm guessing that the assertion error comes from the flag `test_gradients=True`? Could it be related to the `rtol` or `atol` params not being set in the test? Do these two params also get passed to the gradient test? If they are, I think that might be the case, since the raw outputs of the gather() function seem to be OK. I don't have enough experience to suggest good values for `rtol` and `atol`, but in the Deep Dive they are set to 10^(-3). I will also double-check how dtypes are treated in the native torch.gather(); I don't see any reason for other torch functions to affect types, though this one shouldn't either.
I ran the test with an example decorator added between `@handle_test()` and `def test_gather()`:

```python
@example(
    params_indices_others=(
        ['float16', 'int32'],
        ivy.array([-1.]),
        ivy.array([[0, 0, 0], [0, 0, 0]]),
        0,
        0,
    ),
    test_flags=FunctionTestFlags(
        ground_truth_backend='tensorflow',
        num_positional_args=2,
        with_out=False,
        with_copy=False,
        instance_method=False,
        test_gradients=True,
        test_trace=False,
        transpile=False,
        as_variable=[False],
        native_arrays=[False],
        container=[False],
        precision_mode=False,
        test_cython_wrapper=False,
    ),
    fn_name="gather",
)
```

However, I got a warning: "ivy_tests/test_ivy/test_functional/test_core/test_general.py::test_gather[cpu-torch-False-False] SKIPPED (Function not implemented in backend.)", even though the test is meant to be for the functional backend. So I'm not 100% sure this example was taken into account. The tests passed 100% (default number of examples: 25).
pytestdebug.log
Hey @Kacper-W-Kozdon, thank you for the suggestions. I don't think gradient tests should be disabled, and even with a tolerance added (I've set 1e-3; try changing it around if you like) we get significant value mismatches. For example, one I got was:

```
E   AssertionError: the results from backend torch and ground truth framework tensorflow do not match
E   [[0.20898438 0.12158203 0.12109375 0.04736328 0.        ]
E    [0.20898438 0.12158203 0.12109375 0.04736328 0.        ]]!=[[0.0625   0.0625   0.0625   0.046875 0.      ]
E    [0.0625   0.0625   0.0625   0.046875 0.      ]]
```

Could you please try `test_gather` with more examples by adding `--num-examples 200`. Please let me know further if something is unclear. Thank you very much :)
Got it, I'll update my packages (my Hypothesis version was behind) and rerun the tests with more examples. Once I see the details of the errors and the inputs, I'll try to isolate the part of the code which might be causing issues and let you know.
I ran the test with 200 examples; it passed. I repeated with 500, and I ran into a similar error. I can 100% confirm that the last test called before the failed assertions was the gradient test. Gradients run after asserting that the outputs of the gather function are correct (types and raw values). Right now I have no clue why the gradients would fail when the direct outputs of gather seemingly match. Also, I used the `@reproduce_failure` decorator with the test that failed and received the message "E hypothesis.errors.DidNotReproduce: Expected the test to raise an error, but it completed successfully.". Could you try it on your end as well with the test that failed, to see if it can be reproduced and what the last function called was? In my case, the decorator was `@reproduce_failure('6.97.0', b'AXicY2NkYmJmZmZiYGdhPMAAB4ztf37pNYAYQHz8ApCInDmDkYGZgZGRA6KeBAAygwEAIpcHTQ==')`, but I suppose that's only a local thing (I am still learning how the pytest and hypothesis packages handle the testing).
I'm attaching the example which caused the error (the indices had a big shape, so it's in a .txt for readability).
Hey @Ishticode, I'll try to either open a new issue or modify within this PR the test for gather (or rather, the helper function used in it) to limit the values in the input array to what can be correctly represented in a given dtype. The current test uses a helper function which does not have arguments to control the range of the input values, but it calls `dtypes_and_values()`, which has such arguments. It seems that this (when fed into the gradient test) might be causing the errors, rather than the `gather()` function itself. A rough sketch of the idea is below.
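As a plain Hypothesis sketch of that direction (the strategy name and bound choices are my own assumptions, not ivy's actual helper signature):

```python
from hypothesis import strategies as st

# Hypothetical strategy: keep generated floats small enough that the
# sampled dtype retains precision, e.g. float16 only represents
# integers exactly up to 2**11.
@st.composite
def dtype_bounded_values(draw):
    dtype = draw(st.sampled_from(["float16", "float32"]))
    bound = 2.0**11 if dtype == "float16" else 2.0**24
    values = draw(
        st.lists(st.floats(min_value=-bound, max_value=bound), min_size=1, max_size=8)
    )
    return dtype, values
```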
I'd need PR #28168 reviewed and merged first in order to solve the issue of some tests failing due to the mismatch between values and dtype (values too big for the dtype to retain the necessary precision). Based on the feedback from Ved, that should be enough to clear all the tests, given that the errors show up in the gradient test triggered through the test_gather flag.
Ready for review (and merge) together with https://github.com/unifyai/ivy/pull/28168
@Ishticode would you mind taking a look at https://github.com/unifyai/ivy/pull/28168 alongside this PR? The tests shouldn't throw any more errors once these two are merged.
Hi @Kacper-W-Kozdon, thanks for the fixes. The tests all look fine. You really have put a lot of effort into this :) I might ask @vedpatwardhan to have a quick look before merging. Sorry for the delays on my end. One question I still have: in your initial comment you said the torch backend implementation was wrong and pytest wasn't catching it. As far as I can see, we have not broadened the test to include newer cases; we have only restricted it somewhat with safety and tolerances. So how do we know that those cases that were previously undetected are still not going unnoticed?
I'm not fully sure, but I think my pytest was not installed properly at that time. gather() was my first contribution, and the tests were passing even though they shouldn't have, which I didn't realise at first (after I reinstalled everything, they were indeed throwing errors).
As for the recent errors, https://github.com/unifyai/ivy/pull/28168 solved them; the tests for gather were generating values too big for their `dtype`, which was causing assertion errors beyond the tolerances. So it's the combination of both of these PRs that solved the issues.
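A minimal illustration of that failure mode (my own example): large values simply stop being representable in float16, so comparisons against a higher-precision ground truth trip any reasonable tolerance:

```python
import numpy as np

# float16 spacing at 2048 is 2.0, so 2049 rounds away;
# anything above ~65504 overflows to inf.
print(np.float16(2049.0))  # 2048.0
print(np.float16(1e5))     # inf
```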
lgtm! Could you please revert the changes to the `demos` submodule? The PR should be good to merge once that's done @Ishticode. Thanks @Kacper-W-Kozdon 😄
Done, docs/demos changes reverted 👍
@vedpatwardhan @Ishticode I explicitly added the tolerance for `bfloat16`; hopefully this will prevent the gradient test tripping the tolerance assertion for this dtype. Other than that, it seems to be ready.
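For reference (my illustration, not the test code): bfloat16 stores only 7 mantissa bits, so its rounding error is much coarser than float16's, which is why it needs its own, looser tolerance:

```python
import torch

# bfloat16 rounds 0.1 to about 0.10009765625, a relative error of roughly
# 1e-3 -- already at the edge of the default tolerances.
print(torch.tensor(0.1, dtype=torch.bfloat16).item())
```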
This PR has been labelled as stale because it has been inactive for more than 7 days. If you would like to continue working on this PR, then please add another comment or this PR will be closed in 7 days.