pybind11 icon indicating copy to clipboard operation
pybind11 copied to clipboard

Passing reference arguments to trampoline methods [smart_holder]

Open rhaschke opened this issue 4 years ago • 14 comments

This replaces #2911 and mimics #2915 in master.

This introduces a set of new unittests revealing that reference arguments passed to trampoline methods, i.e. Python-overridden virtual methods of C++ classes, are not actually passed by reference, but by copy. This is due to the fact that return_value_policy::copy is used by default to pass from C++ to Python. Using return_value_policy::reference for calls from those trampoline methods essentially solves the issue.

rhaschke avatar Mar 23 '21 09:03 rhaschke

Yes, it is. See #2915 where I request this change in master.

rhaschke avatar Mar 23 '21 17:03 rhaschke

@rwgk, I think I have addressed all your feedback

rhaschke avatar Mar 24 '21 12:03 rhaschke

Thanks Robert! I'll run this through our global testing system.

rwgk avatar Mar 24 '21 12:03 rwgk

I'm seeing ~10 test failures in our global testing. It is running a very large number of tests, i.e. this is a very rare failure, but I verified manually that it is reproducible and also generates a stack-use-after-return failure when running with ASAN. One of the crashes involves this open-source code:

https://github.com/google/mediapipe/blob/a92cff7a60031f5c3097b06e74416732d85b5011/mediapipe/python/pybind/packet_getter.cc#L352

  m->def(
      "_get_proto_type_name",
      [](const Packet& packet) {
        return packet.GetProtoMessageLite().GetTypeName();
      },
      py::return_value_policy::move);

Drilling down, the signatures are:

const proto_ns::MessageLite& GetProtoMessageLite() const;
virtual std::string GetTypeName() const = 0;

It seems to boil down to py::return_value_policy::move used in combination with a std::string by-value return. Do you have an idea how this PR could lead to this breaking?

I'm pretty sure (but have not verified) that this code does not use smart_holder.

rwgk avatar Mar 25 '21 01:03 rwgk

Not sure. Need to investigate. What's the failure? stack-use-after-return only? So I will need valgrind to validate?

rhaschke avatar Mar 25 '21 16:03 rhaschke

Not sure. Need to investigate. What's the failure? stack-use-after-return only? So I will need valgrind to validate?

FYI: After I saw your message I tried to put together a minimal reproducer based on the string-return-by-value-combined-with-policy_move suspicion, but that works just fine. I started drilling down into the original failure (segfault without ASAN, stack-use-after-return with ASAN) and it's looking complicated ... I'll try to drill down more.

rwgk avatar Mar 25 '21 18:03 rwgk

Not sure. Need to investigate. What's the failure? stack-use-after-return only? So I will need valgrind to validate?

FYI: After I saw your message I tried to put together a minimal reproducer based on the string-return-by-value-combined-with-policy_move suspicion, but that works just fine. I started drilling down into the original failure (segfault without ASAN, stack-use-after-return with ASAN) and it's looking complicated ... I'll try to drill down more.

Sharing the next observation:

  • The test works with the smart_holder branch as-is.
  • But simply applying the change to pytypes.h breaks it.
-    template <return_value_policy policy = return_value_policy::automatic_reference, typename... Args>
+    template <return_value_policy policy = return_value_policy::reference, typename... Args>
     object operator()(Args &&...args) const;                                   
-    template <return_value_policy policy = return_value_policy::automatic_reference, typename... Args>
+    template <return_value_policy policy = return_value_policy::reference, typename... Args>

I'm still (very) unclear how that change leads to the breakage.

rwgk avatar Mar 25 '21 18:03 rwgk

Below is the documentation for return_value_policy::reference and return_value_policy::automatic_reference. It seems to me that the failing test must be connected to a difference in behavior for non-pointer returns.

@rhaschke, do you have an idea for a minimal reproducer based on that theory? The failing test is very complex, I'm hoping you have an idea that saves me the effort trying to understand and reduce it.

+--------------------------------------------------+----------------------------------------------------------------------------+
| :enum:`return_value_policy::reference`           | Reference an existing object, but do not take ownership. The C++ side is   |
|                                                  | responsible for managing the object's lifetime and deallocating it when    |
|                                                  | it is no longer used. Warning: undefined behavior will ensue when the C++  |
|                                                  | side deletes an object that is still referenced and used by Python.        |
+--------------------------------------------------+----------------------------------------------------------------------------+
| :enum:`return_value_policy::automatic`           | **Default policy.** This policy falls back to the policy                   |
|                                                  | :enum:`return_value_policy::take_ownership` when the return value is a     |
|                                                  | pointer. Otherwise, it uses :enum:`return_value_policy::move` or           |
|                                                  | :enum:`return_value_policy::copy` for rvalue and lvalue references,        |
|                                                  | respectively. See above for a description of what all of these different   |
|                                                  | policies do.                                                               |
+--------------------------------------------------+----------------------------------------------------------------------------+
| :enum:`return_value_policy::automatic_reference` | As above, but use policy :enum:`return_value_policy::reference` when the   |
|                                                  | return value is a pointer. This is the default conversion policy for       |
|                                                  | function arguments when calling Python functions manually from C++ code    |
|                                                  | (i.e. via handle::operator()). You probably won't need to use this.        |
+--------------------------------------------------+----------------------------------------------------------------------------+

rwgk avatar Mar 25 '21 18:03 rwgk

I think the doc simply states that also pointers converted by reference, i.e. not taking ownership. The key difference in behavior is that everything is converted as reference (instead of copy for non-pointers). Do you have a stack trace of the failing test? Particularly, do you know from which piece of code the invalidated stack is accessed?

rhaschke avatar Mar 25 '21 19:03 rhaschke

I think the doc simply states that also pointers converted by reference, i.e. not taking ownership. The key difference in behavior is that everything is converted as reference (instead of copy for non-pointers). Do you have a stack trace of the failing test? Particularly, do you know from which piece of code the invalidated stack is accessed?

I looked at the stack trace and code for almost two hours but I'm still confused, it's a very complex setup. When running with ASAN a certain pointer is a nullptr. When running without ASAN it's not a nullptr, but dereferencing the pointer leads to a segfault, which means it must be garbage. It could take a long time (hours) to reduce. I'll try to find time later, but if you have any idea for a unit test that is sensitive to the behavior change caused by the pytypes.h code change, that may help a lot.

rwgk avatar Mar 25 '21 19:03 rwgk

I don't think strings are affected by the PR at all: The corresponding caster ignores the return_value_policy and always copies: https://github.com/pybind/pybind11/blob/1bafd5db5fec75bafd6986af8157a61035aced60/include/pybind11/cast.h#L410-L416

rhaschke avatar Mar 25 '21 19:03 rhaschke

Ah. I just noticed that GetTypeName() is a virtual (even abstract) method, which is probably then implemented in Python. I will have a look into that direction now.

rhaschke avatar Mar 25 '21 19:03 rhaschke

I couldn't find an issue when tracing the call hierarchy of this example:

struct StringReturner {
    virtual std::string get_string() { return std::string("foo"); }
};

py::classh<StringReturner, PyStringReturner>(m, "StringReturner")
	.def(py::init<>())
	.def("get_string", &StringReturner::get_string);

m.def("check_string", [](StringReturner &caller) {
	return caller.get_string();
}, py::return_value_policy::move);
class PyStringReturner(m.StringReturner):
    def get_string(self):
        return "bar"

rhaschke avatar Mar 25 '21 20:03 rhaschke

This may be affected by the whitespace changes under PR #3073. Sorry for the trouble. I looked into rebasing but it's not easy even without the whitespace changes (I think). There are merge conflicts in these two files:

tests/test_class_sh_basic.cpp
tests/test_class_sh_trampoline_basic.cpp

rwgk avatar Jul 03 '21 00:07 rwgk