nanobind bind_map/bind_vector: allow customizing the getitem return_value

I'm as much a fan as anyone of safe-by-default, but I also maintain some code that relied on the old reference semantics for bind_map/bind_vector, including taking the appropriate level of care not to store the item-reference Python objects past the lifetime of their C++ referent. In order to avoid needing to maintain a local copy of these binding functions just to change one word, I'm hoping you'll consider this PR, which keeps the safe default but allows it to be changed using a 2nd template argument to bind_map/bind_vector. It also includes some expanded discussion in the docs, which reflects that the reference semantics are much more unsafe for std::vector (where inserting or erasing any element can invalidate references to all elements) than they are for std::map/unordered_map (where the only thing that can invalidate references to an element is removing that specific element).

This changes the default getitem rv_policy from std::is_pointer_v<Value> ? rv_policy::take_ownership : rv_policy::copy to rv_policy::automatic_reference. The main difference is that a non-intrusive-refcounting pointer will use reference instead of take_ownership, which seems like an obvious win to me. (When intrusive refcounting is used, the rv_policy is always overridden to be take_ownership in nb_type_put_common(), so it doesn't matter what we pass in that case.) A secondary difference is that if the map iterator returns a temporary value (rather than a reference) we can now move out of it. (Such iterators don't currently work for other reasons, but I'm about to put up a PR that fixes those.)

While working on this, I noticed that the container iterators still use the reference_internal policy. I'm not sure if you intended this or not; I can see the argument that someone is less likely to hold onto an object they got during iteration, but it's certainly not impossible. I didn't make any changes here, but if you'd like to, I would suggest using the same getitem_policy for the iterator result types as well.

Mar 06 '24 07:03 oremanj

In the case that this API is used to create an unsafe container, should it be a sort of "structural static read-write view"? In the sense that we would simply not bind functions that are almost certainly going to segfault your process: insertion, deletion, etc.

Mar 06 '24 12:03 wjakob

I added some examples of innocuous-looking UB.

I was thinking of having a custom iterator that reindexes into the arrays/maps instead of using original STL iterators. It would need to store a strong reference to the Python object along with the current index or key.

That could guard against some of the dangers of concurrent modification during iteration, but it basically just reduces the iterator-safety problem to the __getitem__-safety problem. Imagine:

struct Inner { int v; };
struct Outer { Inner inner; };
nb::class_<Inner>(m, "Inner").def_rw("v", &Inner::v);
nb::class_<Outer>(m, "Outer").def_rw("inner", &Outer::inner);
nb::bind_vector<std::vector<Outer>>(m, "OuterVec");

After some Python code like:

vec = ext.OuterVec()
for i in range(10):
    elem = ext.Outer()
    elem.inner.v = i
    vec.append(elem)

values = [elem.inner for elem in vec if elem.inner % 2 == 0]

you now have an array of values that refer to memory owned by vec, which can be invalidated by various things you might later do to it. (This is because def_rw uses rv_policy::reference_internal by default, but it's kind of hard for it to do its job if it doesn't.)

Currently __getitem__ copies by default, but the iterator __next__ still returns a reference, so it presents the same sort of problems that __getitem__ used to present. I think it might be more consistent to have them all use the same RVP, but of course that would also increase the potential for confusion and breakage of existing code.

In the case that this API is used to create an unsafe container, should it be a sort of "structural static read-write view"? In the sense that we would simply not bind functions that are almost certainly going to segfault your process: insertion, deletion, etc.

If the user has read our extensive warnings and decided to do the unsafe thing anyway, I don't really think it's our place to try to prevent them from doing what they want to do. It's perfectly possible to use a reference-semantic vector in ways that don't crash your program, even without forbidding insertions and deletions; C++ developers do it all the time! I also think it's telling that bind_vector has been in production use for quite a while (certainly if we count pybind) without generating a notable volume of user complaints of the sort that prompted the change to copy semantics. It's not safe in reference mode, but I think "almost certainly going to segfault your process" is going a little far.

Mar 06 '24 12:03 oremanj

voicing my support for this feature https://github.com/wjakob/nanobind/discussions/564

May 07 '24 03:05 jeffective

Sorry about taking a while to get to this. This change looks good, and I can't come up with a better way of doing things. Thank you for the nice documentation and helpful changes as always @oremanj .

May 23 '24 03:05 wjakob

FYI: https://github.com/wjakob/nanobind/commit/0e1fe84fa2965c364fc2cf465d7626d0a19228d7 makes a similar change for the iterators. The STL vector/map bindings pass their Policy argument to make_*_iterator

May 23 '24 04:05 wjakob

bind_map/bind_vector: allow customizing the getitem return_value_policy