vaex icon indicating copy to clipboard operation
vaex copied to clipboard

Interchange protocol fixes and updates

Open honno opened this issue 2 years ago • 5 comments

The interchange protocol had some spec changes recently (https://github.com/data-apis/dataframe-api/pull/74), so this PR namely updates vaex to conform with them.

  • Column.size()
    • Resolves https://github.com/vaexio/vaex/issues/2093 (was useful so I could test the updates with pandas.interchange)
    • Makes it a method, not a property, as per spec changes
  • Column.describe_categorical
    • Resolves https://github.com/vaexio/vaex/issues/2113
    • Per https://github.com/data-apis/dataframe-api/pull/69, updates it to return categories as opposed to mapping, which is a Column containing the labels
    • Update existing tests to use new Column.describe_categorical behaviour
  • Categorical columns with string labels weren't interchanged correctly (from vaex and other adopting libraries), so I made the necessary fixes to convert_categorical_column
  • Fixed some internal type hints
  • Wrote some xfailed tests from some stirng column interchange things I encountered, which should be useful down-the-line.

There's new behaviour with datatimes (NaTs are now sentinel values), but they first require vaex to support interchanging datetimes generally, so would be better served in a future PR.

cc @maartenbreddels

honno avatar Aug 04 '22 13:08 honno

I didn't realise CI has 3.6 and 3.7 jobs. I see https://github.com/vaexio/vaex/pull/1802, so assume these versions want to be generally maintained for? I have used f-stirngs in this PR, although note the df protocol implementation and tests already used them.

honno avatar Aug 04 '22 13:08 honno

Forgot to update some refs of size (i.e. use size() instead), so made the necessary updates and wrote a test. Found a pre-existing bug with Column.get_chunks() that now has some xfailed cases... probably something to fix in a future PR I imagine.

Otherwise I think I've addressed your comments now :)

As an aside, might I ask if you still need to maintain for out-dated Python versions? From a selfish perspective, removing those jobs would bring CI to green :sweat_smile:

honno avatar Sep 05 '22 09:09 honno

Yeah, Python 3.6 is still being used by clients of a client, so I'm still stuck with that :) Note that 3.6 has f-strings, just not that = syntax in the f-string.

maartenbreddels avatar Sep 06 '22 07:09 maartenbreddels

pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds

This seems like a legit issue :/

maartenbreddels avatar Sep 06 '22 07:09 maartenbreddels

pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds

This seems like a legit issue :/

I wasn't even scrolling down the failing jobs heh, didn't notice all win+mac jobs were failing. ~Seems to be a pre-existing problem, looking at the log for an older CI run, so I'd suggest we leave it for a future PR. Ideally we'd xfail those tests for those platforms—is there any platform mechanism for that already exists in vaex that I should use, or shall I try come up with something?~ Seems like an introduced issue in how I conform to the spec changes. Given the OS requirements and my ignorance on vaex/pyarrow here, I'm not particularly motivated to debug this haha, so platform-specific xfails still feel appropriate (seeing this PR is still a plus in that it conforms to the spec and can actually mostly work for a lot of interchange, if just for linux).

honno avatar Sep 06 '22 08:09 honno

Ok, sorry this takes so long, but apart from being very busy, this was difficult to debug.

The issue seems to be that convert_categorical_column calls convert_string_column, calling buffer_to_ndarray which passes the pointer to numpy, but but the nobody has a reference to the _VaexBuffer anymore, which releases the reference to the arrow array, which then release the underlying memory.

In this specific case, we get the Arrow buffers via bitmap_buffer, offsets, string_bytes = self._col.evaluate().buffers() which returns an arrow array Vaex' doesn't hold the reference to any more (as will be the case for virtual column as well).

So the question is, using the .ptr mechanism, who's responsible for holding a reference to the underlying memory?

maartenbreddels avatar Sep 30 '22 09:09 maartenbreddels

Looking at the diff again, this seems like it was an underlying issue that's only come up because this PR handles string labels correctly for categoricals? Mind I'm pretty ignorant when it comes to memory management (seems like a nightmare to debug heh).

So if this isn't an introduced bug and what was working still works, I'd recommend to just xfail the failing test cases if you don't have the bandwidth to fix this right now. I did introduce tests covering string behaviour generally, which should help prevent regressions when attempting to fix this.

honno avatar Sep 30 '22 10:09 honno

Yeah, feel free to xfail that test for now, on all platform

maartenbreddels avatar Sep 30 '22 12:09 maartenbreddels

Yeah, feel free to xfail that test for now, on all platform

Done!

honno avatar Oct 04 '22 15:10 honno