
[Data] Add support for objects to Arrow blocks

Open terraflops1048576 opened this issue 1 year ago • 1 comment

Why are these changes needed?

Currently, Ray does not support blocks/batches that contain arbitrary Python objects in one column and multi-dimensional arrays in another. Ray Data throws exceptions when such batches are provided, because:

  1. Since there's an arbitrary object in the batch, the Arrow block format fails with an ArrowNotImplementedError mentioning dtype 17 (NumPy's object dtype). Ray then falls back to pd.DataFrame(dict(batch)) in BlockAccessor.batch_to_block.
  2. However, this particular DataFrame constructor does not support columns containing numpy.ndarray objects, so it throws the exception listed in the linked issue. (A hedged repro of both steps follows below.)
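
A hedged repro of that two-step failure, assuming a batch with one object column and one multi-dimensional array column (exact exception types and messages vary across pyarrow/pandas versions):

```python
import numpy as np
import pandas as pd
import pyarrow as pa

batch = {
    "objs": np.array([object(), object()], dtype=object),
    "tensor": np.zeros((2, 3, 3)),
}

# Step 1: Arrow cannot infer a type for arbitrary Python objects.
try:
    pa.table({"objs": pa.array(batch["objs"])})
except (pa.ArrowInvalid, pa.ArrowNotImplementedError) as e:
    print("Arrow block path failed:", e)

# Step 2: the pandas fallback rejects multi-dimensional per-column arrays.
try:
    pd.DataFrame(dict(batch))
except ValueError as e:
    print("pandas fallback failed:", e)
```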

This change enables storing Python objects in Arrow blocks by defining an Arrow extension type that simply represents the Python objects as variable-sized large binary values. I suppose the alleged performance benefits listed in the code comments are an added bonus.
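
For illustration, here's a minimal sketch of the extension-type idea (not the actual PR code; the type name, extension name, and helpers are hypothetical), using PyArrow's ExtensionType API to back pickled objects with large_binary storage:

```python
import pickle

import pyarrow as pa


class PythonObjectType(pa.ExtensionType):
    """Hypothetical extension type: each value is one pickled Python object."""

    def __init__(self):
        super().__init__(pa.large_binary(), "example.python_object")

    def __arrow_ext_serialize__(self) -> bytes:
        return b""  # The type has no parameters to serialize.

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


pa.register_extension_type(PythonObjectType())


def objects_to_arrow(objs) -> pa.ExtensionArray:
    """Pickle each object into a variable-sized large_binary slot."""
    storage = pa.array([pickle.dumps(o) for o in objs], type=pa.large_binary())
    return pa.ExtensionArray.from_storage(PythonObjectType(), storage)


def arrow_to_objects(arr: pa.ExtensionArray) -> list:
    """Unpickle the storage back into Python objects."""
    return [pickle.loads(b) for b in arr.storage.to_pylist()]


# Usage: the extension column slots into an ordinary Arrow table.
table = pa.table({"objs": objects_to_arrow([{"a": 1}, [1, 2, 3]])})
```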

I'm not sure that this is the correct approach or that I've patched all of the relevant places, so some help would be appreciated!

Related issue number

Resolves #45235

Checks

  • [X] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [X] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [X] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

terraflops1048576 avatar May 11 '24 22:05 terraflops1048576

@terraflops1048576 this could be a big contribution, but we want to think a little deeper about it with you to properly co-develop it; can you reach out to me on Ray Slack so we can set up some time to discuss further?

My handle is "Sam (Ray Team)"

We should get a formal REP in ray-enhancements for this as well: https://github.com/ray-project/enhancements

cc @bveeramani

anyscalesam avatar May 16 '24 21:05 anyscalesam

@raulchen is this ready to merge?

anyscalesam avatar Jun 25 '24 21:06 anyscalesam

Sorry, I had some disruptions last week; doing the second pass of review now. Also, could you resolve the merge conflicts when you have a chance? Thanks.

raulchen avatar Jun 25 '24 22:06 raulchen

After trying to fix the test failures related to the new exception info, I realized that this approach is actually rather problematic: it swallows way too many errors that would otherwise be reported to the user, like the overflow error resulting from [1, 2**100], or from trying to concatenate np.array([2**100]) and np.array([1]). After all, it is perfectly legal to pickle all of these things and store them in a column, but we obviously don't want to do that for performance reasons; we would rather raise an error and tell the user to adjust the problematic values.
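
For concreteness, a small sketch of the two cases mentioned (exact exception types vary by version, so the repro hedges):

```python
import numpy as np
import pyarrow as pa

# Case 1: 2**100 does not fit in int64, so Arrow rejects the column...
try:
    pa.array([1, 2**100])
except (OverflowError, pa.ArrowInvalid) as e:
    print("Arrow rejects the overflowed int:", type(e).__name__)

# Case 2: ...while NumPy silently widens to dtype=object, which looks
# exactly like a column of arbitrary Python objects to the fallback path.
merged = np.concatenate([np.array([2**100]), np.array([1])])
print(merged.dtype)  # object
```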

I'm not sure what we should do in the np.ndarray case, since arrays are technically objects, but maybe we shouldn't fall back at all if anything related to ArrowTensorArray fails.

terraflops1048576 avatar Jun 26 '24 08:06 terraflops1048576

Some tests are failing. Can you take a look? https://buildkite.com/ray-project/microcheck/builds/2214

raulchen avatar Jun 26 '24 23:06 raulchen

I've tried to separate out overflows and some other errors, like mismatched float/int types, so that they keep being reported as errors (as in the tests I fixed) instead of being silently saved as objects in the PyArrow table, primarily to avoid unintended consequences.

You are correct that if we caught all ArrowConversionErrors and turned the offending columns into objects, then as long as a block contained only non-overflowed integers, it would still be stored as an integer type. Currently, np.ndarrays with object dtype (such as when a user puts in overflowed integers) are just pickled as objects, because it's really hard to detect whether the object dtype is the result of overflowed integers or of genuinely arbitrary objects.
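
To illustrate why that detection is awkward, a hypothetical (and deliberately naive) heuristic would have to inspect every element of the column:

```python
import numpy as np


def looks_like_overflowed_ints(arr: np.ndarray) -> bool:
    """Hypothetical check: does an object-dtype array hold only Python ints,
    at least one of which is too large for int64?"""
    if arr.dtype != object:
        return False
    if not all(isinstance(x, int) for x in arr):
        return False
    # Flag the array only if some value falls outside the int64 range.
    return any(not (-2**63 <= x < 2**63) for x in arr)


print(looks_like_overflowed_ints(np.array([2**100])))    # True
print(looks_like_overflowed_ints(np.array([{"a": 1}])))  # False
```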

terraflops1048576 avatar Jun 27 '24 03:06 terraflops1048576

@terraflops1048576 the CI failures are unrelated and already fixed in master. Can you merge the latest master again?

raulchen avatar Jul 11 '24 18:07 raulchen

I'm not sure why the tests that I fixed here weren't broken before, but I included a fix.

Interestingly enough, the documentation for ray.data.Dataset.union says:

The datasets must have the same schema as this dataset, otherwise the behavior is undefined.

However, this isn't actually the case anymore; tests like python/ray/data/tests/test_consumption.py::test_union depend on unioning datasets whose schemas differ. Nevertheless, I fixed unify_schemas to reflect this behavior (see the sketch below).
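
The unify_schemas in question is Ray Data's internal helper, but PyArrow's built-in pa.unify_schemas gives the flavor of the merging involved; a quick sketch:

```python
import pyarrow as pa

s1 = pa.schema([("a", pa.int64())])
s2 = pa.schema([("a", pa.int64()), ("b", pa.string())])

# Fields are merged by name; the union of all fields is kept.
print(pa.unify_schemas([s1, s2]))
# a: int64
# b: string
```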

Also, while doing this fix I realized today that PyArrow tables can actually have duplicate column names. This is an edge case that I don't see handled anywhere (in particular, converting to pandas, which does not allow duplicates, seems broken). I added a check that just throws an exception when it encounters this case.
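
For reference, Arrow itself is happy to build such a table, which is roughly the case the new check guards against (the guard shown here is a hypothetical sketch, not the PR's exact code):

```python
import pyarrow as pa

t = pa.Table.from_arrays(
    [pa.array([1, 2]), pa.array([3, 4])],
    names=["x", "x"],  # duplicate names are legal in Arrow
)
print(t.column_names)  # ['x', 'x']

# Hypothetical guard in the spirit of the fix:
if len(set(t.column_names)) != len(t.column_names):
    raise ValueError(f"Duplicate column names: {t.column_names}")
```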

terraflops1048576 avatar Jul 13 '24 17:07 terraflops1048576