vaex icon indicating copy to clipboard operation
vaex copied to clipboard

dont use take with arrow

Open Ben-Epstein opened this issue 2 years ago • 3 comments

This directly addresses https://github.com/vaexio/vaex/issues/2335 And is directly the fix for https://issues.apache.org/jira/browse/ARROW-9773 (which is now https://github.com/apache/arrow/issues/33049)

I believe fixing all of the .takes to .slice would also fix https://github.com/vaexio/vaex/issues/2334 because .take uses memory, but .slice is zero-copy. The memory is exploding going to hdf5 because we keep .takeing

You can see huggingface datasets does the same thing: https://github.com/huggingface/datasets/pull/645/files

That being said, there are a number of other places that vaex uses .take which should be fixed. But because of the lack of typing in the vaex repo, it's hard for me to know which ones are pyarrow arrays, which are pyarrow tables, and which are numpy arrays. I'm happy to help move the rest over, but I would need some guidance.

Here are all of the places .take is used

./packages/vaex-core/vaex/functions.py:    values = choices.take(indices)
./packages/vaex-core/vaex/dataframe.py:                    used_keys = keys.take(codes)
./packages/vaex-core/vaex/dataframe.py:                        keys = keys.take(indices)
./packages/vaex-core/vaex/dataframe.py:    def take(self, indices, filtered=True, dropfilter=True):
./packages/vaex-core/vaex/dataframe.py:        >>> df.take([0,2])
./packages/vaex-core/vaex/dataframe.py:        df.dataset = df.dataset.take(indices)
./packages/vaex-core/vaex/dataframe.py:        return self.take(indices)
./packages/vaex-core/vaex/dataframe.py:        return self.take(indices).split(into)
./packages/vaex-core/vaex/dataframe.py:        return self.take(indices)
./packages/vaex-core/vaex/cpu.py:                keys = keys.take(indices)
./packages/vaex-core/vaex/cpu.py:        keys = keys.take(order)
./packages/vaex-core/vaex/join.py:            left = left.concat(left.take(lookup_left))
./packages/vaex-core/vaex/join.py:            left = left.take(left_indices_matched, filtered=False, dropfilter=False)
./packages/vaex-core/vaex/join.py:            right_dataset = right_dataset.take(lookup, masked=any(lookup_masked))
./packages/vaex-core/vaex/dataset.py:    def take(self, indices, masked=False):
./packages/vaex-core/vaex/dataset.py:        return DatasetTake(self, indices, masked=masked)
./packages/vaex-core/vaex/dataset.py:class DatasetTake(DatasetDecorator):
./packages/vaex-core/vaex/hash.py:        keys = pa.compute.take(keys, indices)
./packages/vaex-core/vaex/array_types.py:def take(ar, indices):
./packages/vaex-core/vaex/array_types.py:    return ar.take(indices)
./packages/vaex-core/vaex/groupby.py:                        self.bin_values = pa.compute.take(self.bin_values, self.sort_indices)
./packages/vaex-core/vaex/groupby.py:                        bin_values[field.name] = ar.take(indices)
./packages/vaex-core/vaex/groupby.py:                    bin_values[parent.label] = parent.bin_values.take(indices)
./packages/vaex-core/vaex/groupby.py:            self.bin_values = pa.compute.take(self.bin_values, sort_indices)
./packages/vaex-core/vaex/groupby.py:            values = pa.compute.take(values, indices)
./packages/vaex-core/vaex/groupby.py:                        ar = np.take(ar, sort_indices, axis=i)
./packages/vaex-core/vaex/groupby.py:                            columns[by.label] = vaex.array_types.take(by.bin_values, index)
./packages/vaex-core/vaex/groupby.py:                        columns[by.label] = by.bin_values.take(indices)
./packages/vaex-core/vaex/column.py:            ar = ar_unfiltered.take(vaex.array_types.to_arrow(take_indices))
./packages/vaex-core/vaex/column.py:        x = x.dictionary.take(x.indices)  # equivalent to PyArrow 5.0.0's dictionary_decode() but backwards compatible

Ben-Epstein avatar Feb 09 '23 17:02 Ben-Epstein

The memory is exploding going to hdf5 because we keep .takeing

Yeah, when using sliced arrays, that seems to be the case. It will try to concatenate them first, which will explode the memory use!

This is quite a bad arrow situation... :/ digesting this

maartenbreddels avatar Feb 13 '23 15:02 maartenbreddels

@maartenbreddels i updated the function to use the array_types take function, but i wasn't able to update the others. take is used in a bunch of places and they aren't always from a pyarrow array. Without typing, it's not clear to me when it's on a dataframe, or a dataset, or something else.

I tried to dig into the dataframe take which led to the dataset take and then this DatasetTake class but I can't really understand what it's doing. Would you mind helping me out with that part?

Ben-Epstein avatar Feb 13 '23 18:02 Ben-Epstein

I'll scan over it tomorrow!

maartenbreddels avatar Feb 13 '23 19:02 maartenbreddels