[FEA] Use table_view interface in scatter-by-foo API in CUDF
Both the scatter-like and gather-like Cython interfaces to libcudf's copying API accept a list of source and target columns as a "table". This feature is, however, only used in the front-facing API by the gather-like calls.
For scatter-like calls on a dataframe, the final result is usually achieved by creating series for each column to scatter to/from and calling the singleton scatter on the series before reconstructing. This is more wasteful than it needs to be and we should probably instead just create the list of columns we wish to scatter and scatter that instead.
For example, the low-level Cython interface to scatter accepts a list of source and target columns:
https://github.com/rapidsai/cudf/blob/8055c2db6f80286b64d36f3927510bcf2e0eec02/python/cudf/cudf/_lib/copying.pyx#L240
However, the only caller in the python codebase is ColumnBase._scatter_by_column https://github.com/rapidsai/cudf/blob/8055c2db6f80286b64d36f3927510bcf2e0eec02/python/cudf/cudf/core/column/column.py#L647-L682 where we only have a single column to hand and hence can only pass one column.
This is usage is a consequence of __setitem__ on DataFrames ultimately being implemented by spinning over the columns (as series) and calling setitem on each series in turn. For example, DataFrameIlocIndexer.__setitem__ has this code (when setting columns using a dataframe as the rvalue):
https://github.com/rapidsai/cudf/blob/8055c2db6f80286b64d36f3927510bcf2e0eec02/python/cudf/cudf/core/dataframe.py#L466-L477
This calls (on line 474) Series.__setitem__ on each which either goes to series.iloc.__setitem__ or series.loc.__setitem__, and eventually devolves to ColumnBase.__setitem__.
(Aside, this is actually a bug, since iloc.__setitem__ should not do index alignment which loc will do).
Ignoring that bug issue, this code is somewhat wasteful in a number of ways:
- we eventually throw away the
Seriesobject when reconstructing the dataframe, so should just operate on columns - at this point in
DataFrame.iloc.__setitem__we already know everything about the key (and hence which low-level libcudf operation to dispatch to) so we should do that rather than going through the front door ofSeries.__setitem__which will endeavour to rediscover everything we already know. - we call into libcudf n times (once for each column), rather than once with n columns, which negates the ability to use the stream parallelism that is implemented in the
scattercode in libcudf
To fix this, I propose that DataFrame and Series should both get a new set of methods. Here's a first pass at what these should be called/their interface. These are internal and should do no argument normalisation or bounds-checking (it being the responsibility of the caller to sanitise appropriately), this includes kind/dtype casting as appropriate of the source and target.
_scatter_by_indices(self, source_columns, index_column, *, keep_index=True)_scatter_by_mask(self, source_columns, mask_column, *, keep_index=True)_scatter_by_slice(self, source_columns, slice, *, keep_index=True)
Since there is also a broadcasting scatter_from_scalar operation in libcudf we'll probably need to support that too. This should be done with a value_type tag in the interface (rather than isinstancing on the source_columns) since, again, the caller must have already determined this information.
@wence- Can you add links to the relevant sections of code that need changed, to make this more actionable? Are all the changes we need in https://github.com/rapidsai/cudf/blob/branch-23.08/python/cudf/cudf/_lib/copying.pyx , or do the callers of those functions also need changes?
@wence- Can you add links to the relevant sections of code that need changed, to make this more actionable?
Good point, added a bunch more information.
Are all the changes we need in https://github.com/rapidsai/cudf/blob/branch-23.08/python/cudf/cudf/_lib/copying.pyx , or do the callers of those functions also need changes?
Callers of those functions need to change (indeed, they need to exist, since currently almost all the calls into copying.pyx are via either Column objects or else on Series).
Related: #13532