dask-awkward icon indicating copy to clipboard operation
dask-awkward copied to clipboard

`_finalize_array` must call `ak.enforce_type` for concatenating categoricals

Open douglasdavis opened this issue 1 year ago • 3 comments

Test failure:

FAILED tests/test_str.py::test_to_categorical - NotImplementedError: merging categorical arrays is currently not implemented. Use `ak.enforce_type` to drop the categorical type and use general merging.

When .compute() is called on a categorical array collection there is a call to ak.concatenate which happens in dask_awkward.lib.core._finalize_array, we need to add a check so that we can call enforce_type when necessary

douglasdavis avatar Dec 05 '23 22:12 douglasdavis

Or perhaps we require a dak.enforce_type call at the end of a task graph that converges to some array of categoricals. cc @agoose77 for your take!

douglasdavis avatar Dec 07 '23 03:12 douglasdavis

When merging categoricals, we want to preserve the uniqueness of the categorical content. If all contents are identical, and therefore only the indices vary, we could define a trivial merge rule. If this is not the case, then we'd need to re-categorise the result.

Right now, Awkward does neither; 2.5.0 just ensures that we properly complain to the user. Users can drop the categorical-ness at the point of computation to ensure that we drop the categorical-ness, and then re-categorize with ak.str.to_categorical after computation.

@jpivarski and I discussed adding a built-in operation to support categorical merging on strings, which is the most likely use-case, over in scikit-hep/awkward#2853. For now, the error is probably OK in my view, but I defer to other people's judgement!

agoose77 avatar Dec 08 '23 13:12 agoose77

It's a reminder to implement it when it's needed, which may be soon.

jpivarski avatar Dec 08 '23 14:12 jpivarski