dask-awkward
dask-awkward copied to clipboard
`_finalize_array` must call `ak.enforce_type` for concatenating categoricals
Test failure:
FAILED tests/test_str.py::test_to_categorical - NotImplementedError: merging categorical arrays is currently not implemented. Use `ak.enforce_type` to drop the categorical type and use general merging.
When .compute() is called on a categorical array collection there is a call to ak.concatenate which happens in dask_awkward.lib.core._finalize_array, we need to add a check so that we can call enforce_type when necessary
Or perhaps we require a dak.enforce_type call at the end of a task graph that converges to some array of categoricals. cc @agoose77 for your take!
When merging categoricals, we want to preserve the uniqueness of the categorical content. If all contents are identical, and therefore only the indices vary, we could define a trivial merge rule. If this is not the case, then we'd need to re-categorise the result.
Right now, Awkward does neither; 2.5.0 just ensures that we properly complain to the user. Users can drop the categorical-ness at the point of computation to ensure that we drop the categorical-ness, and then re-categorize with ak.str.to_categorical after computation.
@jpivarski and I discussed adding a built-in operation to support categorical merging on strings, which is the most likely use-case, over in scikit-hep/awkward#2853. For now, the error is probably OK in my view, but I defer to other people's judgement!
It's a reminder to implement it when it's needed, which may be soon.