ARROW-17923: [C++] Consider dictionary arrays for special fragment fields
This PR modifies the __filename field in a dataset fragment to be a DictionaryScalar of string instead of a string field.
https://issues.apache.org/jira/browse/ARROW-17923
I'm probably missing something, but does this get turned into a dictionary array at some point? Returning dictionary scalars isn't really useful... @westonpace
Yes, it gets broadcast to an array before it is sent out. Although your point is a valid one, a dictionary scalar is probably a smell of some kind. I had not been thinking of it this way initially. Perhaps a more general fix would be that, when we broadcast a scalar, if the type is a binary data type, we could always broadcast it into a dictionary array.
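To make the broadcast idea concrete, here is a minimal sketch using only the standard library. `DictColumn`, `BroadcastToDictionary`, and `ValueAt` are hypothetical names for illustration, not Arrow APIs. The point is that a constant string broadcast over N rows only needs a one-entry dictionary and N zero indices, rather than N copies of the string.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded string column:
// row i holds dictionary[indices[i]].
struct DictColumn {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

// Broadcast a single string value to `length` rows. The value is stored
// once in the dictionary and every row points at entry 0.
DictColumn BroadcastToDictionary(const std::string& value, std::size_t length) {
  return DictColumn{{value}, std::vector<int32_t>(length, 0)};
}

// Decode one row back to its string value.
std::string ValueAt(const DictColumn& col, std::size_t row) {
  return col.dictionary[static_cast<std::size_t>(col.indices[row])];
}
```

With this layout the per-row cost is one 32-bit index regardless of how long the filename is.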
I think it might be worth an attempt to play around with this idea a bit. I suspect we might run into problems with columns that may or may not be dictionary. For example, if a column happens to be a partition column, we can represent it as a scalar. However, we don't necessarily know if a column is a partition column or not when we are constructing the plan, and we might bind kernels thinking the type is a normal type and then suddenly get a dictionary array.
So __filename is a little bit special in that it is the only field that we can easily know will always be a scalar.
> Perhaps a more general fix would be that, when we broadcast a scalar, if the type is a binary data type, we could always broadcast it into a dictionary array.
I worry that doing this might mean the results of ExecBatch::ToRecordBatch would return a batch with an unexpected schema, if we transform it like that.
Looking at the R test failures, it seems like this change is blocked because hash join and aggregate don't support unifying dictionaries. Thinking about it, having to do that might be expensive, since that requires re-mapping indices each time.
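As a rough illustration of why unification has a per-batch cost, the sketch below merges batches that each carry their own dictionary. `DictBatch` and `DictUnifier` are hypothetical names and this is not how Arrow implements it; it only shows the index re-mapping step. Every incoming batch needs a transpose table built from its dictionary, and then every index in the batch has to be rewritten.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary-encoded batch: row i holds dictionary[indices[i]].
struct DictBatch {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

// Accumulates one unified dictionary across batches.
class DictUnifier {
 public:
  // Returns the indices of `batch` rewritten against the unified dictionary.
  std::vector<int32_t> Remap(const DictBatch& batch) {
    // Transpose table: position in the batch dictionary -> position in the
    // unified dictionary. New values are appended to the unified dictionary.
    std::vector<int32_t> transpose;
    transpose.reserve(batch.dictionary.size());
    for (const std::string& value : batch.dictionary) {
      auto it = positions_.find(value);
      if (it == positions_.end()) {
        it = positions_.emplace(value, static_cast<int32_t>(unified_.size())).first;
        unified_.push_back(value);
      }
      transpose.push_back(it->second);
    }
    // This pass touches every row, which is the cost paid on each batch.
    std::vector<int32_t> remapped;
    remapped.reserve(batch.indices.size());
    for (int32_t index : batch.indices) {
      remapped.push_back(transpose[static_cast<std::size_t>(index)]);
    }
    return remapped;
  }

  const std::vector<std::string>& unified() const { return unified_; }

 private:
  std::unordered_map<std::string, int32_t> positions_;
  std::vector<std::string> unified_;
};
```

For __filename, each fragment would arrive with a different one-entry dictionary, so every batch would go through this rewrite.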
Instead, perhaps in MakeScanNode we could pre-compute the dictionary buffer and pass that to the generator that processes each fragment. @sanjibansg do you want to try that?
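A rough sketch of that suggestion, with the caveat that it assumes the full list of fragment paths is known before scanning starts, and that `FilenameColumn`, `BuildFilenameDictionary`, and `MakeFilenameColumn` are hypothetical names rather than anything in the Arrow codebase. The dictionary is built once and shared, and each fragment only fills in its own constant index, so downstream nodes would see the same dictionary on every batch and would have nothing to unify.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using FilenameDictionary = std::shared_ptr<const std::vector<std::string>>;

// Hypothetical __filename column: all batches point at the same dictionary.
struct FilenameColumn {
  FilenameDictionary dictionary;
  std::vector<int32_t> indices;
};

// Built once up front (for example, where the scan node is created).
// Assumes fragment paths are already known and unique.
FilenameDictionary BuildFilenameDictionary(
    const std::vector<std::string>& fragment_paths) {
  return std::make_shared<std::vector<std::string>>(fragment_paths);
}

// Called per batch: the fragment's position in the shared dictionary is
// repeated for every row, and the dictionary itself is not copied.
FilenameColumn MakeFilenameColumn(FilenameDictionary dictionary,
                                  int32_t fragment_index,
                                  std::size_t num_rows) {
  return FilenameColumn{std::move(dictionary),
                        std::vector<int32_t>(num_rows, fragment_index)};
}
```

The trade-off is that the dictionary holds every fragment path even if only a few fragments end up producing rows.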
Closing because it has been untouched for a while. If it's still relevant, feel free to reopen it and move it forward 👍