arrow ARROW-17923: [C++] Consider dictionary arrays for special fragment fields

This PR modifies the __filename field in a dataset fragment to be a DictionaryScalar of string instead of a string field.

Oct 25 '22 04:10 sanjibansg

https://issues.apache.org/jira/browse/ARROW-17923

Oct 25 '22 04:10 github-actions[bot]

I'm probably missing something, but does this get turned into a dictionary array at some point? Returning dictionary scalars isn't really useful... @westonpace

Oct 31 '22 20:10 pitrou

Yes, it gets broadcast to an array before it is sent out. Although your point is a valid one, a dictionary scalar is probably a smell of some kind. I had not been thinking of it this way initially. Perhaps a more general fix would be that, when we broadcast a scalar, if the type is a binary data type, we could always broadcast it into a dictionary array.

I think it might be worth an attempt to play around with this idea a bit. I suspect we might run into problems with columns that may or may not be dictionary. For example, if a column happens to be a partition column, we can represent it as a scalar. However, we don't necessarily know if a column is a partition column or not when we are constructing the plan, and we might bind kernels thinking the type is a normal type and then suddenly get a dictionary array.

So __filename is a little bit special in that it is the only field that easily know will always be a scalar.

Oct 31 '22 23:10 westonpace

Perhaps a more general fix would be that, when we broadcast a scalar, if the type is a binary data type, we could always broadcast it into a dictionary array.

I worry that doing this might mean the results of ExecBatch::ToRecordBatch would return a batch with an unexpected schema, if we transform it like that.

Nov 15 '22 18:11 wjones127

Looking at the R test failures, it seems like this change is blocked because hash join and aggregate don't support unifying dictionaries. Thinking about it, having to do that might be expensive, since that requires re-mapping indices each time.

Instead, perhaps in MakeScanNode we could pre-compute the dictionary buffer and pass that to the generator that processes each fragment. @sanjibansg do you want to try that?

Nov 15 '22 18:11 wjones127

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

Mar 30 '23 17:03 amol-