arrow
arrow copied to clipboard
ARROW-17524: [C++] Correction for fields included when reading an ORC table
I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options. It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.
The definitions of the corresponding ORC methods are here : https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191
and https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
@LouisClt Thanks for the report and attempt at fixing! Could you open a JIRA ticket as described above?
@pitrou Yes I can. Here is the new Jira https://issues.apache.org/jira/browse/ARROW-17524
https://issues.apache.org/jira/browse/ARROW-17524
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
C Glib tests do not pass because it test the old behaviour. Looking a bit more in detail at the difference between "include" and "includeTypes", it appears that "includeTypes" is based on indices of the tables types. The root of the tree (meaning the whole table) is type of index 0, and when there are complex types such as structs would give other "child" types. See https://orc.apache.org/docs/types.html for more information. "include" select the fields in the same way than the other imports : each table has a certain number of fields, (we do not take into account the "children" types), if we want to select the first field, we put "0" in the list.
I think the 2 behaviours could be understandable but the one with "include" is more coherent with the other imports. Furthermore, I do not know if there is a way in Arrow to get the list of internal ORC types, which makes selecting fields with "includeTypes" much more unreliable.
So I changed the tests to reflect the new behaviour of selecting the fields. The job failure that remains seems to be unrelated. There is a decision to be made whether or not the intended behaviour of SelectFields is the previous one or this one.
@LouisClt Feel free to say when this is ready for another review.
Yes, this should be good now. I added a test in the ORC adapter concerning the selection of fields. Some tests did not pass, but I don't think it is related.
CI failures are unrelated.
Good to know, thanks @pitrou