arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17524: [C++] Correction for fields included when reading an ORC table

Open LouisClt opened this issue 2 years ago • 9 comments

I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options. It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.

The definitions of the corresponding ORC methods are here : https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191

and https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207

LouisClt avatar Aug 24 '22 14:08 LouisClt

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Aug 24 '22 16:08 github-actions[bot]

@LouisClt Thanks for the report and attempt at fixing! Could you open a JIRA ticket as described above?

pitrou avatar Aug 25 '22 07:08 pitrou

@pitrou Yes I can. Here is the new Jira https://issues.apache.org/jira/browse/ARROW-17524

LouisClt avatar Aug 25 '22 13:08 LouisClt

https://issues.apache.org/jira/browse/ARROW-17524

github-actions[bot] avatar Aug 25 '22 19:08 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Aug 25 '22 19:08 github-actions[bot]

C Glib tests do not pass because it test the old behaviour. Looking a bit more in detail at the difference between "include" and "includeTypes", it appears that "includeTypes" is based on indices of the tables types. The root of the tree (meaning the whole table) is type of index 0, and when there are complex types such as structs would give other "child" types. See https://orc.apache.org/docs/types.html for more information. "include" select the fields in the same way than the other imports : each table has a certain number of fields, (we do not take into account the "children" types), if we want to select the first field, we put "0" in the list.

I think the 2 behaviours could be understandable but the one with "include" is more coherent with the other imports. Furthermore, I do not know if there is a way in Arrow to get the list of internal ORC types, which makes selecting fields with "includeTypes" much more unreliable.

LouisClt avatar Sep 23 '22 14:09 LouisClt

So I changed the tests to reflect the new behaviour of selecting the fields. The job failure that remains seems to be unrelated. There is a decision to be made whether or not the intended behaviour of SelectFields is the previous one or this one.

LouisClt avatar Sep 27 '22 07:09 LouisClt

@LouisClt Feel free to say when this is ready for another review.

pitrou avatar Oct 05 '22 13:10 pitrou

Yes, this should be good now. I added a test in the ORC adapter concerning the selection of fields. Some tests did not pass, but I don't think it is related.

LouisClt avatar Oct 06 '22 07:10 LouisClt

CI failures are unrelated.

pitrou avatar Oct 18 '22 20:10 pitrou

Good to know, thanks @pitrou

LouisClt avatar Oct 18 '22 20:10 LouisClt