[BUG] ORC string sum statistics are wrong
When reading the statistics of an ORC file written by cuDF, the sum statistic of a string column is wrong when read with cuDF and absent when read with pyorc.
```
In [1]: import cudf
In [2]: import pyorc
In [3]: gdf = cudf.DataFrame({'b':[1,7], 'a':['Badam khao', 'roz']})
In [4]: gdf.to_orc("temp.orc")
In [5]: cudf.io.orc.read_orc_statistics(["temp.orc"])
Out[5]:
([{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}],
 [{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}])
In [6]: f = open("temp.orc", 'rb')
In [7]: r = pyorc.Reader(f)
In [8]: r[1].statistics
Out[8]:
{'has_null': False,
 'number_of_values': 2,
 'minimum': 1,
 'maximum': 7,
 'sum': 8,
 'kind': <TypeKind.LONG: 4>}
In [9]: r[2].statistics
Out[9]: {'has_null': False, 'number_of_values': 2, 'kind': <TypeKind.STRING: 7>}
```
Expected result
The sum statistic of a string column is the sum of the lengths of all strings in the column. libcudf computes this value correctly, so it should be present when reading with pyorc and correct when reading with cuDF.
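For reference, here is a minimal sketch of the value the repro should report for column 'a'. It assumes the lengths are counted in UTF-8 bytes (for these ASCII strings that equals the character count). The helper `zigzag_decode` is hypothetical and is included only to illustrate an arithmetic coincidence: the reported -7 is what 13 becomes under protobuf zigzag (sint64) decoding. Whether an encoding mismatch is the actual cause is not confirmed in this issue.

```python
# Strings from the repro above.
strings = ['Badam khao', 'roz']

# Expected sum statistic: total length of all strings.
# Assumption: lengths are measured in UTF-8 bytes.
expected_sum = sum(len(s.encode('utf-8')) for s in strings)
print(expected_sum)

def zigzag_decode(n: int) -> int:
    """Decode a protobuf zigzag-encoded integer (hypothetical helper)."""
    return (n >> 1) ^ -(n & 1)

# Observation only, not a confirmed diagnosis: decoding the plain
# value 13 as if it were zigzag-encoded yields the -7 seen in Out[5].
print(zigzag_decode(expected_sum))
```

If that hypothesis holds, the expected value is being written or read with a different integer encoding than the other side assumes, which would need to be checked against the writer and reader code.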
This is a correctness issue, prioritizing for 22.10.
Additional info related to the pyorc part of the issue: Spark is able to read the ORC string column statistics and uses them for predicate-based filtering.
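To illustrate why readable string statistics matter for predicate-based filtering, here is a rough sketch of min/max pruning. This is not Spark's implementation; `StringStats`, `may_contain`, and the second stripe's values are hypothetical, and Python's code-point string ordering stands in for whatever ordering a real reader uses.

```python
from dataclasses import dataclass

@dataclass
class StringStats:
    """Hypothetical per-stripe statistics for a string column."""
    minimum: str
    maximum: str

def may_contain(stats: StringStats, value: str) -> bool:
    # A stripe can be skipped for an equality predicate when the
    # value falls outside the [minimum, maximum] range.
    return stats.minimum <= value <= stats.maximum

# First entry mirrors the repro; the second is made up for illustration.
stripes = [StringStats('Badam khao', 'roz'), StringStats('sam', 'zed')]

# Indices of stripes that still have to be read for the predicate a == 'tom'.
candidates = [i for i, s in enumerate(stripes) if may_contain(s, 'tom')]
print(candidates)
```

A reader that cannot parse the string statistics has no range to test and would have to scan every stripe.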