
[BUG] ORC string sum statistics are wrong

Open devavret opened this issue 4 years ago • 3 comments

When reading the statistics of an ORC file written by cuDF, the sum statistic of a string column is wrong when read with cuDF and absent when read with pyorc.

In [1]: import cudf

In [2]: import pyorc

In [3]: gdf = cudf.DataFrame({'b':[1,7], 'a':['Badam khao', 'roz']})

In [4]: gdf.to_orc("temp.orc")

In [5]: cudf.io.orc.read_orc_statistics(["temp.orc"])
Out[5]: 
([{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}],
 [{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}])

In [6]: f = open("temp.orc", 'rb')

In [7]: r = pyorc.Reader(f)

In [8]: r[1].statistics
Out[8]: 
{'has_null': False,
 'number_of_values': 2,
 'minimum': 1,
 'maximum': 7,
 'sum': 8,
 'kind': <TypeKind.LONG: 4>}

In [9]: r[2].statistics
Out[9]: {'has_null': False, 'number_of_values': 2, 'kind': <TypeKind.STRING: 7>}

Expected result

The sum statistic of a string column should contain the sum of the lengths of all the strings in the column. libcudf already computes this value correctly, so it should be present when reading with pyorc and correct when reading with cudf.
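For reference, here is a minimal sketch of the value the reproducer above should report. The helper names `expected_string_sum` and `zigzag_decode` are made up for illustration and are not part of cuDF or pyorc. The sketch assumes lengths are counted in UTF-8 bytes, which for the ASCII strings here equals the character count. The second helper only records an arithmetic observation: reading the expected sum as a zigzag-encoded integer yields the reported -7. That may hint at a signed-integer encoding mismatch, but it is not a confirmed diagnosis.

```python
def expected_string_sum(values):
    """Sum of the string lengths in a column, skipping nulls.

    Assumption: lengths are counted in UTF-8 bytes.
    """
    return sum(len(v.encode("utf-8")) for v in values if v is not None)


def zigzag_decode(n):
    """Decode a zigzag-encoded non-negative integer into a signed integer."""
    return (n >> 1) ^ -(n & 1)


# Column 'a' from the reproducer above.
column_a = ["Badam khao", "roz"]

expected = expected_string_sum(column_a)
print(expected)                 # 13 (10 + 3)
print(zigzag_decode(expected))  # -7, the value reported in the issue
```

With this data the expected `sum` for column `a` is 13, so both the -7 returned by cuDF and the missing field in pyorc are incorrect.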

devavret avatar Sep 27 '21 11:09 devavret

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Nov 15 '21 21:11 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Feb 13 '22 22:02 github-actions[bot]

This is a correctness issue, prioritizing for 22.10.

vuule avatar Aug 02 '22 20:08 vuule

Additional info related to the pyorc part of the issue: Spark is able to read ORC string column statistics, and uses them for predicate-based filtering.
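To illustrate why readable string statistics matter, here is a rough sketch of how a reader could use minimum/maximum statistics to skip data for an equality predicate. This is not Spark's implementation. The function `may_contain` is hypothetical, and the dictionary simply mirrors the shape of the statistics shown earlier in this issue. It assumes Python's string ordering matches the ordering used for the statistics, which holds for the ASCII values here.

```python
def may_contain(stats, value):
    """Return False only when the statistics prove `value` is absent."""
    lo = stats.get("minimum")
    hi = stats.get("maximum")
    if lo is None or hi is None:
        # Statistics are missing: nothing can be ruled out, so the data must be read.
        return True
    return lo <= value <= hi


# Statistics for column 'a', in the shape shown earlier in this issue.
stats_a = {"number_of_values": 2, "minimum": "Badam khao", "maximum": "roz"}

print(may_contain(stats_a, "zebra"))   # False: sorts after the maximum, can be skipped
print(may_contain(stats_a, "roz"))     # True: within range, must be read
print(may_contain({"number_of_values": 2}, "zebra"))  # True: no min/max available
```

The last call shows the cost of unreadable statistics: without minimum and maximum values the reader cannot prune anything and has to scan the data.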

vuule avatar Sep 26 '22 21:09 vuule