[BUG] Incorrect statistics for int96 timestamps in parquet
Describe the bug
The statistics calculation is not working properly when dealing with int96 timestamp types. Adding statistics checks to `test_parquet_writer_int96_timestamps` will cause pytest failures.
Steps/Code to reproduce bug
At the end of `test_parquet_writer_int96_timestamps`, add the following code:
```python
# Read back from pyarrow
pq_file = pq.ParquetFile(gdf_fname)
# verify each row group's statistics
for rg in range(0, pq_file.num_row_groups):
    pd_slice = pq_file.read_row_group(rg).to_pandas()
    # statistics are per-column. So need to verify independently
    for i, col in enumerate(pd_slice):
        stats = pq_file.metadata.row_group(rg).column(i).statistics
        actual_min = cudf.Series(pd_slice[col].explode().explode()).min()
        stats_min = stats.min
        assert normalized_equals(actual_min, stats_min)
        actual_max = cudf.Series(pd_slice[col].explode().explode()).max()
        stats_max = stats.max
        assert normalized_equals(actual_max, stats_max)
```
The statistics test will fail.
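(`normalized_equals` above is the comparison helper already defined in test_parquet.py; it is not reproduced in this issue. A minimal hypothetical stand-in, just enough to make the snippet self-contained, might look like this:)

```python
import datetime

import pandas as pd


def normalized_equals(value1, value2):
    # Hypothetical stand-in for the real helper in test_parquet.py:
    # coerce both sides to plain Python objects before comparing, so that
    # e.g. a pandas Timestamp compares equal to a datetime.datetime.
    def normalize(value):
        if isinstance(value, pd.Timestamp):
            return value.to_pydatetime()
        return value

    return normalize(value1) == normalize(value2)
```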
Expected behavior
The test should pass.
Additional context
Not sure if the issue is in libcudf or in the python layer. Adding C++ unit tests for parquet statistics could help track down the issue.
This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.
This issue has been labeled `inactive-90d` due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
still relevant
There is a slight issue in the above test code which might have been the culprit behind this. Nonetheless, PR #15901 deprecates INT96 timestamps altogether, indirectly solving this.
Issue TLDR: `stats = pq_file.metadata.row_group(rg).column(i).statistics` fetches the stats for the i-th physical column (the 0th column is the index column), whereas `for i, col in enumerate(pd_slice)` enumerates from the first data column after the index column, so each statistics object is compared against the wrong column.
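One way to make the mapping explicit is to match metadata column chunks to DataFrame columns by name rather than by position, using pyarrow's `path_in_schema`. A minimal sketch (the file name and DataFrame here are hypothetical, for illustration only):

```python
import pandas as pd
import pyarrow.parquet as pq

# Hypothetical example: force the index to be serialized as a physical
# column, mirroring the file layout the test above runs into.
pd.DataFrame({"a": [1, 2, 3]}).to_parquet("demo.parquet", index=True)

pq_file = pq.ParquetFile("demo.parquet")
rg_meta = pq_file.metadata.row_group(0)
for i in range(rg_meta.num_columns):
    # path_in_schema shows which physical column chunk holds the index
    # (e.g. "__index_level_0__"), revealing the offset of the data columns.
    print(i, rg_meta.column(i).path_in_schema)
```

Looking up statistics by `path_in_schema` rather than by a positional offset would also keep the check correct whether or not the index is written as a physical column.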
Proposed fix:
```python
# Read back from pyarrow
pq_file = pq.ParquetFile(gdf_fname)
# verify each row group's statistics
for rg in range(0, pq_file.num_row_groups):
    pd_slice = pq_file.read_row_group(rg).to_pandas()
    # statistics are per-column. So need to verify independently
    for i, col in enumerate(pd_slice):
        # offset by 1 to skip the index column's column chunk
        stats = pq_file.metadata.row_group(rg).column(i + 1).statistics
        actual_min = cudf.Series(pd_slice[col].explode().explode()).min()
        stats_min = stats.min
        assert normalized_equals(actual_min, stats_min)
        actual_max = cudf.Series(pd_slice[col].explode().explode()).max()
        stats_max = stats.max
        assert normalized_equals(actual_max, stats_max)
```
Update from investigation
Pyarrow does not read or write stats for int96 columns, even when the parquet files were written by pyarrow itself (see the demo code below).
I have locally verified that cuDF computes correct column `Statistics` (`min_value`, `max_value`, `has_nulls`, `null_count`, ...) in `write_parquet` for int96 columns as well, and writes them. I have also verified that the stats were written by hacking into the `read_parquet_metadata` API and reading the `Statistics` object.
For now (not too important) and if needed, we can just enable computing and writing int96 statistics in the libcudf code so that cuDF and other readers can read them.
Demo code
```python
# Modified test case from `test_parquet_writer_statistics` from test_parquet.py
@pytest.mark.parametrize("int96_timestamps", [True, False])
def test_parquet_writer_statistics(tmpdir, pdf, int96_timestamps):
    file_path = tmpdir.join("cudf.parquet")
    gdf = cudf.from_pandas(pdf)
    # select as a one-column DataFrame: a Series' to_arrow() returns an
    # Array, which pq.write_table below cannot consume
    gdf = gdf[["col_datetime64[ms]"]].reset_index(drop=True)[0:5].astype("datetime64[ms]")

    #
    # Use either pyarrow or cudf to write parquet with int96_timestamps for test purposes
    #
    # Use pyarrow
    pa_table = gdf.to_arrow()
    pq.write_table(pa_table, file_path, use_deprecated_int96_timestamps=int96_timestamps)
    # Use cudf (uncomment)
    # gdf.to_parquet(file_path, index=False, int96_timestamps=int96_timestamps)

    # Read back from pyarrow
    pq_file = pq.ParquetFile(file_path)
    # verify each row group's statistics
    for rg in range(0, pq_file.num_row_groups):
        pd_slice = pq_file.read_row_group(rg).to_pandas()
        # statistics are per-column. So need to verify independently
        for i, col in enumerate(pd_slice):
            stats = pq_file.metadata.row_group(rg).column(i).statistics
            if stats is not None:
                actual_min = pd_slice[col].min()
                stats_min = stats.min
                assert normalized_equals(actual_min, stats_min)
                actual_max = pd_slice[col].max()
                stats_max = stats.max
                assert normalized_equals(actual_max, stats_max)
                assert stats.null_count == pd_slice[col].isna().sum()
                assert stats.num_values == pd_slice[col].count()
            else:
                # Same print with either PQ writer for timestamp columns
                print(col, "does not have stats")
```
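A quick way to confirm that the missing statistics are a metadata-level behavior, rather than an artifact of the pandas round trip, is pyarrow's `is_stats_set` flag on the column-chunk metadata. A minimal sketch, assuming `"cudf.parquet"` is a file written by the test above:

```python
import pyarrow.parquet as pq

# Hypothetical path: a file produced by the parametrized test above.
pq_file = pq.ParquetFile("cudf.parquet")
col_meta = pq_file.metadata.row_group(0).column(0)
# Per the finding above, pyarrow reports no statistics for INT96 timestamp
# columns regardless of which writer produced the file: .statistics is None.
print(col_meta.path_in_schema, col_meta.is_stats_set, col_meta.statistics)
```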
INT96 has been long deprecated, so I don't see much use in trying to get statistics to work with them. In fact, the parquet thrift file specifically says the sort order is undefined for INT96, so statistics are unusable anyway. The correct behavior IMO is to simply not write statistics for INT96.
Makes sense. We can close this issue then.
Sounds good. Closing both the PR and the issue!