
[BUG] Incorrect statistics for int96 timestamps in parquet

Open PointKernel opened this issue 2 years ago • 4 comments

Describe the bug The statistics calculation does not work properly for int96 timestamp types. Adding statistics checks to test_parquet_writer_int96_timestamps causes pytest failures.

Steps/Code to reproduce bug At the end of test_parquet_writer_int96_timestamps, add the following code:

    # Read back from pyarrow
    pq_file = pq.ParquetFile(gdf_fname)
    # verify each row group's statistics
    for rg in range(0, pq_file.num_row_groups):
        pd_slice = pq_file.read_row_group(rg).to_pandas()

        # statistics are per-column. So need to verify independently
        for i, col in enumerate(pd_slice):
            stats = pq_file.metadata.row_group(rg).column(i).statistics

            actual_min = cudf.Series(pd_slice[col].explode().explode()).min()
            stats_min = stats.min
            assert normalized_equals(actual_min, stats_min)

            actual_max = cudf.Series(pd_slice[col].explode().explode()).max()
            stats_max = stats.max
            assert normalized_equals(actual_max, stats_max)

The statistics test will fail.

Expected behavior The test should pass.

Additional context Not sure if the issue is in libcudf or in the python layer. Adding cpp unit tests for parquet statistics can help track down the issue.

PointKernel avatar Mar 15 '22 17:03 PointKernel

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Apr 14 '22 18:04 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Jul 13 '22 19:07 github-actions[bot]

still relevant

PointKernel avatar Jul 13 '22 19:07 PointKernel

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Oct 11 '22 19:10 github-actions[bot]

There is a slight issue in the above test code which might have been the culprit here. Nonetheless, PR #15901 deprecates INT96 timestamps altogether, indirectly solving this.

Issue TLDR: stats = pq_file.metadata.row_group(rg).column(i).statistics fetches the stats for the i-th physical column in the file (the 0th column is the index column), whereas for i, col in enumerate(pd_slice) enumerates only the data columns after the index column, so the comparisons are off by one.

Proposed fix:

    # Read back from pyarrow
    pq_file = pq.ParquetFile(gdf_fname)
    # verify each row group's statistics
    for rg in range(0, pq_file.num_row_groups):
        pd_slice = pq_file.read_row_group(rg).to_pandas()
        # statistics are per-column. So need to verify independently
        for i, col in enumerate(pd_slice):
            stats = pq_file.metadata.row_group(rg).column(i + 1).statistics
            actual_min = cudf.Series(pd_slice[col].explode().explode()).min()
            stats_min = stats.min
            assert normalized_equals(actual_min, stats_min)
            actual_max = cudf.Series(pd_slice[col].explode().explode()).max()
            stats_max = stats.max
            assert normalized_equals(actual_max, stats_max)

mhaseeb123 avatar Jun 01 '24 01:06 mhaseeb123

Update from investigation

PyArrow does not read or write stats for int96 columns, even for Parquet files written by PyArrow itself. (See the demo code below.)

I have locally verified that cuDF computes correct column Statistics (min_value, max_value, has_nulls, null_count, ...) in write_parquet for int96 columns as well, and writes them. I also verified that the stats were written by hacking into the read_parquet_metadata API and reading the Statistics object.

If needed (this is not urgent), we can enable computing and writing int96 statistics in libcudf so that cuDF and other readers can consume them.

Demo code

# Modified test case from `test_parquet_writer_statistics` from test_parquet.py

@pytest.mark.parametrize("int96_timestamps", [True, False])
def test_parquet_writer_statistics(tmpdir, pdf, int96_timestamps):
    file_path = tmpdir.join("cudf.parquet")
    gdf = cudf.from_pandas(pdf)
    gdf = gdf["col_datetime64[ms]"].reset_index(drop=True)[0:5].astype("datetime64[ms]")

    #
    # Use either pyarrow or cudf to write parquet with int96_timestamps for test purposes
    #

    # Use pyarrow
    pa_table = gdf.to_arrow()
    pq.write_table(pa_table, file_path, use_deprecated_int96_timestamps=int96_timestamps)

    # Use cudf (uncomment)
    # gdf.to_parquet(file_path, index=False, int96_timestamps=int96_timestamps)

    # Read back from pyarrow
    pq_file = pq.ParquetFile(file_path)
    # verify each row group's statistics
    for rg in range(0, pq_file.num_row_groups):
        pd_slice = pq_file.read_row_group(rg).to_pandas()

        # statistics are per-column. So need to verify independently
        for i, col in enumerate(pd_slice):
            stats = pq_file.metadata.row_group(rg).column(i).statistics

            if stats is not None:
                actual_min = pd_slice[col].min()
                stats_min = stats.min
                assert normalized_equals(actual_min, stats_min)

                actual_max = pd_slice[col].max()
                stats_max = stats.max
                assert normalized_equals(actual_max, stats_max)

                assert stats.null_count == pd_slice[col].isna().sum()
                assert stats.num_values == pd_slice[col].count()

            else:
                # Same print with either PQ writer for timestamp columns
                print(col, "does not have stats")

mhaseeb123 avatar Jun 06 '24 04:06 mhaseeb123

INT96 has been long deprecated, so I don't see much use in trying to get statistics to work with them. In fact, the parquet thrift file specifically says the sort order is undefined for INT96, so statistics are unusable anyway. The correct behavior IMO is to simply not write statistics for INT96.
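For background on the format being deprecated here (a pure-Python sketch, not part of the thread): an INT96 timestamp packs 8 little-endian bytes of nanoseconds-within-day followed by a 4-byte Julian day number. Decoding one value looks roughly like this:

```python
import datetime

# Julian Day Number of the Unix epoch (1970-01-01)
JULIAN_EPOCH = 2440588

def int96_to_datetime(raw: bytes) -> datetime.datetime:
    """Decode a 12-byte Parquet INT96 timestamp (little-endian):
    bytes 0-7 are nanoseconds within the day, bytes 8-11 the Julian day."""
    nanos = int.from_bytes(raw[:8], "little")
    julian_day = int.from_bytes(raw[8:], "little")
    days = julian_day - JULIAN_EPOCH
    return (datetime.datetime(1970, 1, 1)
            + datetime.timedelta(days=days, microseconds=nanos // 1000))

# Example: Julian day 2440589 (1970-01-02), one second into the day
raw = (10**9).to_bytes(8, "little") + (2440589).to_bytes(4, "little")
print(int96_to_datetime(raw))  # 1970-01-02 00:00:01
```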

etseidl avatar Jun 06 '24 16:06 etseidl

> INT96 has been long deprecated, so I don't see much use in trying to get statistics to work with them. In fact, the parquet thrift file specifically says the sort order is undefined for INT96, so statistics are unusable anyway. The correct behavior IMO is to simply not write statistics for INT96.

Makes sense. We can close this issue, then.

PointKernel avatar Jun 06 '24 16:06 PointKernel

Sounds good. Closing both the PR and the issue!

mhaseeb123 avatar Jun 06 '24 18:06 mhaseeb123