datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

chore: Log warning for tests that are not really testing dictionary-encoded Parquet files

Open andygrove opened this issue 1 year ago • 0 comments

Which issue does this PR close?

N/A

Rationale for this change

Many tests use the following pattern to cover testing with dictionary-encoded data.

    Seq("true", "false").foreach { dictionary =>
        withParquetTable(
          (-5 until 5).map(i => (i.toDouble + 0.3, i.toDouble + 0.8)),
          "tbl",
          withDictionary = dictionary.toBoolean) {

However, in many cases, including this example, the data does not contain repeated values and therefore does not cause any dictionary-encoded arrays to be created.

What changes are included in this PR?

Log a warning when withParquetTable is called with withDictionary=true and when the parquet file does not actually contain dictionary-encoded data.

I was originally going to add an assertion instead of a log message but it was too disruptive.

Example output from running tests:

- test multiple pages with mixed PLAIN_DICTIONARY and PLAIN encoding (prefetch enabled) (709 milliseconds)
- test multiple pages with mixed PLAIN_DICTIONARY and PLAIN encoding (524 milliseconds)
WARN: withParquetFile was called with withDictionary=true but did not write any dictionary-encoded columns
- skip vector re-loading (prefetch enabled) (669 milliseconds)
WARN: withParquetFile was called with withDictionary=true but did not write any dictionary-encoded columns

How are these changes tested?

No functional change to test. Manually tested that logging appears.

andygrove avatar Aug 01 '24 16:08 andygrove