delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

feat: add stats to convert-to-delta operation

Open gruuya opened this issue 9 months ago • 2 comments

Description

Collect stats during conversion of a parquet dir to a Delta table and add to the actions.

Related Issue(s)

Closes #2490

Documentation

gruuya avatar May 08 '24 15:05 gruuya

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

github-actions[bot] avatar May 08 '24 15:05 github-actions[bot]

I'm not sure why the parse decimal overflow appears in the Python tests.

It's indicative that I see the decimal column min/max ends up as Value(Number(2560.0))/Value(Number(3584.0)) even though the pyarrrow seems to be properly persisting the stats to the file

% parquet meta python/tests/part-0.parquet

File path:  python/tests/part-0.parquet
Created by: parquet-cpp-arrow version 15.0.2
Properties:
  ARROW:schema: /////5ADAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAA4AAAAwAwAA5AIAALACAAB8AgAASAIAABACAADgAQAAtAEAAIgBAABMAQAAHAEAAOgAAABkAAAABAAAABj9//8AAAEMFAAAABwAAAAEAAAAAQAAABQAAAAEAAAAbGlzdAAAAAAM/f//RP3//wAAAQIQAAAAGAAAAAQAAAAAAAAABAAAAGl0ZW0AAAAAfP3//wAAAAFAAAAAdP3//wAAAQ0YAAAAIAAAAAQAAAACAAAAPAAAABQAAAAGAAAAc3RydWN0AABs/f//pP3//wAAAQUQAAAAFAAAAAQAAAAAAAAAAQAAAHkAAACQ/f//yP3//wAAAQIQAAAAFAAAAAQAAAAAAAAAAQAAAHgAAAD8/f//AAAAAUAAAAD0/f//AAABChAAAAAcAAAABAAAAAAAAAAJAAAAdGltZXN0YW1wAAAA8v7//wAAAgAk/v//AAABCBAAAAAYAAAABAAAAAAAAAAGAAAAZGF0ZTMyAAAe////AAAAAFD+//8AAAEHEAAAACAAAAAEAAAAAAAAAAcAAABkZWNpbWFsAAgADAAEAAgACAAAAAUAAAADAAAAiP7//wAAAQQQAAAAGAAAAAQAAAAAAAAABgAAAGJpbmFyeQAAeP7//7D+//8AAAEGEAAAABgAAAAEAAAAAAAAAAQAAABib29sAAAAAKD+///Y/v//AAABAxAAAAAYAAAABAAAAAAAAAAHAAAAZmxvYXQ2NADS////AAACAAT///8AAAEDEAAAACAAAAAEAAAAAAAAAAcAAABmbG9hdDMyAAAABgAIAAYABgAAAAAAAQA4////AAABAhAAAAAYAAAABAAAAAAAAAAEAAAAaW50OAAAAABw////AAAAAQgAAABo////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MTYAAACg////AAAAARAAAACY////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MzIAAADQ////AAAAASAAAADI////AAABAhAAAAAgAAAABAAAAAAAAAAFAAAAaW50NjQAAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAEAAAAdXRmOAAAAAAEAAQABAAAAAAAAAA=
Schema:
message schema {
  optional binary utf8 (STRING);
...
  optional fixed_len_byte_array(3) decimal (DECIMAL(5,3));
...


Row group 0:  count: 5  288.20 B records  start: 4  total(compressed): 1.407 kB total(uncompressed):1.408 kB
--------------------------------------------------------------------------------
                   type      encodings count     avg size   nulls   min / max
utf8               BINARY    S _ R     5         16.00 B    0       "0" / "4"
...
decimal            FIXED[3] S _ R     5         17.00 B  0       "10.000" / "14.000"
...

EDIT: I think this was caused by an unidentified bug in the statistics parser that is resolved by 5972aab07723fe11243c017f1938b96b70d45810

gruuya avatar May 08 '24 21:05 gruuya

Hi @roeap, @ion-elgreco, I think this is ready for review. There's both a feat added (convert stats) as well as a bugfix (decimal stats parsing from byte arrays) now.

Can you provide some feedback on it? Thanks!

gruuya avatar May 14 '24 07:05 gruuya

Thankss

ion-elgreco avatar May 15 '24 05:05 ion-elgreco

@gruuya can you rebase so that we can merge

ion-elgreco avatar May 15 '24 06:05 ion-elgreco

@ion-elgreco weirdly a couple of tests seem to be stuck, and have been running close to 2h hours now (test_concurrency_local, test_integration_local). I presume this is a CI problem of some sort (the previous runs were ok), can you restart the job please?

gruuya avatar May 15 '24 09:05 gruuya

@gruuya yeah it's a flaky test, re-executed it

ion-elgreco avatar May 15 '24 09:05 ion-elgreco