delta-rs
delta-rs copied to clipboard
feat: add stats to convert-to-delta operation
Description
Collect stats during conversion of a parquet dir to a Delta table and add to the actions.
Related Issue(s)
Closes #2490
Documentation
ACTION NEEDED
delta-rs follows the Conventional Commits specification for release automation.
The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
I'm not sure why the parse decimal overflow
appears in the Python tests.
It's indicative that I see the decimal column min/max ends up as Value(Number(2560.0))
/Value(Number(3584.0))
even though the pyarrrow seems to be properly persisting the stats to the file
% parquet meta python/tests/part-0.parquet
File path: python/tests/part-0.parquet
Created by: parquet-cpp-arrow version 15.0.2
Properties:
ARROW:schema: /////5ADAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAA4AAAAwAwAA5AIAALACAAB8AgAASAIAABACAADgAQAAtAEAAIgBAABMAQAAHAEAAOgAAABkAAAABAAAABj9//8AAAEMFAAAABwAAAAEAAAAAQAAABQAAAAEAAAAbGlzdAAAAAAM/f//RP3//wAAAQIQAAAAGAAAAAQAAAAAAAAABAAAAGl0ZW0AAAAAfP3//wAAAAFAAAAAdP3//wAAAQ0YAAAAIAAAAAQAAAACAAAAPAAAABQAAAAGAAAAc3RydWN0AABs/f//pP3//wAAAQUQAAAAFAAAAAQAAAAAAAAAAQAAAHkAAACQ/f//yP3//wAAAQIQAAAAFAAAAAQAAAAAAAAAAQAAAHgAAAD8/f//AAAAAUAAAAD0/f//AAABChAAAAAcAAAABAAAAAAAAAAJAAAAdGltZXN0YW1wAAAA8v7//wAAAgAk/v//AAABCBAAAAAYAAAABAAAAAAAAAAGAAAAZGF0ZTMyAAAe////AAAAAFD+//8AAAEHEAAAACAAAAAEAAAAAAAAAAcAAABkZWNpbWFsAAgADAAEAAgACAAAAAUAAAADAAAAiP7//wAAAQQQAAAAGAAAAAQAAAAAAAAABgAAAGJpbmFyeQAAeP7//7D+//8AAAEGEAAAABgAAAAEAAAAAAAAAAQAAABib29sAAAAAKD+///Y/v//AAABAxAAAAAYAAAABAAAAAAAAAAHAAAAZmxvYXQ2NADS////AAACAAT///8AAAEDEAAAACAAAAAEAAAAAAAAAAcAAABmbG9hdDMyAAAABgAIAAYABgAAAAAAAQA4////AAABAhAAAAAYAAAABAAAAAAAAAAEAAAAaW50OAAAAABw////AAAAAQgAAABo////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MTYAAACg////AAAAARAAAACY////AAABAhAAAAAYAAAABAAAAAAAAAAFAAAAaW50MzIAAADQ////AAAAASAAAADI////AAABAhAAAAAgAAAABAAAAAAAAAAFAAAAaW50NjQAAAAIAAwACAAHAAgAAAAAAAABQAAAABAAFAAIAAYABwAMAAAAEAAQAAAAAAABBRAAAAAcAAAABAAAAAAAAAAEAAAAdXRmOAAAAAAEAAQABAAAAAAAAAA=
Schema:
message schema {
optional binary utf8 (STRING);
...
optional fixed_len_byte_array(3) decimal (DECIMAL(5,3));
...
Row group 0: count: 5 288.20 B records start: 4 total(compressed): 1.407 kB total(uncompressed):1.408 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
utf8 BINARY S _ R 5 16.00 B 0 "0" / "4"
...
decimal FIXED[3] S _ R 5 17.00 B 0 "10.000" / "14.000"
...
EDIT: I think this was caused by an unidentified bug in the statistics parser that is resolved by 5972aab07723fe11243c017f1938b96b70d45810
Hi @roeap, @ion-elgreco, I think this is ready for review. There's both a feat added (convert stats) as well as a bugfix (decimal stats parsing from byte arrays) now.
Can you provide some feedback on it? Thanks!
Thankss
@gruuya can you rebase so that we can merge
@ion-elgreco weirdly a couple of tests seem to be stuck, and have been running close to 2h hours now (test_concurrency_local
, test_integration_local
). I presume this is a CI problem of some sort (the previous runs were ok), can you restart the job please?
@gruuya yeah it's a flaky test, re-executed it