AzureStorageExplorer icon indicating copy to clipboard operation
AzureStorageExplorer copied to clipboard

Incorrect preview of parquet files with decimals

Open ihenry opened this issue 1 year ago • 7 comments

Preflight Checklist

Storage Explorer Version

1.33.1

Regression From

No response

Architecture

x64

Storage Explorer Build Number

20240410.2

Platform

All

OS Version

Windows 11 & MacOS 14.5

Bug Description

Incorrect preview of parquet files with multiple decimal precision (5,3), (9,5) and (38,6).

Steps to Reproduce

Previewing a parquet file in Azure Storage Explorer containing columns defined with various decimal precision (5,3), (9,5) and Decimal (38,6) shows incorrect results. The file should be previewed as in DBeaver with 0 or 0.xxx as appropriate. DBeaver with DuckDB shows the following preview Screenshot 2024-05-24 at 12 58 58

DBeaver with DuckDB Metadata Screenshot 2024-05-24 at 12 52 55

Azure Storage Explorer 1.33.1 Screenshot 2024-05-24 at 12 52 32

Actual Experience

Expecting to see raw values, but we actually see {"type":"Buffer","data":[0,0,0,0]}

ihenry avatar May 24 '24 12:05 ihenry

@ihenry Can you share your Parquet file or a small sample file that we could test with?

craxal avatar May 25 '24 00:05 craxal

Thanks @craxal. I havew emailed a sample parquet file to the sehelp mailbox.

ihenry avatar May 27 '24 11:05 ihenry

Issue reproduced on our end.

Are all of your decimal values intentionally 0? Every buffer that's parsed seems to contain only zeroes.

It seems that the library we use does not currently support decimal values (see https://github.com/LibertyDSNP/parquetjs#list-of-supported-types--encodings). We might be able to work around this by parsing the buffer ourselves.

craxal avatar Jun 11 '24 22:06 craxal

Yes, that extract was from a system that contains sample data. It appears the default value is zero. I have seen the same behaviour with non-zero decimal values too, that was real data which is more difficult to share.

ihenry avatar Jun 12 '24 09:06 ihenry

@ihenry I'm having trouble producing a Parquet file with non-zero decimal values encoded as byte arrays. Can you either provide another sample file or guidance as to how you produced the sample you emailed earlier?

craxal avatar Sep 03 '24 23:09 craxal

@craxal, I have emailed a new sample file to the sehelp email address. I am using SAP Datasphere to extract the data from an SAP system and land that in Microsoft Azure Data Lake Storage Gen2 target.

ihenry avatar Sep 05 '24 08:09 ihenry

Thank you very much.

Unfortunately, it does not look like parsing a buffer array into a displayable decimal ourselves is as trivial as I had hoped. If each element in the array had represented a base-10 digit, this would have been pretty straightforward, but it doesn't appear to be the case. Attempting to decode these values correctly makes me nervous and seems wasteful when the library already has support lined up (LibertyDSNP/parquetjs#91).

We'll keep this work item open for tracking, but we'll need to wait for library support.

craxal avatar Sep 09 '24 21:09 craxal

@ihenry I took another look at possibly parsing byte array encoded decimal values in Storage Explorer, and I'm feeling more confident about being able to do it correctly for preview. I'd like to double-check some things with you to make sure we do this correctly.

According to the spec, decimals are stored as two's compliment integers in big endian order, and the scale value from the schema tells us how many digits go to the right of the decimal.

I'll take the first row of data from the sample you sent as an example. The TS_DEC property has a scale of 12, so we can convert it to decimal notation as follows:

Form Value
Byte array [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 35, 134, 153, 7, 146, 0]
Hexadecimal 0x00000000000000000000238699079200
Two's Compliment 0x00000000000000000000238699079200
Decimal 39061000000000
Scale 39.061000000000
Form Value
Byte array [128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 35, 134, 153, 7, 146, 0]
Hexadecimal 0x80000000000000000000238699079200
Two's Compliment 0x7FFFFFFFFFFFFFFFD39CC7996F7E0000
Decimal -170141183460469231731687264654884105728
Scale -170141183460469231731687264.654884105728

Can you confirm that is is the expected value? Are your other Parquet tools showing the same result?

The negative value I made up myself just to make sure negative values are covered.

If you have another sample that you can share for verification, that would also be helpful. Either that, or point us to a useful tool for generating the needed data. The sample should have:

  • Negative decimal values
  • A mixture of decimals encoded as fixed length byte arrays and byte arrays

craxal avatar Feb 21 '25 23:02 craxal

I managed to create a Parquet file using C# code containing decimals encoded as byte arrays and confirmed the above sequence successfully decodes decimals.

craxal avatar Mar 04 '25 01:03 craxal