parquet-go

Data inconsistency when parsing bit string exported from Aurora

Open buchuitoudegou opened this issue 2 years ago • 0 comments

Hi, when I tried parsing a Parquet file exported from Aurora with parquet-go, I ran into an unexpected data inconsistency. Some columns in my table are binary types, such as BINARY, BLOB, and LONGBLOB. The complete table schema is as follows:

CREATE TABLE jxtest(
    `id` char(36) NOT NULL,
    `a` bigint unsigned NOT NULL,
    `aa` bigint signed NOT NULL,
    `b` int(11) unsigned NOT NULL,
    `bb` int(11) signed NOT NULL,
    `c` smallint signed NOT NULL,
    `cc` smallint unsigned NOT NULL,
    `d` tinyint signed NOT NULL,
    `dd` tinyint unsigned NOT NULL,
    `e` float unsigned NOT NULL,
    `ee` float signed NOT NULL,
    `f` VARCHAR(30) NOT NULL,
    `ff` TEXT NOT NULL,
    `h` MEDIUMTEXT NOT NULL,
    `hh` LONGTEXT NOT NULL,
    `ii` TINYTEXT NOT NULL,
    `j` DECIMAL NOT NULL,
    `jj` DECIMAL(8,0) NOT NULL,
    `k` DECIMAL(8,8) NOT NULL,
    `kk` DECIMAL(20,0) NOT NULL,
    `l` DECIMAL(20,8) NOT NULL,
    `ll` DECIMAL(36,0) NOT NULL,
    `m` DECIMAL(36,8) NOT NULL,
    `mm` DATE NOT NULL,
    `n` TIME NOT NULL,
    `nn` YEAR NOT NULL,
    `o` DATETIME NOT NULL,
    `oo` BINARY NOT NULL,
    `p` BLOB NOT NULL,
    `pp` LONGBLOB NOT NULL,
    `q` MEDIUMBLOB NOT NULL,
    `qq` TINYBLOB NOT NULL,
    `rr` BIT NOT NULL,
    `s` BOOLEAN NOT NULL,
    `ss` DOUBLE signed NOT NULL,
    `t` DOUBLE unsigned NOT NULL,
    PRIMARY KEY ( `id` ),
    KEY `index_a` (`a`) );

When I parsed the Parquet file using PyArrow, the result was: (image) There, the values are bit strings. But when I parsed the same columns with parquet-go, the output, together with each column's schema element, was as follows:

schema element: SchemaElement({Type:BYTE_ARRAY TypeLength:<nil> RepetitionType:OPTIONAL Name:P NumChildren:<nil> ConvertedType:<nil> Scale:<nil> Precision:<nil> FieldID:<nil> LogicalType:<nil>}), string: 111111111
schema element: SchemaElement({Type:BYTE_ARRAY TypeLength:<nil> RepetitionType:OPTIONAL Name:Pp NumChildren:<nil> ConvertedType:<nil> Scale:<nil> Precision:<nil> FieldID:<nil> LogicalType:<nil>}), string: 1111111111
schema element: SchemaElement({Type:BYTE_ARRAY TypeLength:<nil> RepetitionType:OPTIONAL Name:Q NumChildren:<nil> ConvertedType:<nil> Scale:<nil> Precision:<nil> FieldID:<nil> LogicalType:<nil>}), string: 111111111

I expected these values to be something like 0x7F, but I got the plain-text string 111111111, which is definitely not equal to b'111111111'.
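
To make the comparison concrete, here is a minimal sketch of how a returned value could be inspected byte by byte in Go. It assumes parquet-go hands back a BYTE_ARRAY value as a Go string, as the output above suggests. The helper names `hexDump` and `isASCIIBits` are hypothetical, and no parquet-go API is used.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// hexDump returns the raw bytes of s as lowercase hex, so that ASCII
// digits (0x30, 0x31) can be told apart from packed binary values.
func hexDump(s string) string {
	return hex.EncodeToString([]byte(s))
}

// isASCIIBits reports whether s is non-empty and consists only of the
// ASCII characters '0' and '1', i.e. a bit string stored as text.
func isASCIIBits(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		if s[i] != '0' && s[i] != '1' {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical value, copied from the output shown for column P.
	got := "111111111"
	// Hypothetical packed value of the kind I expected (0x7F).
	want := "\x7f"

	fmt.Printf("got:  hex=%s asciiBits=%v\n", hexDump(got), isASCIIBits(got))
	fmt.Printf("want: hex=%s asciiBits=%v\n", hexDump(want), isASCIIBits(want))
}
```

If the hex dump of a column value consists only of 30 and 31 bytes, the file itself stores the bits as ASCII characters, and the difference from PyArrow may only be in how each tool displays the same bytes.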

Could you please explain this behavior? Thank you all in advance.

buchuitoudegou · Sep 13 '22 06:09