[Bug]: Mismatch between parquet schema and record schema
What happened?
The values read from a data file were incorrect because the file's schema did not match the table schema, even though the table schema had not been changed.
On inspection, the Parquet file that triggered the error differs from the other Parquet files in its compression codec: the failing file is compressed with GZIP, while the other files are compressed with ZSTD, which matches the table property write.parquet.compression-codec.
CREATE TABLE iceberg.pda_sparkdata.callduration_spark_ice (
`orgid` STRING,
`time_stamp` STRING,
`userid` STRING,
`legduration` INT,
`call_id` STRING,
`locus` STRING,
`start_time` STRING,
`uatype` STRING,
`devicetype` STRING,
`uaversion` STRING,
`crid` STRING,
`source` STRING,
`has_audio` BOOLEAN,
`has_video` BOOLEAN,
`dataid` STRING,
`relation_name` STRING,
`pdate` STRING)
USING iceberg
PARTITIONED BY (pdate, bucket(16, orgid))
LOCATION 'hdfs://nameservice1/user/hive/warehouse/pda_sparkdata.db/callduration_spark_ice'
TBLPROPERTIES(
'commit.retry.min-wait-ms' = '1000',
'commit.retry.num-retries' = '25',
'current-snapshot-id' = '6047467073886370293',
'custom-table-version' = '2',
'format' = 'iceberg/parquet',
'history.expire.max-snapshot-age-ms' = '259200000',
'self-optimizing.enabled' = 'true',
'self-optimizing.group' = 'DLS-AMSFlink',
'write.metadata.delete-after-commit.enabled' = 'true',
'write.metadata.metrics.default' = 'full',
'write.parquet.compression-codec' = 'zstd')
Affects Versions
master
What engines are you seeing the problem on?
Optimizer
How to reproduce
No response
Relevant log output
2023-11-15 10:39:25,991 INFO [main] [org.apache.hadoop.io.compress.CodecPool] [] - Got brand-new decompressor [.gz]
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.CharSequence
at org.apache.iceberg.parquet.ParquetValueWriters$StringWriter.write(ParquetValueWriters.java:324)
at org.apache.iceberg.parquet.ParquetValueWriters$OptionWriter.write(ParquetValueWriters.java:356)
at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:589)
at org.apache.iceberg.parquet.ParquetWriter.add(ParquetWriter.java:138)
at org.apache.iceberg.io.DataWriter.write(DataWriter.java:71)
at org.apache.iceberg.io.RollingFileWriter.write(RollingFileWriter.java:90)
at org.apache.iceberg.io.RollingDataWriter.write(RollingDataWriter.java:32)
at com.netease.arctic.optimizing.AbstractRewriteFilesExecutor.rewriterDataFiles(AbstractRewriteFilesExecutor.java:150)
at com.netease.arctic.table.TableMetaStore.call(TableMetaStore.java:234)
at com.netease.arctic.table.TableMetaStore.lambda$doAs$0(TableMetaStore.java:209)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
at com.netease.arctic.table.TableMetaStore.doAs(TableMetaStore.java:209)
at com.netease.arctic.io.ArcticHadoopFileIO.doAs(ArcticHadoopFileIO.java:200)
at com.netease.arctic.optimizing.AbstractRewriteFilesExecutor.execute(AbstractRewriteFilesExecutor.java:105)
at com.netease.arctic.optimizing.AbstractRewriteFilesExecutor.execute(AbstractRewriteFilesExecutor.java:61)
Anything else
Incorrect parquet file schema:
creator: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
extra: iceberg.schema = {"type":"struct","schema-id":12,"fields":[{"id":9,"name":"orgid","required":false,"type":"string"},{"id":3,"name":"time_stamp","required":false,"type":"string"},{"id":8, [more]...
file schema: table
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
orgid: OPTIONAL BINARY O:UTF8 R:0 D:1
time_stamp: OPTIONAL BINARY O:UTF8 R:0 D:1
userid: OPTIONAL BINARY O:UTF8 R:0 D:1
legduration: OPTIONAL INT32 R:0 D:1
call_id: OPTIONAL BINARY O:UTF8 R:0 D:1
locus: OPTIONAL BINARY O:UTF8 R:0 D:1
start_time: OPTIONAL BINARY O:UTF8 R:0 D:1
uatype: OPTIONAL BINARY O:UTF8 R:0 D:1
devicetype: OPTIONAL BINARY O:UTF8 R:0 D:1
uaversion: OPTIONAL BINARY O:UTF8 R:0 D:1
crid: OPTIONAL BINARY O:UTF8 R:0 D:1
source: OPTIONAL BINARY O:UTF8 R:0 D:1
has_audio: OPTIONAL BOOLEAN R:0 D:1
has_video: OPTIONAL BOOLEAN R:0 D:1
dataid: OPTIONAL BINARY O:UTF8 R:0 D:1
relation_name: OPTIONAL BINARY O:UTF8 R:0 D:1
pdate: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:1957 TS:395047
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
orgid: BINARY GZIP DO:4 FPO:853 SZ:1119/1829/1.63 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
time_stamp: BINARY GZIP DO:1123 FPO:2601 SZ:3432/12291/3.58 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
userid: BINARY GZIP DO:0 FPO:4555 SZ:42969/78315/1.82 VC:1957 ENC:BIT_PACKED,PLAIN,RLE
legduration: INT32 GZIP DO:0 FPO:47524 SZ:4126/7861/1.91 VC:1957 ENC:BIT_PACKED,PLAIN,RLE
call_id: BINARY GZIP DO:51650 FPO:57434 SZ:6841/17997/2.63 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
locus: BINARY GZIP DO:58491 FPO:62437 SZ:5003/11526/2.30 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
start_time: BINARY GZIP DO:63494 FPO:65137 SZ:2700/8413/3.12 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
uatype: BINARY GZIP DO:66194 FPO:66349 SZ:532/803/1.51 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
devicetype: BINARY GZIP DO:66726 FPO:66855 SZ:499/779/1.56 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
uaversion: BINARY GZIP DO:67225 FPO:67414 SZ:559/729/1.30 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
crid: BINARY GZIP DO:67784 FPO:108687 SZ:43659/76161/1.74 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
source: BINARY GZIP DO:111443 FPO:111491 SZ:100/62/0.62 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
has_audio: BOOLEAN GZIP DO:0 FPO:111543 SZ:56/277/4.95 VC:1957 ENC:BIT_PACKED,PLAIN,RLE
has_video: BOOLEAN GZIP DO:0 FPO:111599 SZ:56/277/4.95 VC:1957 ENC:BIT_PACKED,PLAIN,RLE
dataid: BINARY GZIP DO:0 FPO:111655 SZ:57794/177591/3.07 VC:1957 ENC:BIT_PACKED,PLAIN,RLE
relation_name: BINARY GZIP DO:169449 FPO:169504 SZ:107/69/0.64 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
pdate: BINARY GZIP DO:169556 FPO:169609 SZ:105/67/0.64 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
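Since the codec is the visible difference between the failing file and the rest, one way to look for other suspect files is to scan such dumps for a codec that differs from the table property. The following is only a rough sketch: it assumes the column-chunk line layout shown in the dump above, and the helper names (`codecs_in_dump`, `differs_from_table`) are made up for illustration.

```python
import re

# Column-chunk lines as they appear in the dump above:
# "<name>: <TYPE> <CODEC> DO:<n> FPO:<n> SZ:... VC:<n> ENC:..."
# Schema lines ("orgid: OPTIONAL BINARY O:UTF8 ...") do not match because
# the third token is not followed by "DO:<n>".
COLUMN_LINE = re.compile(r"^(?P<name>\w+):\s+(?P<type>\w+)\s+(?P<codec>\w+)\s+DO:\d+")


def codecs_in_dump(dump_text):
    """Return the set of compression codecs named in the column-chunk lines."""
    codecs = set()
    for line in dump_text.splitlines():
        match = COLUMN_LINE.match(line.strip())
        if match:
            codecs.add(match.group("codec").lower())
    return codecs


def differs_from_table(dump_text, table_codec):
    """True if any column chunk uses a codec other than the table property."""
    return any(c != table_codec.lower() for c in codecs_in_dump(dump_text))


# Two lines copied from the dump above.
sample = """\
orgid: BINARY GZIP DO:4 FPO:853 SZ:1119/1829/1.63 VC:1957 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
legduration: INT32 GZIP DO:0 FPO:47524 SZ:4126/7861/1.91 VC:1957 ENC:BIT_PACKED,PLAIN,RLE
"""
print(codecs_in_dump(sample))
print(differs_from_table(sample, "zstd"))
```

A codec mismatch alone does not prove a file is bad; it only suggests the file was produced by a different writer than the one configured on the table and is worth inspecting.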
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I found that the Parquet schema of the failing file is incorrect. A correct file looks like:
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
iceberg.schema: {"type":"struct","schema-id":0,"fields":[{"id":1,"name":"orgid","required":false,"type":"string"},{"id":2,"name":"time_stamp","required":false,"type":"string"},{"id":3,"name":"userid","required":false,"type":"string"},{"id":4,"name":"legduration","required":false,"type":"int"},{"id":5,"name":"call_id","required":false,"type":"string"},{"id":6,"name":"locus","required":false,"type":"string"},{"id":7,"name":"start_time","required":false,"type":"string"},{"id":8,"name":"uatype","required":false,"type":"string"},{"id":9,"name":"devicetype","required":false,"type":"string"},{"id":10,"name":"uaversion","required":false,"type":"string"},{"id":11,"name":"crid","required":false,"type":"string"},{"id":12,"name":"source","required":false,"type":"string"},{"id":13,"name":"has_audio","required":false,"type":"boolean"},{"id":14,"name":"has_video","required":false,"type":"boolean"},{"id":15,"name":"dataid","required":false,"type":"string"},{"id":16,"name":"relation_name","required":false,"type":"string"},{"id":17,"name":"pdate","required":false,"type":"string"}]}
Schema:
message table {
optional binary orgid (STRING) = 1;
optional binary time_stamp (STRING) = 2;
optional binary userid (STRING) = 3;
optional int32 legduration = 4;
optional binary call_id (STRING) = 5;
optional binary locus (STRING) = 6;
optional binary start_time (STRING) = 7;
optional binary uatype (STRING) = 8;
optional binary devicetype (STRING) = 9;
optional binary uaversion (STRING) = 10;
optional binary crid (STRING) = 11;
optional binary source (STRING) = 12;
optional boolean has_audio = 13;
optional boolean has_video = 14;
optional binary dataid (STRING) = 15;
optional binary relation_name (STRING) = 16;
optional binary pdate (STRING) = 17;
}
The incorrect file looks like:
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
iceberg.schema: {"type":"struct","schema-id":12,"fields":[{"id":9,"name":"orgid","required":false,"type":"string"},{"id":3,"name":"time_stamp","required":false,"type":"string"},{"id":8,"name":"userid","required":false,"type":"string"},{"id":1,"name":"legduration","required":false,"type":"int"},{"id":12,"name":"call_id","required":false,"type":"string"},{"id":16,"name":"locus","required":false,"type":"string"},{"id":13,"name":"start_time","required":false,"type":"string"},{"id":14,"name":"uatype","required":false,"type":"string"},{"id":10,"name":"devicetype","required":false,"type":"string"},{"id":2,"name":"uaversion","required":false,"type":"string"},{"id":4,"name":"crid","required":false,"type":"string"},{"id":7,"name":"source","required":false,"type":"string"},{"id":5,"name":"has_audio","required":false,"type":"boolean"},{"id":11,"name":"has_video","required":false,"type":"boolean"},{"id":15,"name":"dataid","required":false,"type":"string"},{"id":17,"name":"relation_name","required":false,"type":"string"},{"id":6,"name":"pdate","required":false,"type":"string"}]}
Schema:
message table {
optional binary orgid (STRING) = 9;
optional binary time_stamp (STRING) = 3;
optional binary userid (STRING) = 8;
optional int32 legduration = 1;
optional binary call_id (STRING) = 12;
optional binary locus (STRING) = 16;
optional binary start_time (STRING) = 13;
optional binary uatype (STRING) = 14;
optional binary devicetype (STRING) = 10;
optional binary uaversion (STRING) = 2;
optional binary crid (STRING) = 4;
optional binary source (STRING) = 7;
optional boolean has_audio = 5;
optional boolean has_video = 11;
optional binary dataid (STRING) = 15;
optional binary relation_name (STRING) = 17;
optional binary pdate (STRING) = 6;
}
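The two dumps above have the same column names and types but different field IDs. Iceberg resolves columns by field ID rather than by name, so the sketch below pairs each table column with the file column that carries the same ID. It assumes the table's current schema uses the IDs from the first dump (schema-id 0); the field lists are copied from the two iceberg.schema properties, and the function names are hypothetical.

```python
# (id, name, type) from the first dump (schema-id 0), assumed to be the table schema.
TABLE_FIELDS = [
    (1, "orgid", "string"), (2, "time_stamp", "string"), (3, "userid", "string"),
    (4, "legduration", "int"), (5, "call_id", "string"), (6, "locus", "string"),
    (7, "start_time", "string"), (8, "uatype", "string"), (9, "devicetype", "string"),
    (10, "uaversion", "string"), (11, "crid", "string"), (12, "source", "string"),
    (13, "has_audio", "boolean"), (14, "has_video", "boolean"), (15, "dataid", "string"),
    (16, "relation_name", "string"), (17, "pdate", "string"),
]

# (id, name, type) from the second dump (schema-id 12), the failing file.
FILE_FIELDS = [
    (9, "orgid", "string"), (3, "time_stamp", "string"), (8, "userid", "string"),
    (1, "legduration", "int"), (12, "call_id", "string"), (16, "locus", "string"),
    (13, "start_time", "string"), (14, "uatype", "string"), (10, "devicetype", "string"),
    (2, "uaversion", "string"), (4, "crid", "string"), (7, "source", "string"),
    (5, "has_audio", "boolean"), (11, "has_video", "boolean"), (15, "dataid", "string"),
    (17, "relation_name", "string"), (6, "pdate", "string"),
]


def resolve_by_id(table_fields, file_fields):
    """Pair each table column with the file column that has the same field ID."""
    file_by_id = {fid: (name, typ) for fid, name, typ in file_fields}
    rows = []
    for fid, name, typ in table_fields:
        file_name, file_type = file_by_id.get(fid, (None, None))
        rows.append((fid, name, typ, file_name, file_type))
    return rows


def type_mismatches(rows):
    """Keep only the pairs whose table type and file type differ."""
    return [row for row in rows if row[4] is not None and row[2] != row[4]]


rows = resolve_by_id(TABLE_FIELDS, FILE_FIELDS)
for fid, name, typ, file_name, file_type in type_mismatches(rows):
    print(f"id {fid}: table {name} ({typ}) <- file {file_name} ({file_type})")
```

Under that assumption, six IDs (1, 4, 5, 11, 13, 14) resolve to a column of a different type, and the first one maps the table's string column orgid to the file's int column legduration. That would be consistent with the ClassCastException from StringWriter in the log, though it does not explain how the file came to be written with these IDs.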
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'