The column path is always overridden by the last modification
Describe the bug
I have this schema:
message test {
  optional group a {
    optional group foo (MAP) {
      repeated group key_value {
        required binary key (STRING);
        optional binary value (STRING);
      }
    }
  }
}
The problem: writing the data to a file succeeds without error and seems fine, but when I use parquet-tools to cat the parquet file, it gives this error:
java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1
at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:272)
at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:246)
at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:195)
at org.apache.parquet.tools.command.DumpCommand.execute(DumpCommand.java:148)
at org.apache.parquet.tools.Main.main(Main.java:223)
Unit test to reproduce
Described above; a minimal reproduction sketch follows.
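This is a sketch of a writer that should trigger the bug, assuming the goparquet.NewFileWriter / WithSchemaDefinition / AddData API of fraugster/parquet-go; the output file name and the sample key/value data are made up for illustration:

package main

import (
    "os"

    goparquet "github.com/fraugster/parquet-go"
    "github.com/fraugster/parquet-go/parquetschema"
)

func main() {
    // the schema from the report
    sd, err := parquetschema.ParseSchemaDefinition(`message test {
      optional group a {
        optional group foo (MAP) {
          repeated group key_value {
            required binary key (STRING);
            optional binary value (STRING);
          }
        }
      }
    }`)
    if err != nil {
        panic(err)
    }

    f, err := os.Create("test.parquet") // hypothetical output path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    fw := goparquet.NewFileWriter(f, goparquet.WithSchemaDefinition(sd))

    // one row with a single map entry; binary columns take []byte
    if err := fw.AddData(map[string]interface{}{
        "a": map[string]interface{}{
            "foo": map[string]interface{}{
                "key_value": []map[string]interface{}{
                    {"key": []byte("k1"), "value": []byte("v1")},
                },
            },
        },
    }); err != nil {
        panic(err)
    }

    // writing succeeds; it is reading the file back that fails
    if err := fw.Close(); err != nil {
        panic(err)
    }
}

Running parquet-tools against the resulting file should reproduce the IllegalArgumentException above.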
I guess the root cause is in schema.go:
func recursiveFix(col *Column, colPath ColumnPath, maxR, maxD uint16, alloc *allocTracker) {
    .......
    col.maxR = maxR
    col.maxD = maxD
    // at line 684: append can reuse colPath's underlying array instead of allocating a new one
    col.path = append(colPath, col.name)
    if col.data != nil {
        col.data.reset(col.rep, col.maxR, col.maxD)
        return
    }
    for i := range col.children {
        // every child appends to the same col.path, so each sibling overwrites the
        // previous one's name in the shared backing array; no matter how many children
        // there are, every path ends up as the last child's path (the bug at line 684)
        recursiveFix(col.children[i], col.path, maxR, maxD, alloc)
    }
}
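For context, this is ordinary Go slice behavior rather than anything parquet-go-specific: append reuses the destination's backing array whenever there is spare capacity, so several appends that start from the same parent slice write into the same memory. A self-contained demo:

package main

import "fmt"

func main() {
    // parent path with spare capacity, like a slice built by earlier appends
    parent := make([]string, 0, 4)
    parent = append(parent, "a", "foo")

    // two "children" appended from the same parent slice
    key := append(parent, "key")
    value := append(parent, "value")

    // both appends wrote into parent's backing array at index 2,
    // so key now reads [a foo value] instead of [a foo key]
    fmt.Println(key)   // [a foo value]
    fmt.Println(value) // [a foo value]
}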
So the quick fix should be:
// copy the parent path first
col.path = append([]string(nil), colPath...)
col.path = append(col.path, col.name)
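The copy gives col.path its own backing array, so sibling appends can no longer alias each other. A hedged alternative (a plain Go slice idiom, not taken from the codebase) is a full slice expression that caps the capacity at the length, forcing any subsequent append to reallocate:

// three-index slice: cap == len, so append must copy to a new array
col.path = append(colPath[:len(colPath):len(colPath)], col.name)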
parquet-go specific details
- What version are you using? 0.12.0
- Can this be reproduced in earlier versions? Not sure.
Misc Details
- Are you using AWS Athena, Google BigQuery, presto... ? No, just a normal parquet file.
- Any other relevant details... how big are the files / rowgroups you're trying to read/write? A very small file.