The column path is always overridden by the last modification
Describe the bug
I have this schema:
message test {
  optional group a {
    optional group foo (MAP) {
      repeated group key_value {
        required binary key (STRING);
        optional binary value (STRING);
      }
    }
  }
}
The problem: writing the data to a file succeeds without error and seems fine, but when I use parquet-tools to cat the parquet file, it gives this error:
java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1
at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:272)
at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:246)
at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:195)
at org.apache.parquet.tools.command.DumpCommand.execute(DumpCommand.java:148)
at org.apache.parquet.tools.Main.main(Main.java:223)
Unit test to reproduce
Described above; a minimal reproduction sketch follows.
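This is a sketch of a writer that should trigger the bug, assuming the goparquet.NewFileWriter / WithSchemaDefinition / AddData API of fraugster/parquet-go; the output file name and the sample key/value data are made up for illustration:

package main

import (
    "os"

    goparquet "github.com/fraugster/parquet-go"
    "github.com/fraugster/parquet-go/parquetschema"
)

func main() {
    // the schema from the report
    sd, err := parquetschema.ParseSchemaDefinition(`message test {
      optional group a {
        optional group foo (MAP) {
          repeated group key_value {
            required binary key (STRING);
            optional binary value (STRING);
          }
        }
      }
    }`)
    if err != nil {
        panic(err)
    }

    f, err := os.Create("test.parquet") // hypothetical output path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    fw := goparquet.NewFileWriter(f, goparquet.WithSchemaDefinition(sd))

    // one row with a single map entry; binary columns take []byte
    if err := fw.AddData(map[string]interface{}{
        "a": map[string]interface{}{
            "foo": map[string]interface{}{
                "key_value": []map[string]interface{}{
                    {"key": []byte("k1"), "value": []byte("v1")},
                },
            },
        },
    }); err != nil {
        panic(err)
    }

    // writing succeeds; it is reading the file back that fails
    if err := fw.Close(); err != nil {
        panic(err)
    }
}

Running parquet-tools against the resulting file should reproduce the IllegalArgumentException above.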
I guess the root cause is in schema.go:
func recursiveFix(col *Column, colPath ColumnPath, maxR, maxD uint16, alloc *allocTracker) {
    .......
    col.maxR = maxR
    col.maxD = maxD
    // at line 684: append can reuse colPath's underlying array instead of allocating a new one
    col.path = append(colPath, col.name)
    if col.data != nil {
        col.data.reset(col.rep, col.maxR, col.maxD)
        return
    }
    for i := range col.children {
        // every child appends to the same col.path, so each sibling overwrites the
        // previous one's name in the shared backing array; no matter how many children
        // there are, every path ends up as the last child's path (the bug at line 684)
        recursiveFix(col.children[i], col.path, maxR, maxD, alloc)
    }
}
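For context, this is ordinary Go slice behavior rather than anything parquet-go-specific: append reuses the destination's backing array whenever there is spare capacity, so several appends that start from the same parent slice write into the same memory. A self-contained demo:

package main

import "fmt"

func main() {
    // parent path with spare capacity, like a slice built by earlier appends
    parent := make([]string, 0, 4)
    parent = append(parent, "a", "foo")

    // two "children" appended from the same parent slice
    key := append(parent, "key")
    value := append(parent, "value")

    // both appends wrote into parent's backing array at index 2,
    // so key now reads [a foo value] instead of [a foo key]
    fmt.Println(key)   // [a foo value]
    fmt.Println(value) // [a foo value]
}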
So the quick fix should be:
// copy the parent path first
col.path = append([]string(nil), colPath...)
col.path = append(col.path, col.name)
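The copy gives col.path its own backing array, so sibling appends can no longer alias each other. A hedged alternative (a plain Go slice idiom, not taken from the codebase) is a full slice expression that caps the capacity at the length, forcing any subsequent append to reallocate:

// three-index slice: cap == len, so append must copy to a new array
col.path = append(colPath[:len(colPath):len(colPath)], col.name)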
parquet-go specific details
- What version are you using? 0.12.0
- Can this be reproduced in earlier versions? Not sure.
Misc Details
- Are you using AWS Athena, Google BigQuery, presto... ? No, just a normal parquet file.
- Any other relevant details... how big are the files / rowgroups you're trying to read/write? A very small file.