Clarify that column names are sorted when schema is created.
This set of column definitions:
[]dynparquet.ColumnDefinition{
{
Name: "timestamp",
StorageLayout: parquet.Int(64),
Dynamic: false,
},
{
Name: "state",
StorageLayout: parquet.String(),
Dynamic: false,
},
{
Name: "labels",
StorageLayout: parquet.Encoded(parquet.Optional(parquet.String()), &parquet.RLEDictionary),
Dynamic: true,
},
}
will actually have column indexes:
{labels: 0, state: 1, timestamp: 2}
So, when inserting fields, you must get the column indexes by searching the schema's column slice, rather than based on your own definition.
Documentation on functions and/or examples where this happens would be useful to new contributors.
Aside: what is the reason that column names are sorted on schema creation?
We need to have them sorted so that when we merge two row groups with different dynamic columns we can perform a k-way merge. We’ve been thinking about improving the ergonomics of the insert API so that people don’t have to care about the index positions, but haven’t come up with anything satisfactory (yet).
Would you be willing to contribute some docs where it would have been helpful for you to understand that this sorting is happening?
Sure thing. I'll send a PR when I get a bit of spare time. If GH lets you, assigning it to me would be great - then it'll appear in my standard GH searches that I use.
Done!
I know it doesn't improve the original issue, but I just want to point out here how Parca uses a switch-case to get the correct columns and then builds the index around that (the columnIndex).
https://github.com/parca-dev/parca/blob/5bfbe7ebe340a8e0cfea28a2589117a83250744b/pkg/parcacol/sample.go#L128
@metalmatze thanks, that's a helpful example.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.