Verify row index implementation
I'm not sure whether the existing row index implementation is correct. The documentation is slightly hard to interpret, particularly these sections from https://orc.apache.org/docs/spec-index.html:
To record positions, each stream needs a sequence of numbers. For uncompressed streams, the position is the byte offset of the RLE run’s start location followed by the number of values that need to be consumed from the run. In compressed streams, the first number is the start of the compression chunk in the stream, followed by the number of decompressed bytes that need to be consumed, and finally the number of values consumed in the RLE.
For columns with multiple streams, the sequences of positions in each stream are concatenated. That was an unfortunate decision on my part that we should fix at some point, because it makes code that uses the indexes error-prone.
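To make the quoted layout concrete, here is a minimal Go sketch of how a concatenated position list could be split back into per-stream positions. The names `StreamPosition` and `SplitPositions` are hypothetical and not part of this library. It assumes every stream contributes exactly the two (uncompressed) or three (compressed) numbers described in the quoted text; real streams may record a different count depending on their kind and encoding, which this sketch does not handle.

```go
package main

import "fmt"

// StreamPosition is a hypothetical type holding the position recorded for
// one stream in a row index entry.
type StreamPosition struct {
	// ChunkStart is the start of the compression chunk within the stream.
	// It is only set for compressed streams.
	ChunkStart uint64
	// Offset is the byte offset of the RLE run's start (uncompressed), or
	// the number of decompressed bytes to consume in the chunk (compressed).
	Offset uint64
	// RunValues is the number of values to consume from the RLE run.
	RunValues uint64
}

// SplitPositions splits the concatenated position list of a column into one
// StreamPosition per stream. Assumption: each stream records exactly two
// numbers when uncompressed and three when compressed.
func SplitPositions(positions []uint64, streams int, compressed bool) ([]StreamPosition, error) {
	width := 2
	if compressed {
		width = 3
	}
	if streams <= 0 || len(positions) != streams*width {
		return nil, fmt.Errorf("expected %d positions for %d streams, got %d",
			streams*width, streams, len(positions))
	}
	out := make([]StreamPosition, 0, streams)
	for i := 0; i < len(positions); i += width {
		p := positions[i : i+width]
		if compressed {
			out = append(out, StreamPosition{ChunkStart: p[0], Offset: p[1], RunValues: p[2]})
		} else {
			out = append(out, StreamPosition{Offset: p[0], RunValues: p[1]})
		}
	}
	return out, nil
}
```

For example, calling `SplitPositions([]uint64{0, 10, 3, 512, 40, 7}, 2, true)` would yield two entries, the second with `ChunkStart` 512, `Offset` 40 and `RunValues` 7. Returning an error on a length mismatch is a deliberate choice here: because the sequences are simply concatenated, a wrong assumption about how many numbers a stream records silently shifts every later stream's position.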
I think I am running into a bug due to row indexes.
If I write over 10,000 rows to a single file, Athena returns the following error:
HIVE_CURSOR_ERROR: index (0) must be less than size (0)
orc-tools has no problem reading the metadata or scanning the file, though.
@scritchley is there any update on the correctness of the row index implementation?