orc Verify row index implementation

Open scritchley opened this issue 7 years ago • 2 comments

I'm not sure whether the existing row index implementation is correct. The documentation is somewhat hard to interpret, particularly these sections from https://orc.apache.org/docs/spec-index.html:

To record positions, each stream needs a sequence of numbers. For uncompressed streams, the position is the byte offset of the RLE run’s start location followed by the number of values that need to be consumed from the run. In compressed streams, the first number is the start of the compression chunk in the stream, followed by the number of decompressed bytes that need to be consumed, and finally the number of values consumed in the RLE.

For columns with multiple streams, the sequences of positions in each stream are concatenated. That was an unfortunate decision on my part that we should fix at some point, because it makes code that uses the indexes error-prone.
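To make my reading of those two paragraphs concrete, here is a minimal Go sketch of how the concatenated position list for one column could be split back into per-stream positions. The `StreamPosition` type and the `splitPositions` function are hypothetical names for illustration, not part of this library. The sketch assumes every stream in the column is an RLE stream recording exactly the numbers described in the quoted text (two when uncompressed, three when compressed); streams with a different encoding may record a different count, which this sketch does not handle.

```go
package main

import (
	"errors"
	"fmt"
)

// StreamPosition is a hypothetical container for one stream's recorded position.
type StreamPosition struct {
	// ChunkOffset is the start of the compression chunk within the stream.
	// It is only meaningful for compressed streams.
	ChunkOffset uint64
	// ByteOffset is the byte offset of the RLE run's start (uncompressed),
	// or the number of decompressed bytes to consume within the chunk (compressed).
	ByteOffset uint64
	// RunValues is the number of values to consume from the RLE run.
	RunValues uint64
}

// splitPositions splits the concatenated positions of one row index entry
// into one StreamPosition per stream.
// Assumption: each stream contributes exactly 2 numbers when uncompressed
// and exactly 3 numbers when compressed.
func splitPositions(positions []uint64, numStreams int, compressed bool) ([]StreamPosition, error) {
	width := 2
	if compressed {
		width = 3
	}
	if numStreams < 0 || len(positions) != numStreams*width {
		return nil, errors.New("position count does not match stream count")
	}
	out := make([]StreamPosition, 0, numStreams)
	for i := 0; i < len(positions); i += width {
		p := positions[i : i+width]
		if compressed {
			out = append(out, StreamPosition{ChunkOffset: p[0], ByteOffset: p[1], RunValues: p[2]})
		} else {
			out = append(out, StreamPosition{ByteOffset: p[0], RunValues: p[1]})
		}
	}
	return out, nil
}

func main() {
	// Two compressed streams, so six numbers concatenated in stream order.
	streams, err := splitPositions([]uint64{100, 20, 5, 300, 0, 7}, 2, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, s := range streams {
		fmt.Printf("stream %d: %+v\n", i, s)
	}
}
```

The length check is the part that matters for this issue: if the writer and a reader disagree about how many numbers each stream contributes, the reader ends up indexing past the end of the list.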

Mar 16 '17 22:03 scritchley

I think I am running into a bug due to row indexes.

If I write over 10,000 rows to a single file, then Athena returns the following error:

HIVE_CURSOR_ERROR: index (0) must be less than size (0)

orc-tools doesn't have a problem reading the metadata or scanning the file, though.

Oct 25 '17 22:10 mattatcha

@scritchley is there any update on the correctness of using the row index implementation?

Jul 21 '20 19:07 athum
