
Offset overflow errors can be confusing for users

Open: westonpace opened this issue 1 year ago

When using binary or string columns, a single batch of data cannot contain more than 2GiB of data. Users will either need to use large_binary and large_string or set a custom batch size when reading this data.
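To make the limit concrete, here is a back-of-the-envelope sketch (illustrative arithmetic, not Lance code): string and binary arrays index their values with signed 32-bit offsets, so one array's concatenated bytes must fit in i32::MAX.

```rust
// String/binary arrays index their payload with i32 offsets, so one
// array's concatenated bytes must stay at or below i32::MAX (~2 GiB).
const MAX_OFFSET_BYTES: u64 = i32::MAX as u64; // 2_147_483_647

/// Rough upper bound on rows per batch for a given average value size.
/// Purely illustrative; Lance's actual sizing logic is more involved.
fn max_rows_per_batch(avg_value_bytes: u64) -> u64 {
    MAX_OFFSET_BYTES / avg_value_bytes
}
```

For example, with 1 MiB average values a batch can hold at most 2047 rows before the offsets overflow.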

However, the error they run into, an "offset overflow" error, is a panic (not great) and very confusing. It is not obvious that the solution is to reduce the batch size:

thread 'lance_background_thread' panicked at .../arrow-data-52.2.0/src/transform/utils.rs:42:56:
offset overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at .../rust/lance-encoding/src/decoder.rs:1267:65:
called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(12814), ...)

Ideally we should return an Err here (not panic), and the message should say something like "Could not create array with more than 2GiB of string/binary data. Please try reducing the batch_size."
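A minimal sketch of the suggested behavior, with a hypothetical guard run before the array is built (the function name, error type, and check are illustrative, not Lance's actual code):

```rust
/// Hypothetical guard: fail early with an actionable message instead of
/// letting arrow-rs panic with "offset overflow".
fn check_string_capacity(total_bytes: usize) -> Result<(), String> {
    if total_bytes > i32::MAX as usize {
        return Err(format!(
            "Could not create array with more than 2GiB of string/binary \
             data ({total_bytes} bytes). Please try reducing the batch_size."
        ));
    }
    Ok(())
}
```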

westonpace avatar Aug 22 '24 18:08 westonpace

@klibiadam it looks like you have a very good start on this issue!

Would you like me to assign this issue to you?

broccoliSpicy avatar Aug 28 '24 19:08 broccoliSpicy

As a simpler fix, Arrow has a try_append_value API, which might help with this.

wjones127 avatar Oct 28 '25 00:10 wjones127

One place a user can get this is during compaction. Is there a way we can possibly avoid this during compaction?

wjones127 avatar Oct 30 '25 18:10 wjones127

Kind of inconvenient, but there is a batch_size setting in the compaction options. There is also a LANCE_DEFAULT_BATCH_SIZE environment variable which changes the default if unspecified. Neither is a great solution, but both can help as workarounds.
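As a sketch of how the environment-variable workaround could be consumed (the function and the 8192-row fallback are assumptions for illustration, not Lance's actual default or code):

```rust
use std::env;

/// Resolve the effective read batch size, honoring the
/// LANCE_DEFAULT_BATCH_SIZE override described above.
fn effective_batch_size() -> u32 {
    env::var("LANCE_DEFAULT_BATCH_SIZE")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(8192) // illustrative fallback, not Lance's real default
}
```

Setting the variable to a smaller value (e.g. 1024) before reading keeps each batch's string/binary payload under the 2GiB offset limit for larger values.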

westonpace avatar Nov 04 '25 03:11 westonpace