Offset overflow errors can be confusing for users
When using binary or string columns, a single batch of data cannot contain more than 2GiB of data. Users either need to use the `large_binary` / `large_string` types or make sure to set a custom batch size when reading this data.
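To make the limit concrete: variable-width string/binary arrays store each value's byte offset as a signed 32-bit integer, so one batch tops out at `i32::MAX` bytes (~2GiB). A minimal sketch of picking a safe batch size (the `max_rows_per_batch` helper, the 1/2 safety margin, and the 1 MiB average are all hypothetical, not Lance API):

```rust
// String/binary arrays store value offsets as signed 32-bit integers,
// so the total variable-width bytes in one batch cannot exceed i32::MAX.
const MAX_OFFSET: usize = i32::MAX as usize; // ~2GiB

/// Hypothetical helper: largest batch_size that keeps total variable-width
/// bytes under the 32-bit offset limit, with a 1/2 margin for skewed rows.
fn max_rows_per_batch(avg_value_bytes: usize) -> usize {
    (MAX_OFFSET / 2) / avg_value_bytes
}

fn main() {
    // e.g. rows averaging 1 MiB of binary data each:
    println!("{}", max_rows_per_batch(1024 * 1024)); // 1023
}
```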
However, the error they run into is an "offset overflow" panic, which is both a panic (not great) and very confusing. It is not obvious that the solution is to reduce the batch size:
```
thread 'lance_background_thread' panicked at .../arrow-data-52.2.0/src/transform/utils.rs:42:56:
offset overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at .../rust/lance-encoding/src/decoder.rs:1267:65:
called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(12814), ...)
```
Ideally we should return an `Err` here (not panic), and the message should say something like "Could not create array with more than 2GiB of string/binary data. Please try reducing the batch_size."
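A minimal sketch of what that could look like (the `check_binary_capacity` helper and the up-front size check are hypothetical, not the actual Lance/Arrow code path):

```rust
// ~2GiB: the maximum a 32-bit offset array can address.
const MAX_I32_OFFSET: usize = i32::MAX as usize;

/// Hypothetical check: validate the total byte length before building the
/// array, returning a descriptive Err instead of panicking mid-build.
fn check_binary_capacity(total_bytes: usize) -> Result<(), String> {
    if total_bytes > MAX_I32_OFFSET {
        return Err(format!(
            "Could not create array with more than 2GiB of string/binary \
             data ({} bytes requested). Please try reducing the batch_size.",
            total_bytes
        ));
    }
    Ok(())
}

fn main() {
    assert!(check_binary_capacity(1_000).is_ok());
    assert!(check_binary_capacity(3usize * 1024 * 1024 * 1024).is_err());
    println!("ok"); // ok
}
```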
@klibiadam it looks like you have a very good start on this issue! Would you like me to assign it to you?
As a simpler fix, Arrow has a `try_append_value` API which might help with this.
One place a user can get this is during compaction. Is there a way we can possibly avoid this during compaction?
It's kind of inconvenient, but there is a `batch_size` option in the compaction options. There is also a `LANCE_DEFAULT_BATCH_SIZE` environment variable which changes the default when it is unspecified. Neither is a great solution, but both can help as workarounds.
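For example, the environment-variable workaround could look like this (the value 1024 is hypothetical; tune it to your row sizes):

```shell
# Lower the default batch size via the env var before launching the job,
# so no batch accumulates more than 2GiB of string/binary data.
export LANCE_DEFAULT_BATCH_SIZE=1024
echo "LANCE_DEFAULT_BATCH_SIZE=$LANCE_DEFAULT_BATCH_SIZE"
```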