Bug: Merge insert can fail due to referencing invalid fragment IDs
I am periodically seeing errors like this with my call to merge_insert:
```
Caused by:
    0: LanceError(IO): Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset, /Users/ogchen/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lance-0.38.2/src/dataset/write/merge_insert.rs:851:46
    1: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset
    2: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset
```
My app is currently batching up rows every 10 seconds, calling merge_insert based on an ID column, and then calling optimize afterwards.
I have noticed that setting use_index to false on merge_insert fixes this problem, so I suspect that the index on the ID column can sometimes be corrupted?
I have also found that the order of compaction vs. index optimization after the merge_insert call matters: I only see the issue with compaction -> index optimize, but not with index optimize -> compaction.
Note: This is with lancedb version 0.22.2
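For reference, here is a minimal sketch of the two orderings using lance's Python dataset API (an approximation of my call sequence, not my exact code, since my app goes through lancedb):

```python
import lance

ds = lance.dataset("./my_table.lance")  # hypothetical dataset path

# Ordering that triggers the error on the next merge_insert:
ds.optimize.compact_files()      # compaction first,
ds.optimize.optimize_indices()   # then index optimization

# Ordering that works fine:
# ds.optimize.optimize_indices()  # index optimization first,
# ds.optimize.compact_files()     # then compaction
```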
Thanks for reporting this! We should take a look soon.
I’ve been experiencing the same issue since upgrading from python lancedb version 0.24.3 to version 0.25.3.
When this occurs, it appears that recreating the index is the only way to restore functionality.
Could you share more information? That might help us reproduce and find a fix.
What kind of index does this happen with? On what data type? And any indication of what operations happen between a good index state and an error like this?
@wjones127
Here is the additional information regarding our setup and the issue:
Index and Data Type
- Index Type: BTree
- Column/Data Type: We are merge-inserting on the id column (type: String); it's a hash of the chunk content. The other columns are basically content (string), embedding (4096-dim vector), and a few additional string columns used only for filtering.
Workflow
We perform incremental ingestion. For every new index job, we collect all the chunks, run a merge_insert followed by an optimize. The workflow looks like this:
```python
from datetime import timedelta
from lancedb.index import BTree

# 1. Ensure index exists
if not await _has_index_type(vdb, "id", "BTree"):
    await vdb.create_index("id", config=BTree())

# 2. Upsert logic
await (
    vdb.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(payload, on_bad_vectors="drop")
)

# 3. Optimization
await vdb.optimize(cleanup_older_than=timedelta(days=1))
```
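For context, `_has_index_type` is our own helper. Roughly, assuming lancedb's async `list_indices()` API, it looks like this (a sketch, not our exact code):

```python
async def _has_index_type(table, column: str, index_type: str) -> bool:
    # list_indices() returns the table's index configurations; match on
    # both the indexed column and the index type (e.g. "BTree").
    for idx in await table.list_indices():
        if column in idx.columns and idx.index_type == index_type:
            return True
    return False
```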
Environment & Context
- Versions: The issue started appearing after upgrading lancedb from 0.24.3 to 0.25.3.
- Scale: We have multiple folders of different lancedb datasets, but this seems to affect only the folders larger than ~500 MB on disk.
Reproduction & Observations
I haven't been able to reproduce this from scratch locally, but I have analyzed a copy of the corrupted dataset:
- Partial Corruption: Not all incoming chunks trigger the error. Some specific chunks will fail, but if I modify those chunks slightly (random modifications), they succeed.
- Resolution: If I recreate the index on the broken dataset (see the sketch below), the issue resolves immediately, and ingestion works normally again.
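For completeness, the recreation is just a rebuild of the scalar index; a sketch with lance's Python API (the path is hypothetical):

```python
import lance

ds = lance.dataset("/path/to/broken_dataset.lance")  # hypothetical path
# Rebuilding the BTree index from scratch restores a consistent index;
# replace=True overwrites the existing one.
ds.create_scalar_index("id", "BTREE", replace=True)
```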
I also saw the same issue. I suspected data corruption crept in somehow. I had to restore the data to a previous version to fix the error.
If this helps, here is a small dummy dataset that I was able to corrupt locally, and which I think is related to this error: https://we.tl/t-DIU9m36Ixb
```python
import time

import lancedb
import pyarrow as pa

db = lancedb.connect("/Users/ogchen/Documents")
tbl = db.open_table("snapshot_b7deb25f-29ee-49b5-a122-e97ecd64e91a")

schema = pa.schema(
    [
        pa.field("id", pa.string(), nullable=False),
        pa.field("file_details__content_type", pa.string()),
        pa.field("file_details__size", pa.int64()),
        pa.field("file_details__last_modified", pa.timestamp("ms")),
        pa.field("file_details__extension", pa.string()),
        pa.field("file_details__path", pa.string()),
        pa.field("file_details__type", pa.string()),
    ]
)

new_data = pa.table(
    {
        "id": ["fil_01KAH4XFV0E9SSE27AVAGN5E68"],
        "file_details__content_type": ["text/plain"],
        "file_details__size": [100],
        # timestamp("ms") expects epoch milliseconds, not a float of seconds
        "file_details__last_modified": [int(time.time() * 1000)],
        "file_details__extension": ["txt"],
        "file_details__path": ["blablabla.txt"],
        "file_details__type": ["text"],
    },
    schema=schema,
)

print(
    tbl.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(new_data)
)
```
Running this script, I see this error:
```
RuntimeError: lance error: Query Execution error: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset, /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/core/src/task/poll.rs:290:44
```
Thanks for the repro, this is very helpful!
So the problem I see is that the fragment_bitmap on the id_idx is {2}, which doesn't contain fragment zero. When I introspect the index, I can see row ids from fragment zero:
```python
>>> row_ids = LanceFileReader(data_path).read_all().to_table()['ids'].to_pylist()
>>> row_ids
[6, 8589934598, 3, 8589934595, 10, 8589934602, 4, 8589934596, 8, 8589934600, 5, 8589934597, 0, 8589934592, 1, 8589934593, 7, 8589934599, 2, 8589934594, 8589934613, 9, 8589934601, 8589934606, 8589934605, 8589934603, 8589934610, 8589934611, 8589934614, 8589934604, 8589934612, 8589934608, 8589934609, 8589934607]
>>> frag_ids = [id >> 32 for id in row_ids]
>>> frag_ids
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```
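(These ids are lance row addresses: the fragment id lives in the upper 32 bits and the row offset within the fragment in the lower 32 bits, so the full decode would be:)

```python
# Split each 64-bit row address into (fragment_id, row_offset).
addrs = [(rid >> 32, rid & 0xFFFFFFFF) for rid in row_ids]
```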
We are supposed to use that fragment bitmap to mask out any deleted fragments. However, since fragment 0 is not present in the bitmap, the code doesn't know to mask those rows out, and we get that error:
https://github.com/lance-format/lance/blob/abce5a5a7c05d515c3b0edf6add3f95518390ecc/rust/lance/src/io/exec/scalar_index.rs#L361
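In rough pseudocode, the invariant the masking relies on looks like this (a Python sketch of the idea, not the actual Rust implementation):

```python
def mask_index_results(row_ids, fragment_bitmap, live_fragments):
    # fragment_bitmap is supposed to cover every fragment the index
    # contains. Rows from covered fragments that no longer exist in the
    # dataset must be filtered out before the take.
    dead = fragment_bitmap - live_fragments
    kept = [rid for rid in row_ids if (rid >> 32) not in dead]
    # If the bitmap is missing a fragment the index actually contains
    # (here: fragment 0), those rows are never masked, and the take
    # fails with "fragment does not exist".
    return kept
```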
So the question is: what happened to that fragment id? And also, what should we do when we encounter corrupted fragment_bitmaps like this?
Okay, here is a reproduction. The key is to:
1. Upsert with the columns out of order.
2. Compact.
3. Optimize indices.
4. Try to upsert again, with a key that would have matched the original data updated in step 1.
```python
import lance
import pyarrow as pa

data = pa.table({'id': [1], 'value': ['a']})
ds = lance.write_dataset(data, 'memory://', max_rows_per_file=1)
ds.create_scalar_index('id', 'BTREE')

# Upsert with the columns in reversed order
data_reversed = pa.table({'value': ['b', 'a'], 'id': [2, 1]})
(
    ds.merge_insert(on='id')
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(data_reversed)
)

ds.optimize.compact_files()
ds.optimize.optimize_indices()

# Upsert again with a key that matched the data updated above
(
    ds.merge_insert(on='id')
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(data)
)
```
```
OSError: Query Execution error: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset, /Users/willjones/.rustup/toolchains/1.90.0-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/task/poll.rs:290:44
```
I'm going to open an issue upstream with this repro.