
Bug: Merge insert can fail due to referencing invalid fragment IDs

Open · oscar-humannative opened this issue 2 months ago

I am periodically seeing errors like this with my call to merge_insert:

Caused by:
    0: LanceError(IO): Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset, /Users/ogchen/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lance-0.38.2/src/dataset/write/merge_insert.rs:851:46
    1: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset
    2: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset 

My app is currently batching up rows every 10 seconds, calling merge_insert based on an ID column, and then calling optimize afterwards. I have noticed that setting use_index to false on merge_insert works around the problem, so I suspect the index on the ID column is sometimes getting corrupted. I have also found that the order of compaction vs. index optimization after the call to merge_insert matters: I only see the issue with compaction -> index optimize, and not with index optimize -> compaction.

Note: This is with lancedb version 0.22.2
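
For illustration, here is a minimal sketch of the ordering that avoids the error for me, written against the lance Python dataset API rather than the lancedb table API (so compaction and index optimization can be invoked separately; the data and paths are placeholders, not my actual app code):

import lance
import pyarrow as pa

# Toy dataset with a BTree scalar index on the merge key
data = pa.table({"id": ["row-1"], "value": ["a"]})
ds = lance.write_dataset(data, "memory://")
ds.create_scalar_index("id", "BTREE")

# Periodic upsert batch keyed on "id"
batch = pa.table({"id": ["row-1", "row-2"], "value": ["a2", "b"]})
(
    ds.merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(batch)
)

# This order (index optimize -> compaction) avoids the error for me;
# compact_files() followed by optimize_indices() eventually triggers it.
ds.optimize.optimize_indices()
ds.optimize.compact_files()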

oscar-humannative avatar Oct 28 '25 11:10 oscar-humannative

Thanks for reporting this! We should take a look soon.

wjones127 avatar Oct 30 '25 15:10 wjones127

I’ve been experiencing the same issue since upgrading from python lancedb version 0.24.3 to version 0.25.3.

When this occurs, it appears that recreating the index is the only way to restore functionality.

martin-liu avatar Nov 20 '25 00:11 martin-liu

> I’ve been experiencing the same issue since upgrading from python lancedb version 0.24.3 to version 0.25.3.
>
> When this occurs, it appears that recreating the index is the only way to restore functionality.

Could you share more information? That might help us reproduce and find a fix.

What kind of index does this happen with? On what data type? And any indication of what operations happen between a good index state and an error like this?

wjones127 avatar Nov 20 '25 16:11 wjones127

@wjones127

Here is the additional information regarding our setup and the issue:

Index and Data Type

  • Index Type: BTree
  • Column/Data Type: We merge-insert on the id column (type: String); it is a hash of the chunk content.

The other columns are content (string), embedding (a 4096-dimensional vector), and a few additional string columns used only for filtering.
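
For context, here is a rough pyarrow sketch of the schema; only the id column name is exact, the remaining names are stand-ins for our real columns:

import pyarrow as pa

# Approximate schema (illustrative column names, except "id")
schema = pa.schema(
    [
        pa.field("id", pa.string(), nullable=False),           # hash of the chunk content
        pa.field("content", pa.string()),                      # chunk text
        pa.field("embedding", pa.list_(pa.float32(), 4096)),   # fixed-size 4096-dim vector
        pa.field("source", pa.string()),                       # example filter column
        pa.field("doc_type", pa.string()),                     # example filter column
    ]
)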

Workflow

We perform incremental ingestion. For every new index job, we collect all the chunks, then run a merge_insert followed by an optimize. The workflow looks like this:

from datetime import timedelta

from lancedb.index import BTree

# 1. Ensure a BTree index exists on the id column
# (_has_index_type is our own helper that inspects the table's index list)
if not _has_index_type(vdb, "id", "BTree"):
    await vdb.create_index("id", config=BTree())

# 2. Upsert logic
await (
    vdb.merge_insert("id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(payload, on_bad_vectors="drop")
)

# 3. Optimization
await vdb.optimize(cleanup_older_than=timedelta(days=1))

Environment & Context

  • Versions: The issue started appearing after upgrading lancedb from 0.24.3 to 0.25.3.
  • Scale: We have multiple folders, each containing a different LanceDB dataset, but the issue seems to affect only the folders larger than about 500 MB on disk.

Reproduction & Observations

I haven't been able to reproduce this from scratch locally, but I have analyzed a copy of the corrupted dataset:

  1. Partial Corruption: Not all incoming chunks trigger the error. Some specific chunks will fail, but if I modify those chunks slightly (random modifications), they succeed.
  2. Resolution: If I recreate the index on the broken dataset (see the sketch below), the issue resolves immediately, and ingestion works normally again.
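
For reference, the recovery step is roughly the sketch below, written against the async lancedb table API from the workflow above (the replace flag is my assumption about the create_index signature; if it is not supported, dropping and recreating the index achieves the same thing):

from lancedb.index import BTree

# Rebuild the BTree index on the id column from scratch.
# Assumption: create_index(..., replace=True) overwrites the existing index.
await vdb.create_index("id", config=BTree(), replace=True)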

martin-liu avatar Nov 21 '25 00:11 martin-liu

I also saw the same issue. I suspect data corruption crept in somehow. I had to restore the data to a previous version to fix the error.

manhld0206 avatar Nov 21 '25 02:11 manhld0206

If this helps, here is a small dummy dataset that I was able to corrupt locally, which I think is related to this error: https://we.tl/t-DIU9m36Ixb

  import time

  import lancedb
  import pyarrow as pa

  db = lancedb.connect(
      "/Users/ogchen/Documents",
  )

  # Open the corrupted snapshot table from the shared dataset
  tbl = db.open_table("snapshot_b7deb25f-29ee-49b5-a122-e97ecd64e91a")
  schema = pa.schema(
      [
          pa.field("id", pa.string(), nullable=False),
          pa.field("file_details__content_type", pa.string()),
          pa.field("file_details__size", pa.int64()),
          pa.field("file_details__last_modified", pa.timestamp("ms")),
          pa.field("file_details__extension", pa.string()),
          pa.field("file_details__path", pa.string()),
          pa.field("file_details__type", pa.string()),
      ]
  )

  new_data = pa.table(
      {
          "id": ["fil_01KAH4XFV0E9SSE27AVAGN5E68"],
          "file_details__content_type": ["text/plain"],
          "file_details__size": [100],
          "file_details__last_modified": [time.time()],
          "file_details__extension": ["txt"],
          "file_details__path": ["blablabla.txt"],
          "file_details__type": ["text"],
      },
      schema=schema,
  )
  print(
      tbl.merge_insert("id")
      .when_matched_update_all()
      .when_not_matched_insert_all()
      .execute(new_data)
  )

Running this script, I see this error:

RuntimeError: lance error: Query Execution error: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset, /rustc/1159e78c4747b02ef996e55082b704c09b970588/library/core/src/task/poll.rs:290:44

oscar-humannative avatar Nov 21 '25 08:11 oscar-humannative

Thanks for the repro, this is very helpful!

So the problem I see is that the fragment_bitmap on the id_idx is {2}, which doesn't contain fragment zero. When I introspect the index, I can see row ids from fragment zero:

>>> row_ids = LanceFileReader(data_path).read_all().to_table()['ids'].to_pylist()
>>> row_ids
[6, 8589934598, 3, 8589934595, 10, 8589934602, 4, 8589934596, 8, 8589934600, 5, 8589934597, 0, 8589934592, 1, 8589934593, 7, 8589934599, 2, 8589934594, 8589934613, 9, 8589934601, 8589934606, 8589934605, 8589934603, 8589934610, 8589934611, 8589934614, 8589934604, 8589934612, 8589934608, 8589934609, 8589934607]
>>> frag_ids = [id >> 32 for id in row_ids]
>>> frag_ids
[0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

We are supposed to use that fragment bitmap to mask out any deleted fragments. However, since fragment 0 is missing from the bitmap, the scan doesn't know to mask out its row ids and we get that error:

https://github.com/lance-format/lance/blob/abce5a5a7c05d515c3b0edf6add3f95518390ecc/rust/lance/src/io/exec/scalar_index.rs#L361

So the question is: what happened to that fragment id? And also, what should we do when we encounter corrupted fragment_bitmaps like this?
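
To illustrate what that masking is supposed to do, here is a rough, hypothetical Python sketch of the idea behind the Rust code linked above (mask_deleted and live_fragments are made-up names, not the actual implementation):

# Stable row ids encode the fragment id in the upper 32 bits and the
# row offset within that fragment in the lower 32 bits.
row_ids = [6, 8589934598, 3, 8589934595]  # sample values from the index above
fragment_bitmap = {2}                     # what the corrupted id_idx records
live_fragments = {2}                      # fragments currently in the dataset

def mask_deleted(row_ids, fragment_bitmap, live_fragments):
    # Fragments the index covered that are no longer in the dataset
    # (e.g. rewritten by compaction) must have their row ids dropped
    # before the take operation.
    deleted = fragment_bitmap - live_fragments
    return [rid for rid in row_ids if (rid >> 32) not in deleted]

# Because fragment 0 was never recorded in the bitmap, it never shows up in
# `deleted`, its row ids survive the mask, and the take fails with
# "fragment id 0 ... does not exist in the dataset".
print(mask_deleted(row_ids, fragment_bitmap, live_fragments))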

wjones127 avatar Nov 21 '25 23:11 wjones127

Okay, here is a reproduction. The key is to:

  1. Upsert with columns out of order
  2. Compact
  3. Optimize indices
  4. Try to upsert again, with a key that would have matched the original data that was updated in step (1).

import lance
import pyarrow as pa

data = pa.table({'id': [1], 'value': ['a']})
ds = lance.write_dataset(data, 'memory://', max_rows_per_file=1)
ds.create_scalar_index('id', 'BTREE')

# 1. Upsert with the columns out of order relative to the dataset schema
data_reversed = pa.table({'value': ['b', 'a'], 'id': [2, 1]})
(
    ds.merge_insert(on='id')
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(data_reversed)
)

# 2. Compact, then 3. optimize indices (this order matters)
ds.optimize.compact_files()
ds.optimize.optimize_indices()

# 4. Upsert again with a key that matched the data updated in step 1
(
    ds.merge_insert(on='id')
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(data)
)

OSError: Query Execution error: Execution error: The input to a take operation specified fragment id 0 but this fragment does not exist in the dataset, /Users/willjones/.rustup/toolchains/1.90.0-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/task/poll.rs:290:44

I'm going to open an issue upstream with this repro.

wjones127 avatar Nov 22 '25 00:11 wjones127