lance
Panic when creating empty table
Hello!
I've been running into some issues while creating a table from an empty record batch:
let schema = Arc::new(Schema::new(vec![
    Field::new(
        "vector",
        DataType::FixedSizeList(
            Arc::new(Field::new("item", DataType::Float32, true)),
            768,
        ),
        false,
    ),
    Field::new("content", DataType::Utf8, false),
    Field::new("file_url", DataType::Utf8, false),
    Field::new("start_line_no", DataType::UInt32, false),
    Field::new("end_line_no", DataType::UInt32, false),
]));

let batch = RecordBatch::try_new(
    schema.clone(),
    vec![
        Arc::new(FixedSizeListBuilder::new(Float32Builder::new(), 768).finish()),
        Arc::new(StringArray::from(Vec::<&str>::new())),
        Arc::new(StringArray::from(Vec::<&str>::new())),
        Arc::new(UInt32Array::from(Vec::<u32>::new())),
        Arc::new(UInt32Array::from(Vec::<u32>::new())),
    ],
)
.expect("failure while defining schema");

let tbl = db
    .create_table(
        "code-slices",
        Box::new(RecordBatchIterator::new(vec![batch].into_iter().map(Ok), schema)),
        None,
    )
    .await
    .expect("failed to create table");

tbl.create_index(&["vector"])
    .ivf_pq()
    .num_partitions(256)
    .build()
    .await
    .expect("failed to create index");
The statistics collector's finish() method panics when attempting to create a StructArray, because "Found unmasked nulls for non-nullable StructArray field min_value".
If I instead remove batch and pass an empty vec![] to RecordBatchIterator::new, the table is created but index creation fails with an error saying it "can not train 256 centroids with 0 vectors".
Is there a way to initialise an empty table w/ an index with the Rust client?
Let me know if this isn't the right place to report this issue and I'll move it to the appropriate place.
There is no way to create an index without data. The index is IVF_PQ, which requires data to train its k-means clustering.
What is your use case? For example, how many vectors do you have, what recall do you need, and how frequently do you make updates? We can try to give you some suggestions.
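To illustrate why training needs data, here is a toy k-means sketch in plain Rust. It is not Lance's implementation: `train_centroids`, `TrainError`, and the seeding strategy (take the first `k` vectors) are all assumptions made up for this example, and the error text only mirrors the message quoted above. The point is simply that each of the `k` centroids has to be seeded from and averaged over real vectors, so zero rows cannot produce 256 partitions.

```rust
/// Hypothetical error type for this sketch (not a Lance type).
#[derive(Debug, PartialEq)]
pub struct TrainError(pub String);

/// Squared Euclidean distance between two equal-length vectors.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Toy Lloyd's k-means: seed the centroids from the first `k` vectors, then
/// alternate nearest-centroid assignment and mean updates for `iters` rounds.
pub fn train_centroids(
    data: &[Vec<f32>],
    k: usize,
    iters: usize,
) -> Result<Vec<Vec<f32>>, TrainError> {
    // Without at least `k` vectors there is nothing to seed the centroids with.
    if k == 0 || data.len() < k {
        return Err(TrainError(format!(
            "can not train {} centroids with {} vectors",
            k,
            data.len()
        )));
    }
    let dim = data[0].len();
    let mut centroids: Vec<Vec<f32>> = data[..k].to_vec();
    for _ in 0..iters {
        let mut sums = vec![vec![0.0f32; dim]; k];
        let mut counts = vec![0usize; k];
        for v in data {
            // Assignment step: find the nearest centroid for this vector.
            let mut best = 0;
            let mut best_d = f32::INFINITY;
            for (i, c) in centroids.iter().enumerate() {
                let d = sq_dist(v, c);
                if d < best_d {
                    best_d = d;
                    best = i;
                }
            }
            counts[best] += 1;
            for (s, x) in sums[best].iter_mut().zip(v) {
                *s += x;
            }
        }
        // Update step: move each non-empty centroid to the mean of its members.
        for i in 0..k {
            if counts[i] > 0 {
                for (c, s) in centroids[i].iter_mut().zip(&sums[i]) {
                    *c = s / counts[i] as f32;
                }
            }
        }
    }
    Ok(centroids)
}
```

Calling `train_centroids` with an empty slice and `k = 256` returns the error immediately, which is the same shape of failure as building the index on an empty table.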
There is no way to create index without data. The index is IVF_PQ, which requires data to train a k-means clustering.
That makes sense. I was expecting it to behave like an index in a traditional database, where you declare it on a field and it is updated as you insert data.
I'm currently adding repository context to my language server. I want to enhance the prompt I send to a model with bits of relevant context from the user's codebase. This means that there will be a first pass to initialize the database the first time a user opens a project & then on each file update (every time the user types) I'll also need to update the embeddings. Here is a link to the PR if you're curious, I've been exploring using my own very simplified vector store as well :)
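For the workflow described here (one initial pass, then re-embedding a file on every edit), a very simplified store could look like the sketch below. This is only an illustration under stated assumptions, not the code from the PR and not the LanceDB API: `SliceStore`, `Slice`, `upsert_file`, `can_train`, and `search` are hypothetical names, the fields are borrowed from the schema in the original question, and search is an exact brute-force scan using cosine similarity. `can_train` shows one possible way to decide when there are enough vectors to make building an IVF_PQ index with a given number of partitions possible at all.

```rust
use std::collections::HashMap;

/// One embedded slice of a source file (fields follow the schema in the question).
#[derive(Debug, Clone, PartialEq)]
pub struct Slice {
    pub vector: Vec<f32>,
    pub content: String,
    pub start_line_no: u32,
    pub end_line_no: u32,
}

/// Hypothetical in-memory store keyed by file URL.
#[derive(Default)]
pub struct SliceStore {
    by_file: HashMap<String, Vec<Slice>>,
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

impl SliceStore {
    pub fn new() -> Self {
        Self::default()
    }

    /// Replace all slices of `file_url`; an empty list removes the file.
    /// Intended for both the initial pass and every later edit.
    pub fn upsert_file(&mut self, file_url: &str, slices: Vec<Slice>) {
        if slices.is_empty() {
            self.by_file.remove(file_url);
        } else {
            self.by_file.insert(file_url.to_string(), slices);
        }
    }

    /// Total number of stored vectors across all files.
    pub fn len(&self) -> usize {
        self.by_file.values().map(Vec::len).sum()
    }

    /// True once there is at least one vector per requested partition.
    pub fn can_train(&self, num_partitions: usize) -> bool {
        num_partitions > 0 && self.len() >= num_partitions
    }

    /// Exact top-`k` search by cosine similarity, best match first.
    pub fn search(&self, query: &[f32], k: usize) -> Vec<(&str, &Slice, f32)> {
        let mut hits: Vec<(&str, &Slice, f32)> = self
            .by_file
            .iter()
            .flat_map(|(url, slices)| {
                slices
                    .iter()
                    .map(move |s| (url.as_str(), s, cosine(query, &s.vector)))
            })
            .collect();
        hits.sort_by(|a, b| b.2.total_cmp(&a.2));
        hits.truncate(k);
        hits
    }
}
```

A brute-force scan like this costs one similarity computation per stored slice on every query, so it needs no training step and works from the first insert. An approximate index only becomes an option once `can_train` holds for the chosen partition count.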