lance
overwrite and append mode behavior
Transcribing/Paraphrasing feedback from a separate venue.
When using lance.write_dataset, it's not obvious how each mode behaves depending on whether the dataset already exists. The current semantics are:
create - if it exists then raises an error, otherwise writes new dataset folder
overwrite - if it exists then it writes a new version, otherwise raises error
append - if it exists then it writes a new version, otherwise raises error
A user reported that "if the file does not exist, I would expect the function to create it, but instead, I see an error that the manifest doesn't exist".
An additional suggestion was "For append, I'd say issue a warning then make a new one"
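Until the semantics change, one possible workaround is to choose the mode before calling write_dataset. The helper below is only a sketch: pick_write_mode is a hypothetical name, and it assumes the uri is a local filesystem path (an object store would need its own existence check).

```python
import os


def pick_write_mode(uri: str) -> str:
    """Return "append" if something already exists at uri, otherwise "create".

    Assumption: uri is a local path, so os.path.exists is a valid check.
    """
    return "append" if os.path.exists(uri) else "create"
```

The result could then be passed as the mode argument, e.g. lance.write_dataset(table, uri, mode=pick_write_mode(uri)).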
Can you use write_dataset to concatenate two Lance files, including merging vector indexes? I've got a few hundred Lance files with ~1M vectors each. It would be great to be able to merge them without having to recompute the index.
Suppose you have dataset1 at uri1 and dataset2 at uri2:
Current state - lance.write_dataset(dataset2.to_table(), uri1, mode="append") allows you to concatenate to dataset1.
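Spelled out as a small function, the call above might look like the following sketch. append_dataset is a hypothetical helper name; it relies only on lance.dataset, to_table, and lance.write_dataset, and it loads the whole source dataset into memory.

```python
def append_dataset(src_uri: str, dst_uri: str) -> None:
    """Append every row of the dataset at src_uri to the dataset at dst_uri."""
    import lance  # imported here so the helper can be defined without lance installed

    source = lance.dataset(src_uri)
    # to_table() materializes the full source dataset in memory as an Arrow table.
    lance.write_dataset(source.to_table(), dst_uri, mode="append")
```

With the names used above this would be append_dataset(uri2, uri1). Indices on the source dataset are not carried over.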
Todo:
- currently the indices are not merged
- need to add support so you don't have to call to_table()
very excited for both of those. The index merging enables large-scale vector search while still building the lance files on separate nodes / machines / processes.
And, I think, not having to call to_table() means I could look up individual rows across the whole partitioned dataset.... very exciting!
oh but if you have several hundred, we may want to consider a different API for that.
e.g., we could have each process write their lance data and the index to the same destination, and then at the very end, re-write the manifest altogether.
i'll start a separate issue to discuss this
Hi - extending the original issue and listing all the cases where raising an error might be reasonable behavior.
Does the following look exhaustive? Note this only considers datasets restricted to append-only updates (no UPSERTs).
Pattern matching on (Write-Mode, IF-file-with-uri-exists, IF-file-schema-matches-new-data-schema):
- (create, uri_exists, _) -> raise (counter - None)
- (append, !uri_exists, _) -> raise (counter - fopen in Unix creates the file if it doesn't exist)
- (append, uri_exists, !schema_match) -> raise (counter - None)
- (overwrite, !uri_exists, _) -> raise (counter - databases often have CREATE OR REPLACE TABLE)
I think what makes sense would be the following:
create
- Create dataset if uri does not exist
- Error if uri exists
append
- New version if uri exists
- Create new dataset if uri does not exist
- Error if uri exists but schema does not match
overwrite
- New version if uri exists
- Create new dataset if uri does not exist
This should be done in Rust so the behavior is consistent across languages.
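To make the proposal concrete, here is a rough Python sketch of the decision table. It only illustrates the proposed rules and is not the actual implementation (which, as noted, would live in Rust). resolve_write, its parameters, and the returned strings are all hypothetical names.

```python
def resolve_write(mode: str, uri_exists: bool, schema_matches: bool = True) -> str:
    """Return the action the proposed semantics would take.

    Returns "create_dataset" or "new_version"; raises ValueError for the
    cases the proposal treats as errors.
    """
    if mode not in ("create", "append", "overwrite"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not uri_exists:
        # Every mode creates a fresh dataset when nothing exists at the uri.
        return "create_dataset"
    if mode == "create":
        raise ValueError("dataset already exists at uri")
    if mode == "append" and not schema_matches:
        raise ValueError("schema of new data does not match the existing dataset")
    # append with a matching schema, or overwrite, writes a new version.
    return "new_version"
```

For example, resolve_write("append", uri_exists=False) returns "create_dataset", while resolve_write("create", uri_exists=True) raises.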
If we append dataset_new to dataset_v0 with write_dataset, the resulting dataset_v1 should be the union of dataset_v0 and dataset_new, right? Does our current implementation check the schema before the union?
closed via #690