lance overwrite and append mode behavior

Transcribing/Paraphrasing feedback from a separate venue.

When using lance.write_dataset it's not obvious whether the dataset already exists. The current semantics are:

create - if it exists then raises an error, otherwise writes new dataset folder
overwrite - if it exists then it writes a new version, otherwise raises error
append - if it exists then it writes a new version, otherwise raises error

A user reported that "if the file does not exist, I would expect the function to create it, but instead, I see an error that the manifest doesn't exist".

An additional suggestion was "For append, I'd say issue a warning then make a new one"
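Until the semantics change, a caller can work around the errors by picking the mode before writing. A minimal sketch, assuming the uri is a local filesystem path and that an existing directory means an existing dataset; the helper name `choose_mode` is hypothetical and not part of Lance:

```python
import os


def choose_mode(uri, prefer="append"):
    """Pick a write mode that will not fail on existence grounds.

    Assumption: `uri` is a local path, and a directory at that path
    means the dataset already exists.
    """
    if prefer not in ("append", "overwrite"):
        raise ValueError("prefer must be 'append' or 'overwrite'")
    # Under the semantics listed above, append and overwrite both raise
    # when the dataset is missing, so fall back to create in that case.
    if not os.path.isdir(uri):
        return "create"
    return prefer
```

The returned string would then be passed as the `mode` argument of lance.write_dataset.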

Feb 13 '23 20:02 changhiskhan

Can you use write_dataset to concatenate two Lance files, including merging vector indexes? I've got a few hundred Lance files with ~1M vectors each. It would be great to be able to merge them without having to recompute the index.

Feb 14 '23 00:02 cemoody

Suppose you have dataset1 at uri1 and dataset2 at uri2:

Current state - lance.write_dataset(dataset2.to_table(), uri1, mode="append") allows you to concatenate to dataset1.

Todo:

currently the indices are not merged
need to add support so you don't have to call to_table()

Feb 14 '23 00:02 changhiskhan

Very excited for both of those. Index merging enables large-scale vector search while still building the Lance files on separate nodes / machines / processes.

And, I think, not having to call to_table() means I could look up individual rows across the whole partitioned dataset.... very exciting!

Feb 14 '23 00:02 cemoody

oh but if you have several hundred, we may want to consider a different API for that.

e.g., we could have each process write their lance data and the index to the same destination, and then at the very end, re-write the manifest altogether.

i'll start a separate issue to discuss this

Feb 14 '23 00:02 changhiskhan

Hi - extending the original issue and listing all the cases where raising an error might be reasonable behavior.

Does the following look exhaustive? Note this only considers datasets restricted to append-only updates (no UPSERTs).

Pattern matching on (Write-Mode, IF-file-with-uri-exists, IF-file-schema-matches-new-data-schema):

(create, uri_exists, _) -> raise (counter - None)
(append, !uri_exists, _) -> raise (counter - fopen in Unix creates the file if it doesn't exist)
(append, uri_exists, !schema_match) -> raise (counter - None)
(overwrite, !uri_exists, _) -> raise (counter - databases often have CREATE OR REPLACE TABLE)
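One way to check whether the list is exhaustive is to enumerate every combination of the three inputs and flag the ones matched by the patterns above. This is only a sketch of the four patterns in this comment; the name `listed_as_error` is hypothetical and not part of Lance:

```python
from itertools import product

MODES = ("create", "append", "overwrite")


def listed_as_error(mode, uri_exists, schema_match):
    """Return True if the combination matches one of the four patterns."""
    if mode == "create":
        return uri_exists
    if mode == "append":
        # Raises when the uri is missing, or when it exists but the
        # schema of the new data does not match.
        return (not uri_exists) or (not schema_match)
    if mode == "overwrite":
        return not uri_exists
    raise ValueError(f"unknown mode: {mode}")


# All 3 x 2 x 2 combinations, keyed by (mode, uri_exists, schema_match).
table = {
    combo: listed_as_error(*combo)
    for combo in product(MODES, (True, False), (True, False))
}

for combo, is_error in table.items():
    print(combo, "-> raise" if is_error else "-> ok")
```

Note that schema_match is ignored whenever the uri does not exist, which corresponds to the `_` wildcard in the patterns.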

Feb 23 '23 16:02 ananis25

I think what makes sense would be the following:

create

Create dataset if uri does not exist
Error if uri exists

append

New version if uri exists
Create new dataset if uri does not exist
Error if uri exists but schema does not match

overwrite

New version if uri exists
Create new dataset if uri does not exist

This should be done in Rust so that the behavior is consistent across languages.
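The proposed rules can be written down as a small decision function. The sketch below is Python purely for illustration and is not the Rust implementation; the function name `resolve_write`, the returned action strings, and the exception types are all assumptions:

```python
def resolve_write(mode, uri_exists, schema_matches=True):
    """Map (mode, uri_exists, schema_matches) to an action per the proposal.

    Returns "create_dataset" or "new_version", or raises on the two
    error cases. `schema_matches` only matters for append on an
    existing uri.
    """
    if mode == "create":
        if uri_exists:
            raise FileExistsError("dataset already exists")
        return "create_dataset"
    if mode == "append":
        if not uri_exists:
            return "create_dataset"
        if not schema_matches:
            raise ValueError("schema of new data does not match dataset")
        return "new_version"
    if mode == "overwrite":
        return "new_version" if uri_exists else "create_dataset"
    raise ValueError(f"unknown mode: {mode}")
```

Under this proposal only two combinations remain errors: create on an existing uri, and append with a mismatched schema.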

Mar 15 '23 18:03 changhiskhan

If we append dataset_new to dataset_v0, the resulting dataset_v1 should be dataset_v0 union dataset_new, right? Does our current implementation check the schema before the union?
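To make the question concrete, here is a toy in-memory model of the expected behavior. It does not reflect how Lance stores versions and says nothing about whether the current implementation performs the check; the class `ToyDataset` and its methods are hypothetical:

```python
class ToyDataset:
    """Toy model: each version is the full list of rows at that point."""

    def __init__(self, schema, rows):
        self.schema = dict(schema)    # column name -> type name
        self.versions = [list(rows)]  # index 0 is dataset_v0

    def append(self, schema, rows):
        """Add a new version equal to the latest rows plus `rows`."""
        # Schema check happens before the union, so a mismatch leaves
        # the existing versions untouched.
        if dict(schema) != self.schema:
            raise ValueError("schema mismatch")
        self.versions.append(self.versions[-1] + list(rows))
        return len(self.versions) - 1  # new version number

    def read(self, version=-1):
        """Return a copy of the rows at `version` (latest by default)."""
        return list(self.versions[version])


schema = {"id": "int64", "text": "string"}
ds = ToyDataset(schema, [(1, "a"), (2, "b")])
v1 = ds.append(schema, [(3, "c")])
```

Here "union" means concatenation of rows, so duplicates are kept, and dataset_v0 stays readable after the append.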

Mar 15 '23 23:03 Renkai

closed via #690

Jul 02 '23 22:07 changhiskhan
