
overwrite and append mode behavior

Open changhiskhan opened this issue 2 years ago • 7 comments

Transcribing/Paraphrasing feedback from a separate venue.

When using lance.write_dataset, it's not obvious what happens depending on whether the dataset already exists. The current semantics are:

create - if it exists then raises an error, otherwise writes new dataset folder
overwrite - if it exists then it writes a new version, otherwise raises error
append - if it exists then it writes a new version, otherwise raises error

A user reported that "if the file does not exist, I would expect the function to create it, but instead, I see an error that the manifest doesn't exist".

An additional suggestion was "For append, I'd say issue a warning then make a new one"
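
Under the current semantics described above, a caller can pick the mode based on whether the dataset already exists. The following is a minimal sketch, assuming a local filesystem path; the helper name `effective_mode` is hypothetical and not part of Lance.

```python
import os


def effective_mode(uri: str, requested: str) -> str:
    """Return the mode to pass to lance.write_dataset for a local path.

    Hypothetical helper (not a Lance API): falls back to "create" when the
    caller asked for "append" or "overwrite" but nothing exists at `uri`.
    """
    if requested not in ("create", "append", "overwrite"):
        raise ValueError(f"unknown mode: {requested!r}")
    if requested != "create" and not os.path.exists(uri):
        return "create"
    return requested


# Usage (requires the lance package and a pyarrow table named `table`):
#   lance.write_dataset(table, uri, mode=effective_mode(uri, "append"))
```

The existence check and the write are not atomic, so this only suits a single writer.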

changhiskhan avatar Feb 13 '23 20:02 changhiskhan

Can you use write_dataset to concatenate two Lance files, including merging vector indexes? I've got a few hundred Lance files with ~1M vectors each. It would be great to be able to merge them without having to recompute the index.

cemoody avatar Feb 14 '23 00:02 cemoody

Suppose you have dataset1 at uri1 and dataset2 at uri2:

Current state - lance.write_dataset(dataset2.to_table(), uri1, mode="append") allows you to concatenate to dataset1.

Todo:

  1. currently the indices are not merged
  2. need to add support so you don't have to call to_table()

changhiskhan avatar Feb 14 '23 00:02 changhiskhan

Very excited for both of those. Index merging enables large-scale vector search while still building the Lance files on separate nodes / machines / processes.

And, I think, not having to call to_table() means I could look up individual rows across the whole partitioned dataset.... very exciting!

cemoody avatar Feb 14 '23 00:02 cemoody

oh but if you have several hundred, we may want to consider a different API for that.

e.g., we could have each process write their lance data and the index to the same destination, and then at the very end, re-write the manifest altogether.

i'll start a separate issue to discuss this

changhiskhan avatar Feb 14 '23 00:02 changhiskhan

Hi - extending the original issue and listing all the cases where raising an error might be reasonable behavior.

Does the following look exhaustive? Note this only considers datasets restricted to append-only updates (no UPSERTs).

Pattern matching on (Write-Mode, IF-file-with-uri-exists, IF-file-schema-matches-new-data-schema):

  • (create, uri_exists, _) -> raise (counter - None)
  • (append, !uri_exists, _) -> raise (counter - fopen in Unix creates the file if it doesn't exist)
  • (append, uri_exists, !schema_match) -> raise (counter - None)
  • (overwrite, !uri_exists, _) -> raise (counter - databases often have CREATE OR REPLACE TABLE)
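
One way to check whether the list is exhaustive is to enumerate every (mode, uri_exists, schema_match) combination and see which ones raise. This is a sketch of the four rules above, not Lance code; the function name `raises_error` is made up for illustration, and combinations not listed are assumed to succeed.

```python
from itertools import product

MODES = ("create", "append", "overwrite")


def raises_error(mode: str, uri_exists: bool, schema_match: bool) -> bool:
    """True if the combination is one of the four error cases listed above."""
    if mode == "create":
        return uri_exists
    if mode == "append":
        # Raises when there is nothing to append to, or the schemas differ.
        return (not uri_exists) or (not schema_match)
    if mode == "overwrite":
        return not uri_exists
    raise ValueError(f"unknown mode: {mode!r}")


# Print the full decision table so unlisted combinations are easy to spot.
for mode, uri_exists, schema_match in product(MODES, (True, False), (True, False)):
    outcome = "raise" if raises_error(mode, uri_exists, schema_match) else "write"
    print(f"{mode:<9} exists={uri_exists!s:<5} schema_match={schema_match!s:<5} -> {outcome}")
```

Note that schema_match carries no meaning when the URI does not exist, so those rows collapse into one case per mode.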

ananis25 avatar Feb 23 '23 16:02 ananis25

I think what makes sense would be the following:

create

  1. Create dataset if uri does not exist
  2. Error if uri exists

append

  1. New version if uri exists
  2. Create new dataset if uri does not exist
  3. Error if uri exists but schema does not match

overwrite

  1. New version if uri exists
  2. Create new dataset if uri does not exist

This should be done in Rust so that the behavior is consistent across languages.
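
The proposed semantics can be written down as a small decision function. This is a Python sketch for illustration only: the name `resolve_write`, the return strings, and the choice of exception types are assumptions, and it says nothing about how the Rust implementation is structured.

```python
def resolve_write(mode: str, uri_exists: bool, schema_match: bool = True) -> str:
    """Return the action for a write under the proposed semantics.

    Returns "create_dataset" or "new_version"; raises for the error cases.
    """
    if mode == "create":
        if uri_exists:
            raise FileExistsError("create: dataset already exists")
        return "create_dataset"
    if mode == "append":
        if not uri_exists:
            return "create_dataset"
        if not schema_match:
            raise ValueError("append: schema does not match existing dataset")
        return "new_version"
    if mode == "overwrite":
        return "new_version" if uri_exists else "create_dataset"
    raise ValueError(f"unknown mode: {mode!r}")
```

With this shape, only two combinations raise: create on an existing URI, and append with a mismatched schema.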

changhiskhan avatar Mar 15 '23 18:03 changhiskhan

If we append dataset_new to dataset_v0 with write_dataset, dataset_v1 should be the union of dataset_v0 and dataset_new, right? Does our current implementation check the schema before the union?
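
The expected behavior in this question can be illustrated with a toy model. It is not Lance code and does not answer whether the current implementation performs the check; `ToyDataset` and its fields are hypothetical, with a schema modeled as a plain dict of column name to type name.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Toy versioned dataset: each version is the full list of rows."""
    schema: dict
    versions: list = field(default_factory=list)

    def append(self, schema: dict, rows: list) -> int:
        """Check the schema, then add a version holding old rows plus new rows."""
        if schema != self.schema:
            raise ValueError("schema mismatch: refusing to append")
        latest = self.versions[-1] if self.versions else []
        self.versions.append(latest + list(rows))
        return len(self.versions) - 1  # index of the new version


schema = {"id": "int64", "text": "string"}
ds = ToyDataset(schema=schema, versions=[[{"id": 1, "text": "a"}]])  # dataset_v0
v1 = ds.append(schema, [{"id": 2, "text": "b"}])                     # dataset_v1
```

In this sketch the append is a concatenation rather than a set union, so duplicate rows are kept, and earlier versions stay readable.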

Renkai avatar Mar 15 '23 23:03 Renkai

closed via #690

changhiskhan avatar Jul 02 '23 22:07 changhiskhan