lance icon indicating copy to clipboard operation
lance copied to clipboard

Add version tags

Open ananis25 opened this issue 3 years ago • 7 comments

Adds the ability to tag a Lance dataset and retrieve it from python. Per issue #588

Notes

  • This assumes different versions of the dataset can be given the same tag. And at fetch-time, we just return the latest one. Is this reasonable semantics?
  • Referencing the linked issue - is the version metadat (VersionAuxData in protobuf?) something that the users pass themselves, or something Lance creates? This PR just takes the tag as input from the user.
  • Introducing keyword-only arguments in the python api. This PR makes asof and tag as keyword-only args in the dataset read method. Might also want to do it for the write_dataset method for all args following the writing mode? It is getting crowded.

ananis25 avatar Feb 18 '23 19:02 ananis25

Hmmm I guess if users are more familiar with git syntax, tag should be unique, which might be hard to enforce if it's only a str specific to the version.

changhiskhan avatar Feb 18 '23 20:02 changhiskhan

Tag should be provided by the user. VersionAuxData is more arbitrary metadata (say for lineage)

changhiskhan avatar Feb 18 '23 20:02 changhiskhan

asof / tag - we could consider adding these as instance methods to reduce clutter. It'll cause an extra metadata read but at the dataset level it doesn't seem like it would be a problem

changhiskhan avatar Feb 18 '23 21:02 changhiskhan

@ananis25 @eddyxu wdyt?

changhiskhan avatar Feb 18 '23 21:02 changhiskhan

asof / tag - we could consider adding these as instance methods to reduce clutter.

That sounds good, I will make the change.

Hmmm I guess if users are more familiar with git syntax, tag should be unique, which might be hard to enforce if it's only a str specific to the version.

I was expecting the use-case to be - a user making tags like release or production, say at the end of each week. Untagging the older version and giving it to the latest one would be a destructive operation.

ananis25 avatar Feb 19 '23 12:02 ananis25

Hmmm I guess if users are more familiar with git syntax, tag should be unique, which might be hard to enforce if it's only a str specific to the version.

I was expecting the use-case to be - a user making tags like release or production, say at the end of each week. Untagging the older version and giving it to the latest one would be a destructive operation.

Oh interesting. Yes that's another use case for tagging. One problem is that this breaks the immutability of the dataset version. It might be that we need to redesign the protos so tags is a separate metadata file that is version-free. This would:

  1. Allow us to look the version for a given tag with 1 extra io instead reading all versions
  2. Allow us to change tags without making versions mutable.

@ananis25 @eddyxu thoughts?

changhiskhan avatar Feb 20 '23 01:02 changhiskhan

One problem is that this breaks the immutability of the dataset version.

Oh, I hadn't thought of it. Do you mean the two queries below will return different datasets?

  • ds = Dataset.with_tag("release") run on Jan1
  • ds = Dataset.with_tag("release") run on Feb1

Hmm, that could be an issue. One might expect that pinning tags or version ids in your code is equivalent but only the latter guarantees reproducibility.

It might be that we need to redesign the protos so tags is a separate metadata file that is version-free.

This sounds good if it doesn't break existing lance datasets. Another benefit would come from that tagging is often done post-hoc. For ex, when writing multiple batches of data, supplying it along with the write_dataset call would involve checking if it is indeed the last iteration.

for batch in all_batches:
  if last_batch():
    lance.write_dataset(batch, uri, tag="release")
  else:
    lance.write_dataset(batch, uri)

If stored separately as metadata in the global manifest, we could simplify the API as,

for batch in all_batches:
  lance.write_dataset(batch, uri)

# same or separate process, once sure all data has been persisted
lance.add_tag(uri, "release")

ananis25 avatar Feb 20 '23 06:02 ananis25

@ananis25 I am gonna to close this cause there was not activity for a while. Feel free to reopen it or a new PR if you'd like to work on it again.

eddyxu avatar Jun 27 '23 20:06 eddyxu