Add version tags
Adds the ability to tag a Lance dataset and retrieve it from python. Per issue #588
Notes
- This assumes different versions of the dataset can be given the same tag. And at fetch-time, we just return the latest one. Is this reasonable semantics?
- Referencing the linked issue - is the version metadat (VersionAuxData in protobuf?) something that the users pass themselves, or something Lance creates? This PR just takes the tag as input from the user.
- Introducing keyword-only arguments in the python api.
This PR makes
asofandtagas keyword-only args in the dataset read method. Might also want to do it for thewrite_datasetmethod for all args following the writing mode? It is getting crowded.
Hmmm I guess if users are more familiar with git syntax, tag should be unique, which might be hard to enforce if it's only a str specific to the version.
Tag should be provided by the user. VersionAuxData is more arbitrary metadata (say for lineage)
asof / tag - we could consider adding these as instance methods to reduce clutter. It'll cause an extra metadata read but at the dataset level it doesn't seem like it would be a problem
@ananis25 @eddyxu wdyt?
asof / tag - we could consider adding these as instance methods to reduce clutter.
That sounds good, I will make the change.
Hmmm I guess if users are more familiar with git syntax, tag should be unique, which might be hard to enforce if it's only a str specific to the version.
I was expecting the use-case to be - a user making tags like release or production, say at the end of each week. Untagging the older version and giving it to the latest one would be a destructive operation.
Hmmm I guess if users are more familiar with git syntax, tag should be unique, which might be hard to enforce if it's only a str specific to the version.
I was expecting the use-case to be - a user making tags like
releaseorproduction, say at the end of each week. Untagging the older version and giving it to the latest one would be a destructive operation.
Oh interesting. Yes that's another use case for tagging. One problem is that this breaks the immutability of the dataset version. It might be that we need to redesign the protos so tags is a separate metadata file that is version-free. This would:
- Allow us to look the version for a given tag with 1 extra io instead reading all versions
- Allow us to change tags without making versions mutable.
@ananis25 @eddyxu thoughts?
One problem is that this breaks the immutability of the dataset version.
Oh, I hadn't thought of it. Do you mean the two queries below will return different datasets?
-
ds = Dataset.with_tag("release")run on Jan1 -
ds = Dataset.with_tag("release")run on Feb1
Hmm, that could be an issue. One might expect that pinning tags or version ids in your code is equivalent but only the latter guarantees reproducibility.
It might be that we need to redesign the protos so tags is a separate metadata file that is version-free.
This sounds good if it doesn't break existing lance datasets. Another benefit would come from that tagging is often done post-hoc. For ex, when writing multiple batches of data, supplying it along with the write_dataset call would involve checking if it is indeed the last iteration.
for batch in all_batches:
if last_batch():
lance.write_dataset(batch, uri, tag="release")
else:
lance.write_dataset(batch, uri)
If stored separately as metadata in the global manifest, we could simplify the API as,
for batch in all_batches:
lance.write_dataset(batch, uri)
# same or separate process, once sure all data has been persisted
lance.add_tag(uri, "release")
@ananis25 I am gonna to close this cause there was not activity for a while. Feel free to reopen it or a new PR if you'd like to work on it again.