Update datasets by adding columns (i.e., schema evolution availability and functionality for Lance datasets)
From reading the design docs and motivation, schema evolution by adding columns to existing datasets without overwriting the dataset seems to be the ideal end state (through a DataFile). I think this functionality isn't currently in the Python API, but is it available in Rust at the moment? Is there a rough timeline for when this might be available?
We have some datasets that we plan to add columns to over time, and a solution that automatically versions datasets as columns are added was attractive to us. Essentially, we produce daily snapshots of features that get appended to the dataset, but when new columns are desired, the ability to backfill them and then keep adding those features going forward with minimal effort seemed very appealing. I'm also open to other ideas based on what is currently implemented or what is coming down the pipe soon.
Also, how would file5 look if it were added after the first 3 operations and contained both column a and column b (from the figure in the design doc)? Would it be a Fragment spanning the bottom of both file2 and file4 to include both column a and column b?
Hey thanks for the issue! We had enabled column merging in the previous C++ implementation. We just haven't had time to bring that back to life yet in Rust. We hope to work on it in the next month or so - though it's not a firm timeframe.
What does your use case look like? Would love to chat in more detail about it to help us design the feature better.
In terms of appending new rows after column merge, yes the new fragment would contain the union of columns in the schema.
> We had enabled column merging in the previous C++ implementation. We just haven't had time to bring that back to life yet in Rust. We hope to work on it in the next month or so - though it's not a firm timeframe.

Okay! I saw it mentioned in the docs and in the python package, but only as a blank stub function. Also, searching the codebase I didn't see it in the Rust files, so this makes sense.

> What does your use case look like? Would love to chat in more detail about it to help us design the feature better.

Yes, we can have a chat; this feature is pretty intriguing to me, so I'm also curious what you had in mind. I think simple schema evolution without rolling your own solution is something that differentiates Lance from many other packages I know of.

> In terms of appending new rows after column merge, yes the new fragment would contain the union of columns in the schema.

👍
Want to discuss in our discord? Or if you prefer zoom, send me an email at [email protected]? Look forward to chatting
Want to put a +1 on this thread: I'd love to use Lance but need to be able to add columns.
#815 is the first piece to make this happen. Looking to deliver an e2e solution by EOW.
Hi, @JSpenced. We just shipped a rough MVP of adding columns.
Have some questions for you. Currently we have two flavors of API for "adding columns":
- Dataset merge (left-join) of a pre-computed table, with a key column:
Dataset::merge(a_table, left_on=col, right_on=col)
This requires that the table be precomputed on a single machine, and we do a hash join in memory to add the new columns (see the Python sketch after the second flavor below).
- You can provide a UDF that runs on different machines:
fragments = dataset.get_fragments()
# Distribute the fragments to multiple machines,
# then, on each machine:
def my_udf(batch: RecordBatch) -> RecordBatch:
    # Compute the new column(s) for this batch
    ...
new_fragments = [fragment.update(my_udf) for fragment in fragments]
# Commit the new schema and updated fragments as a new dataset version
dataset.commit(new_schema, new_fragments)
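For the first flavor, here is a minimal sketch of what a call from Python might look like. This is an assumption-laden illustration: the dataset path and column names are hypothetical, and the Python signature is assumed to mirror the `Dataset::merge(a_table, left_on=col, right_on=col)` form above.

```python
import lance
import pyarrow as pa

# Hypothetical dataset path and column names, for illustration only.
dataset = lance.dataset("/data/features.lance")

# Pre-computed table: the join key plus the new column to add.
new_col = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "new_feature": pa.array([0.1, 0.2, 0.3], type=pa.float64()),
})

# Left-join the new column onto the dataset by the key column.
# Conceptually this adds a column and creates a new dataset version,
# rather than rewriting the existing data files.
dataset.merge(new_col, left_on="id", right_on="id")
```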
Which one suits your use case better, @JSpenced? We can improve the API a bit.
This is done in #824, @JSpenced.
@eddyxu Awesome, looking at the test code, it looks like you merge a column onto an existing dataset and it creates a new version. If the column supports nulls and not all rows match on the join, the unmatched rows are filled with nulls; if the column doesn't support nulls, it throws an error.
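A quick hypothetical sketch of that null-fill behavior (same made-up dataset and column names as the earlier sketch; not taken from the actual tests):

```python
import lance
import pyarrow as pa

# Same hypothetical dataset as above; suppose it contains ids 1 through 4,
# but the pre-computed table only covers ids 1 and 2.
dataset = lance.dataset("/data/features.lance")

partial = pa.table({
    "id": pa.array([1, 2], type=pa.int64()),
    "score": pa.array([0.9, 0.7], type=pa.float64()),  # nullable column
})

dataset.merge(partial, left_on="id", right_on="id")
# Rows with ids 3 and 4 end up with null in the new "score" column.
# If "score" were a non-nullable type, the merge would raise instead.
```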
This looks great, but let me test it a bit sometime this week when I get a chance. Will raise an issue with any problems.
I missed your comment above and just saw it now, but for us the first use case is good for now, as we won't be using fragments. Also, the second is definitely good to have for those with bigger data.