Daft
Add Unity Catalog Volume support
It would be great to start building out Volume support from Daft for Unity Catalog. Images and JSON feel like the highest prio to start with.
Right now, Table support looks like this:
import daft
from daft.unity_catalog import UnityCatalog

# connect to UC (endpoint and token are the workspace URL and access token)
unity = UnityCatalog(endpoint, token)
# list the tables in a catalog/schema
print(unity.list_tables("unity.default"))
# ['unity.default.numbers', 'unity.default.marksheet_uniform', 'unity.default.marksheet']
# load table into Daft df
unity_table = unity.load_table("unity.default.numbers")
df = daft.read_delta_lake(unity_table)
It would be cool to be able to do something like this (pseudo-code):
# load volume
unity_volume = unity.load_volume("unity.default.images")
# get refs/urls per image
img_refs = unity_volume.get_references()
df_img = df.with_column("image_refs", img_refs)
# download the bytes behind each ref, yielding null where a download fails
df_img = df_img.with_column("image_bytes", df_img["image_refs"].uc_volume.download(on_error="null"))
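To make the intended semantics a bit more concrete, here is a minimal pure-Python sketch of the two pieces above. It is not Daft or Unity Catalog code: get_references and download are hypothetical stand-ins, the volume is assumed to be reachable as a local directory, and the suffix filter is an assumption for the image case. The only behavior borrowed from the proposal is on_error="null" returning a null instead of raising.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def get_references(
    volume_root: str,
    suffixes: Tuple[str, ...] = (".png", ".jpg", ".jpeg"),
) -> List[str]:
    """Hypothetical stand-in: list image file paths under a volume's root directory."""
    root = Path(volume_root)
    return sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


def download(ref: str, on_error: str = "raise") -> Optional[bytes]:
    """Read the bytes behind one reference.

    With on_error="null", a failed read returns None instead of raising.
    """
    if on_error not in ("raise", "null"):
        raise ValueError(f"on_error must be 'raise' or 'null', got {on_error!r}")
    try:
        return Path(ref).read_bytes()
    except OSError:
        if on_error == "null":
            return None
        raise
```

Calling download(ref, on_error="null") for every entry returned by get_references would give one bytes-or-None value per image, which is the shape the image_bytes column above is meant to have.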
Of course, whatever API we design needs to be able to handle many different data types. Perhaps it makes more sense to introduce a sublevel per dtype, e.g.
- .uc_volume.image.download()
- .uc_volume.json.query()
In this case we could leverage existing methods/expressions in the url, image and json modules.
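A rough sketch of what the per-dtype sublevels could look like, in plain Python so it runs without Daft. Everything here is hypothetical: UCVolumeNamespace, the injected read callable, and the class layout are illustration only, and query does a simple top-level key lookup rather than whatever query language the real expression would use.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

# Hypothetical: any callable that maps a reference to its raw bytes.
Reader = Callable[[str], bytes]


@dataclass
class _ImageNamespace:
    refs: List[str]
    read: Reader

    def download(self) -> List[Optional[bytes]]:
        """Fetch bytes for each reference; None where the read fails."""
        out: List[Optional[bytes]] = []
        for ref in self.refs:
            try:
                out.append(self.read(ref))
            except (OSError, KeyError):
                out.append(None)
        return out


@dataclass
class _JsonNamespace:
    refs: List[str]
    read: Reader

    def query(self, key: str) -> List[Any]:
        """Parse each reference as a JSON object and return the value under `key`."""
        return [json.loads(self.read(ref)).get(key) for ref in self.refs]


@dataclass
class UCVolumeNamespace:
    """Top-level accessor that hands out one sub-namespace per data type."""

    refs: List[str]
    read: Reader

    @property
    def image(self) -> _ImageNamespace:
        return _ImageNamespace(self.refs, self.read)

    @property
    def json(self) -> _JsonNamespace:
        return _JsonNamespace(self.refs, self.read)


# In-memory stand-in for a volume: reference -> file contents.
store = {"a.json": b'{"label": "cat"}', "b.json": b'{"label": "dog"}'}
volume = UCVolumeNamespace(list(store), store.__getitem__)
labels = volume.json.query("label")
```

The point of the layout is that each dtype keeps its own small set of methods, so adding a new type means adding one sub-namespace instead of widening a single download() signature.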
Curious to hear what other folks think.