Add Unity Catalog Volume support

Open avriiil opened this issue 7 months ago • 3 comments

It would be great to start building out Volume support from Daft for Unity Catalog. Images and JSON feel like the highest prio to start with.

Right now, Table support looks like this:

import daft
from daft.unity_catalog import UnityCatalog

# connect to UC (endpoint and token are your UC server URL and access token)
unity = UnityCatalog(endpoint, token)

# list catalog/schema/tables
print(unity.list_tables("unity.default"))
# ['unity.default.numbers', 'unity.default.marksheet_uniform', 'unity.default.marksheet']

# load table into Daft df
unity_table = unity.load_table("unity.default.numbers")
df = daft.read_delta_lake(unity_table)

It would be cool to be able to do something like this (pseudo-code):

# load volume
unity_volume = unity.load_volume("unity.default.images")

# get refs/urls per image
img_refs = unity_volume.get_references()
df_img = df.with_column("image_refs", img_refs)

# download the bytes behind each reference; failed downloads become null
df_img = df_img.with_column("image_bytes", df_img["image_refs"].uc_volume.download(on_error="null"))

Of course, whatever API we design needs to be able to handle many different data types. Perhaps it makes more sense to introduce a sublevel per dtype, e.g.

  • .uc_volume.image.download()
  • .uc_volume.json.query()

In this case we could leverage existing methods/expressions in the url, image and json modules.
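
To make the per-dtype sublevel idea a bit more concrete, here is a rough, plain-Python sketch of how the namespaces could nest. Everything in it is hypothetical (UCVolumeNamespace, ImageNamespace, JsonNamespace, the fetch callable, the in-memory FAKE_VOLUME); none of it is an existing Daft or Unity Catalog API, and the query(key) here is just a top-level key lookup standing in for a real JSON query. It only illustrates the shape of the API and the on_error="null" behaviour:

```python
import json
from typing import Callable, List, Optional

# Hypothetical: a fetcher maps a volume file reference to its raw bytes.
Fetcher = Callable[[str], bytes]


def _download(refs: List[str], fetch: Fetcher, on_error: str) -> List[Optional[bytes]]:
    """Fetch every reference; on failure either re-raise or record None."""
    if on_error not in ("raise", "null"):
        raise ValueError("on_error must be 'raise' or 'null'")
    results: List[Optional[bytes]] = []
    for ref in refs:
        try:
            results.append(fetch(ref))
        except Exception:
            if on_error == "raise":
                raise
            results.append(None)
    return results


class ImageNamespace:
    """Hypothetical `.uc_volume.image` sublevel."""

    def __init__(self, refs: List[str], fetch: Fetcher):
        self._refs, self._fetch = refs, fetch

    def download(self, on_error: str = "raise") -> List[Optional[bytes]]:
        return _download(self._refs, self._fetch, on_error)


class JsonNamespace:
    """Hypothetical `.uc_volume.json` sublevel."""

    def __init__(self, refs: List[str], fetch: Fetcher):
        self._refs, self._fetch = refs, fetch

    def query(self, key: str, on_error: str = "raise") -> list:
        # Simplified stand-in for a JSON query: look up one top-level key.
        raw_docs = _download(self._refs, self._fetch, on_error)
        return [None if raw is None else json.loads(raw).get(key) for raw in raw_docs]


class UCVolumeNamespace:
    """Hypothetical `.uc_volume` entry point with one sublevel per dtype."""

    def __init__(self, refs: List[str], fetch: Fetcher):
        self.image = ImageNamespace(refs, fetch)
        self.json = JsonNamespace(refs, fetch)


# In-memory stand-in for a volume, so the sketch runs without a UC server.
FAKE_VOLUME = {"cat.png": b"\x89PNG", "meta.json": b'{"label": "cat"}'}


def fake_fetch(ref: str) -> bytes:
    return FAKE_VOLUME[ref]  # raises KeyError for unknown references


images = UCVolumeNamespace(["cat.png", "missing.png"], fake_fetch).image.download(on_error="null")
labels = UCVolumeNamespace(["meta.json"], fake_fetch).json.query("label")
```

In a real implementation the sublevels would presumably delegate to the existing url, image and json expressions instead of re-implementing the download and parsing logic like this.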

Curious to hear what other folks think.

avriiil · Jul 04 '24 12:07