feat(rust): implement a ListingCatalog

Open yanghua opened this issue 11 months ago • 14 comments

yanghua avatar Dec 26 '24 09:12 yanghua

Codecov Report

Attention: Patch coverage is 88.38710% with 18 lines in your changes missing coverage. Please review.

Project coverage is 78.85%. Comparing base (58c5e27) to head (cb98b79).

Files with missing lines | Patch % | Lines
rust/lance/src/catalog/catalog_trait.rs | 88.38% | 9 Missing and 9 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3300      +/-   ##
==========================================
+ Coverage   78.81%   78.85%   +0.03%     
==========================================
  Files         250      251       +1     
  Lines       91306    91461     +155     
  Branches    91306    91461     +155     
==========================================
+ Hits        71963    72121     +158     
+ Misses      16390    16376      -14     
- Partials     2953     2964      +11     
Flag Coverage Δ
unittests 78.85% <88.38%> (+0.03%) :arrow_up:

codecov-commenter avatar Dec 26 '24 09:12 codecov-commenter

Hi @westonpace I drafted a PR for discussion.

About naming, some thoughts:

  • For the Catalog API, which do you prefer: xxx_dataset or xxx_table?
  • For the Catalog API, which do you prefer: xxx_namespace or xxx_database?

yanghua avatar Dec 26 '24 12:12 yanghua

@westonpace thanks for your comments.

Naming-wise I think I ended up with "database" (postgres uses both I think) where you chose "catalog" (lots of tools use this). I don't care too strongly which we want to go with. If you want to stick with catalog then feel free to open up a PR renaming the concepts.

I don't care much about the naming either. Actually, "database" is equivalent to "namespace". Generally, there are three basic layers: catalog -> database -> table. The "catalog" concept is usually used to represent a "tenant" or something similar.

As for the PRs you linked, I need some time to do some research (including lancedb). If I have questions, I will ask you.

The first question is: is the native table a pure Lance-format dataset?

image

Or, put another way: if we create a native table using lancedb, can it be read and written via both the Lance SDK and LanceDB?

yanghua avatar Feb 11 '25 12:02 yanghua

The first question is: is the native table a pure Lance-format dataset?

Or, put another way: if we create a native table using lancedb, can it be read and written via both the Lance SDK and LanceDB?

Yes. The core table trait in lancedb is BaseTable. There are two implementations today NativeTable and RemoteTable. I imagine there will be more implementations in the future for things like views, etc.

NativeTable is just "a lance dataset". It is the only implementation of BaseTable that supports to_lance at the moment.
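A minimal sketch of the relationship described above, written in Python for illustration only. The class names mirror BaseTable, NativeTable, and RemoteTable from the comment, but the constructors, the name() method, and the behavior of to_lance here are assumptions and not the actual lancedb API.

```python
from abc import ABC, abstractmethod


class BaseTable(ABC):
    """Hypothetical stand-in for the table trait; not the real lancedb type."""

    @abstractmethod
    def name(self) -> str:
        ...

    def to_lance(self) -> str:
        # By default an implementation does not expose an underlying Lance dataset.
        raise NotImplementedError(f"{type(self).__name__} does not support to_lance")


class NativeTable(BaseTable):
    """A table that is just a Lance dataset at some URI."""

    def __init__(self, name: str, dataset_uri: str):
        self._name = name
        self._dataset_uri = dataset_uri

    def name(self) -> str:
        return self._name

    def to_lance(self) -> str:
        # Stand-in: return the dataset URI instead of opening a real dataset.
        return self._dataset_uri


class RemoteTable(BaseTable):
    """A table served by a remote endpoint; inherits the unsupported to_lance."""

    def __init__(self, name: str):
        self._name = name

    def name(self) -> str:
        return self._name
```

Code that only relies on the BaseTable interface works with either implementation, while callers that need the raw dataset have to handle the case where to_lance is unsupported.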

westonpace avatar Feb 11 '25 18:02 westonpace

The LanceDB APIs (python, rust, nodejs) will work with anything that implements BaseTable.

westonpace avatar Feb 11 '25 18:02 westonpace

@westonpace What about this design?

image

I added a Catalog concept on top of Database. Its responsibility is to manage databases. Database is used to manage BaseTable as you have done.
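To make the layering concrete, here is a rough in-memory sketch of a Catalog that manages Database objects, each of which manages tables. Every method name (create_database, open_database, drop_database, database_names, create_table, open_table) is hypothetical, and tables are represented by plain URI strings rather than BaseTable instances.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Manages tables; here a table is represented only by its URI."""

    name: str
    tables: dict = field(default_factory=dict)  # table name -> table URI

    def create_table(self, name: str, uri: str) -> str:
        if name in self.tables:
            raise ValueError(f"table already exists: {name}")
        self.tables[name] = uri
        return uri

    def open_table(self, name: str) -> str:
        if name not in self.tables:
            raise KeyError(f"table not found: {name}")
        return self.tables[name]


@dataclass
class Catalog:
    """Sits on top of Database and is responsible only for managing databases."""

    databases: dict = field(default_factory=dict)  # database name -> Database

    def create_database(self, name: str) -> Database:
        if name in self.databases:
            raise ValueError(f"database already exists: {name}")
        self.databases[name] = Database(name)
        return self.databases[name]

    def open_database(self, name: str) -> Database:
        if name not in self.databases:
            raise KeyError(f"database not found: {name}")
        return self.databases[name]

    def drop_database(self, name: str) -> None:
        if name not in self.databases:
            raise KeyError(f"database not found: {name}")
        del self.databases[name]

    def database_names(self) -> list:
        return sorted(self.databases)
```

A persistent implementation would replace the dictionaries with a directory listing or an external metadata service, but the division of responsibility would stay the same.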

yanghua avatar Feb 13 '25 06:02 yanghua

we can maybe key off the URIs in the connect function or pass the adapter to the connect function

Is there any document that describes the URL spec? I searched the lancedb codebase, and only found two types:

  • local path, with no file:// prefix;
  • s3 URL, e.g. s3://bucket/root/path

anything else?

yanghua avatar Feb 13 '25 07:02 yanghua

@westonpace What about this design?

The design looks good to me. What would the API to use this look like? Something like...

import lancedb

cat = lancedb.connect_catalog(CATALOG_URI)
db = cat.open_database("mydb")
tab = db.open_table("mytbl")
...

How would HMS catalog be implemented? I think there's a few options:

External-Catalog-Options

westonpace avatar Feb 17 '25 14:02 westonpace

Is there any document that describes the URL spec? I searched the lancedb codebase, and only found two types:

There is some documentation here: https://lancedb.github.io/lancedb/guides/storage/#object-stores

There are also gs:// and az:// URLs. In addition, DynamoDB can be used as a commit handler for S3 by using s3+ddb://.
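As an illustration of keying off the URI, the sketch below classifies a URI by the schemes mentioned in this thread. The function name classify_uri and the returned labels are made up for this example, and the real lancedb connect logic may differ.

```python
from urllib.parse import urlparse

# Schemes mentioned in this thread, mapped to illustrative labels.
KNOWN_SCHEMES = {
    "s3": "Amazon S3",
    "s3+ddb": "Amazon S3 with a DynamoDB commit handler",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
}


def classify_uri(uri: str) -> str:
    """Return a label for the storage backend implied by the URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme == "":
        # No scheme at all: treat the string as a local filesystem path.
        return "local path"
    if scheme in KNOWN_SCHEMES:
        return KNOWN_SCHEMES[scheme]
    raise ValueError(f"unsupported URI scheme: {scheme}")
```

Note that a Windows path such as C:\data would be parsed as having the scheme "c", so a real implementation would need extra handling for that case.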

westonpace avatar Feb 17 '25 14:02 westonpace

What would the API to use this look like?

IMHO, the demo code snippet you showed looks nice. I just wanted to know whether you agree with introducing the Catalog concept, and it seems we do agree. I will introduce a Catalog trait in the LanceDB repo, WDYT?

Filed a ticket here: https://github.com/lancedb/lancedb/issues/2132

yanghua avatar Feb 18 '25 09:02 yanghua

How would HMS catalog be implemented?

Personally, I would prefer option 2. Like this picture:

image

Some reasons:

  • Unified protocol: HTTP;
  • Adaptation-friendly: language-agnostic.

yanghua avatar Feb 18 '25 09:02 yanghua

@westonpace if we go with option 2, there is another question: what is the relationship between LanceDB and the Spark/Ray connectors (they also need to access the catalog)? Would it look like this?

image

Please note the text in red.

yanghua avatar Feb 18 '25 13:02 yanghua

if we go with option 2, there is another question: what is the relationship between LanceDB and the Spark/Ray connectors (they also need to access the catalog)? Would it look like this?

I agree the connectors may want to access the catalog API instead of just the DB API. They could use connect_catalog instead of connect.

Currently the ray connector works on a single table directly:

import ray
ds = ray.data.read_lance( 
    uri="./db_name.lance",
    columns=["image", "label"],
    filter="label = 2 AND text IS NOT NULL",
)

If we want a ray connector that works with a catalog it could look something like:

import ray
ds = ray.data.read_lance_catalog( 
    uri=CATALOG_URI,
    database="mydb",
    tbl="mytbl",
    columns=["image", "label"],
    filter="label = 2 AND text IS NOT NULL",
)
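One way such a connector could work is to resolve the (catalog, database, table) triple to a table URI and then delegate to the existing single-table reader. The sketch below shows only the resolution step. The helper name table_uri and the directory layout <catalog root>/<database>/<table>.lance are assumptions made for illustration; a real catalog might look the location up in a metadata service instead.

```python
def table_uri(catalog_uri: str, database: str, tbl: str) -> str:
    """Resolve a table location under an ASSUMED layout: <root>/<database>/<table>.lance."""
    for part in (database, tbl):
        # Reject empty names and names that would escape their directory level.
        if not part or "/" in part:
            raise ValueError(f"invalid name: {part!r}")
    root = catalog_uri.rstrip("/")
    return f"{root}/{database}/{tbl}.lance"


# The resolved URI could then be passed to the existing reader, e.g.:
#   ray.data.read_lance(uri=table_uri(CATALOG_URI, "mydb", "mytbl"),
#                       columns=["image", "label"],
#                       filter="label = 2 AND text IS NOT NULL")
```

Keeping resolution separate from reading means the connector would not need to know how the catalog stores its metadata.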

Or potentially even something like the mysql / postgres endpoints that can do joins, etc.

westonpace avatar Feb 19 '25 13:02 westonpace

@westonpace Does this also mean that both Spark and Ray would depend on the lancedb Rust bindings rather than the lance Rust bindings? Does that look OK to you?

And maybe the Spark/Ray-related code is not well suited to living in the lance repo?

yanghua avatar Feb 19 '25 13:02 yanghua

This has landed in the lancedb repo via this PR: https://github.com/lancedb/lancedb/pull/2148

Thanks @westonpace @openinx @SaintBacchus

yanghua avatar Mar 04 '25 06:03 yanghua