lance
feat(rust): implement a ListingCatalog
Codecov Report
Attention: Patch coverage is 88.38710% with 18 lines in your changes missing coverage. Please review.
Project coverage is 78.85%. Comparing base (58c5e27) to head (cb98b79).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/catalog/catalog_trait.rs | 88.38% | 9 Missing and 9 partials :warning: |
Additional details and impacted files
| Coverage Diff | main | #3300 | +/- |
|---|---|---|---|
| Coverage | 78.81% | 78.85% | +0.03% |
| Files | 250 | 251 | +1 |
| Lines | 91306 | 91461 | +155 |
| Branches | 91306 | 91461 | +155 |
| Hits | 71963 | 72121 | +158 |
| Misses | 16390 | 16376 | -14 |
| Partials | 2953 | 2964 | +11 |
| Flag | Coverage Δ |
|---|---|
| unittests | 78.85% <88.38%> (+0.03%) :arrow_up: |
Hi @westonpace I drafted a PR for discussion.
About naming, some thoughts:
- xxx_dataset or xxx_table API in Catalog: which one do you prefer?
- xxx_namespace or xxx_database in Catalog: which one do you prefer?
@westonpace thanks for your comments.
Naming-wise I think I ended up with "database" (postgres uses both I think) where you chose "catalog" (lots of tools use this). I don't care too strongly which we want to go with. If you want to stick with catalog then feel free to open up a PR renaming the concepts.
I also do not care about the naming very much. Actually, "database" is equivalent to "namespace". Generally, there are three basic layers: catalog -> database -> table. The "catalog" concept is usually used to represent a "tenant" or something similar.
About the PRs you provided, I need some time to do some research (including lancedb). If I have questions, I will ask you.
The first question is: is the native table a pure Lance-format dataset?
Or, put another way: if we create a native table using lancedb, can it be read and written via both the Lance SDK and LanceDB?
Yes. The core table trait in lancedb is BaseTable. There are two implementations today: NativeTable and RemoteTable. I imagine there will be more implementations in the future for things like views, etc.
NativeTable is just "a lance dataset". It is the only implementation of BaseTable that supports to_lance at the moment.
The LanceDB APIs (python, rust, nodejs) will work with anything that implements BaseTable.
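To make the relationship above concrete, here is a rough Python sketch of the shape being described. The real trait lives in the lancedb Rust crate; the class bodies, constructor arguments, and the `name` method below are illustrative assumptions, and `to_lance` just returns a URI string instead of opening a real dataset.

```python
from abc import ABC, abstractmethod


class BaseTable(ABC):
    """Hypothetical stand-in for the table trait; not the real lancedb API."""

    @abstractmethod
    def name(self) -> str:
        """Return the table name."""

    def to_lance(self):
        # By default an implementation is not backed by a plain Lance dataset.
        raise NotImplementedError(
            f"{type(self).__name__} does not support to_lance"
        )


class NativeTable(BaseTable):
    """A table that is just a Lance dataset at some URI (assumed fields)."""

    def __init__(self, name: str, dataset_uri: str):
        self._name = name
        self._dataset_uri = dataset_uri

    def name(self) -> str:
        return self._name

    def to_lance(self):
        # Placeholder: a real implementation would open the dataset here.
        return self._dataset_uri


class RemoteTable(BaseTable):
    """A table served by a remote endpoint; inherits the failing to_lance."""

    def __init__(self, name: str):
        self._name = name

    def name(self) -> str:
        return self._name
```

Code written against `BaseTable` works with either implementation, while callers that need the underlying dataset have to handle the case where `to_lance` is unsupported.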
@westonpace What about this design?
I added a Catalog concept on top of Database. Its responsibility is to manage databases. Database is used to manage BaseTable as you have done.
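As a rough illustration of that layering (not the code in this PR), the two levels could be sketched in Python as below. The method names `create_database`, `database_names`, `create_table`, and `table_names`, and the in-memory implementations, are assumptions made for the example; only `open_database` and `open_table` mirror names used elsewhere in this thread.

```python
from abc import ABC, abstractmethod


class Database(ABC):
    """Manages tables (hypothetical interface)."""

    @abstractmethod
    def table_names(self) -> list:
        """List the tables in this database."""

    @abstractmethod
    def open_table(self, name: str):
        """Return the table registered under `name`."""


class Catalog(ABC):
    """Manages databases (hypothetical interface)."""

    @abstractmethod
    def database_names(self) -> list:
        """List the databases in this catalog."""

    @abstractmethod
    def open_database(self, name: str) -> Database:
        """Return the database registered under `name`."""


class InMemoryDatabase(Database):
    def __init__(self):
        self._tables = {}

    def create_table(self, name: str, table) -> None:
        if name in self._tables:
            raise ValueError(f"table {name!r} already exists")
        self._tables[name] = table

    def table_names(self) -> list:
        return sorted(self._tables)

    def open_table(self, name: str):
        return self._tables[name]


class InMemoryCatalog(Catalog):
    def __init__(self):
        self._databases = {}

    def create_database(self, name: str) -> InMemoryDatabase:
        if name in self._databases:
            raise ValueError(f"database {name!r} already exists")
        self._databases[name] = InMemoryDatabase()
        return self._databases[name]

    def database_names(self) -> list:
        return sorted(self._databases)

    def open_database(self, name: str) -> Database:
        return self._databases[name]
```

The catalog never touches tables directly; it only hands out `Database` objects, which keeps the existing database-level code unchanged.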
we can maybe key off the URIs in the connect function or pass the adapter to the connect function
Is there any document that describes the URL spec? I searched the lancedb codebase, and only found two types:
- local path, with no file:// prefix;
- s3 URL, e.g. s3://bucket/root/path

Anything else?
The design looks good to me. What would the API to use this look like? Something like...
import lancedb
cat = lancedb.connect_catalog(CATALOG_URI)
db = cat.open_database("mydb")
tab = db.open_table("mytbl")
...
How would HMS catalog be implemented? I think there's a few options:
Is there any document that describes the URL spec? I searched the lancedb codebase, and only found two types:
There is some documentation here: https://lancedb.github.io/lancedb/guides/storage/#object-stores
There are also gs:// and az:// URLs. In addition, DynamoDB can be used as a commit handler for S3 by using s3+ddb://.
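For the idea of keying off the URI in `connect`, a minimal sketch of scheme detection might look like the following. The function name `classify_uri`, the `"local"` label, and the choice to reject unknown schemes are assumptions for illustration; only the schemes themselves come from this thread.

```python
# Schemes mentioned in this thread; the descriptions are informal labels.
KNOWN_SCHEMES = {
    "s3": "S3 object store",
    "s3+ddb": "S3 with DynamoDB as the commit handler",
    "gs": "Google Cloud Storage",
    "az": "Azure Blob Storage",
}


def classify_uri(uri: str) -> str:
    """Return the storage scheme for `uri`, or "local" for a bare path.

    Hypothetical helper: the real connect logic may differ.
    """
    scheme, sep, _rest = uri.partition("://")
    if not sep:
        # No "scheme://" prefix: treat it as a local filesystem path.
        return "local"
    scheme = scheme.lower()
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unsupported URI scheme: {scheme!r}")
    return scheme
```

A `connect` function could dispatch on the returned value to pick a storage backend, or a caller could bypass detection by passing an adapter explicitly.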
What would the API to use this look like?
IMHO, the demo code snippet you have shown is nice. I just wanted your thoughts on whether you agree to introduce the Catalog concept, and it seems we agree on this. I will introduce a Catalog trait in the LanceDB repo, WDYT?
Filed a ticket here: https://github.com/lancedb/lancedb/issues/2132
How would HMS catalog be implemented?
Personally, I would prefer option 2. Like this picture:
Some reasons:
- Unified protocol: HTTP;
- Adaptation-friendly: language-agnostic;
@westonpace if we go with option 2, there is another question: what is the relationship between LanceDB and the Spark/Ray connectors (they also need to access the catalog)? Does it look like this?
Please note the red words.
if we apply option 2, there would be another question, what's the relationship between LanceDB and Spark/Ray connector(they also need to access catalog)? Look like this?
I agree the connectors may want to access the catalog API instead of just the DB API. They could use connect_catalog instead of connect.
Currently the ray connector works on a single table directly:
import ray
ds = ray.data.read_lance(
uri="./db_name.lance",
columns=["image", "label"],
filter="label = 2 AND text IS NOT NULL",
)
If we want a ray connector that works with a catalog it could look something like:
import ray
ds = ray.data.read_lance_catalog(
uri=CATALOG_URI,
database="mydb",
tbl="mytbl",
columns=["image", "label"],
filter="label = 2 AND text IS NOT NULL",
)
Or potentially even something like the mysql / postgres endpoints that can do joins, etc.
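One way such a catalog-aware reader could be layered on the existing single-table reader is to resolve (catalog URI, database, table) to a table URI and then delegate. The sketch below assumes a listing-style layout of `<catalog>/<database>/<table>.lance`, which is an assumption and not something this thread specifies; `resolve_table_uri` and `read_lance_catalog` are hypothetical helpers, and the reader is passed in so the example runs without ray installed.

```python
def resolve_table_uri(catalog_uri: str, database: str, tbl: str) -> str:
    """Map a catalog/database/table triple to a table URI.

    ASSUMPTION: tables live at <catalog>/<database>/<tbl>.lance.
    """
    return f"{catalog_uri.rstrip('/')}/{database}/{tbl}.lance"


def read_lance_catalog(uri, database, tbl, reader, **read_kwargs):
    """Resolve the table location, then delegate to a single-table reader.

    `reader` is any callable accepting `uri=` plus keyword options,
    for example `ray.data.read_lance`.
    """
    table_uri = resolve_table_uri(uri, database, tbl)
    return reader(uri=table_uri, **read_kwargs)
```

With ray installed, passing `reader=ray.data.read_lance` along with `columns=` and `filter=` would forward those options unchanged to the existing connector.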
@westonpace It also means both Spark and Ray would depend on the lancedb Rust bindings, not the lance Rust bindings, right? Does that look OK to you?
And maybe the Spark/Ray-related code is not suited to living in the lance repo?
This has landed in the lancedb repo via this PR: https://github.com/lancedb/lancedb/pull/2148
Thanks @westonpace @openinx @SaintBacchus