
data pond: expose readable datasets as dataframes and arrow tables

Open sh-rp opened this issue 1 year ago • 7 comments

Description

As an alternative to the ibis integration, we are testing whether we can create our own data reader, without too much effort, that works across all destinations.

Ticket for followup work after this PR is here: https://github.com/orgs/dlt-hub/projects/9/views/1?pane=issue&itemId=80696433

TODO

  • [x] Build dataset and relation interfaces (see @rudolfix comment below)
  • [x] Extend DBApiCursorImpl to support arrow tables (some native cursors support arrow)
  • [x] Ensure all native cursors that have native support for arrow and pandas forward this to DBApiCursorImpl
  • [x] Expose prepopulated duckdb instance from filesystem somehow? Possibly via fs_client interface
  • [x] Figure out default chunk sizes and a nice interface (some cursors / databases figure out their own chunk size such as snowflake, others only return chunks in vector sizes such as duckdb)
  • [ ] Ensure up-to-date docstrings on all new interfaces and methods

sh-rp avatar Jun 21 '24 13:06 sh-rp

Deploy Preview for dlt-hub-docs canceled.

  • Latest commit: fb9a445653a77d6e0664a0844021bb134a1f5606
  • Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/6704fad0fe6dab0008567bbd

netlify[bot] avatar Jun 21 '24 13:06 netlify[bot]

@rudolfix and @jorritsandbrink you can have another look if you like. Changes since last time:

  • Updated the interface as per @rudolfix's requests. I actually like it a lot now too.
  • Moved the actual data access into the DBApiCursor implementations; I think it fits quite well there.
  • Improved the tests so that each function on the cursor is tested.

I'm quite happy with the structure it has now; I'm just not so sure about the naming and placement of the protocols and types. @rudolfix, maybe you could have a look and let me know what you think.

After this review I would suggest making sure that all the docstrings are in order and the test coverage is good, then merging this and adding the full docs in another PR.

sh-rp avatar Aug 07 '24 14:08 sh-rp

IMO the interface is OK. Here is feedback on the code; it is still too early for a regular review:

General

  1. move all dataset-related code to dlt.destinations. try to remove as many inner imports as possible
  2. do not pass Pipeline anywhere. We can implement everything with simpler object

Type hierarchy: I think what you did can be really simplified.

Dataset:

  1. We do not need to take Pipeline here. What we need is to pass the destination reference, the dataset name, and the Schema(s) that are in the pipeline, so we can create the right destination client. Some code that creates the initial configuration may be moved from pipeline to the dlt.destinations namespace, but that is even better.
  2. We should not need any list of prepared tables. AFAIK only filesystem needs that, right? See below.
  3. I like that Dataset is fully lazy. We must keep it like that.
  4. Create an instance of the destination client right away. It will keep all the context we need and also cache the in-memory database for filesystem.
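The "fully lazy" shape described above can be sketched in a few lines: the dataset holds only what is needed to build a client and creates it once, on first access, so an expensive resource (like the in-memory duckdb instance for filesystem) is cached. `LazyDataset` and `client_factory` are hypothetical names for illustration, not dlt's real classes.

```python
from typing import Any, Callable, Optional


class LazyDataset:
    """Holds a client factory and a dataset name; creates nothing upfront."""

    def __init__(
        self,
        client_factory: Callable[[str], Any],  # builds a destination client
        dataset_name: str,
    ) -> None:
        self._client_factory = client_factory
        self._dataset_name = dataset_name
        self._client: Optional[Any] = None  # no client created yet

    @property
    def client(self) -> Any:
        # created on first use and cached afterwards, so any in-memory
        # state the client carries survives between accesses
        if self._client is None:
            self._client = self._client_factory(self._dataset_name)
        return self._client
```

Constructing a `LazyDataset` performs no work at all; the factory runs exactly once, on the first `client` access.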

Relation: We do not need this class at all. Or: drop any references to clients there, make it implement SupportsDataAccess, merge it with DBAPICursor, and return it from execute_query in SqlClient. You do not need SupportsRelationshipAccess, which does exactly the same thing. A Relation of the form above should be very lightweight and do the same thing the cursor does.

Filesystem: implement SqlClient for filesystem:

  • derive from duckdb sql_client
  • re-implement execute_query so we create views as needed by parsing the SQL. Here's a one-liner with sqlglot: https://stackoverflow.com/questions/69684115/python-library-for-extracting-table-names-from-from-clause-in-sql-statetments Of course, create only tables that you know: those that are in the Schema associated with the destination client (there's always one)
  • let's start with read only mode (views). we can add write mode ie. as a flag in the Dataset later
  • we can IMO drop jsonl support, but if not, use the columns hint on read_json to pass the exact schema that we have in the Schema for the given table. In your current implementation, autodetect will be used and the schema will differ.

in the future we can implement a second SqlClient for DataFusion so the engine can be replaced (not only duckdb). WDYT?

rudolfix avatar Aug 08 '24 07:08 rudolfix

@rudolfix you can have another look at the file and class structure.

  • I would say I have worked out the placement and naming of the interfaces/protocols now.
  • There are no inner imports left (except for sqlglot), and I put everything in the destination folder.
  • The instantiation of the dataset is now done in the destination client. I think this makes sense, but let me know if you think otherwise.
  • I have not implemented the sql_client for the filesystem; SQLClientBase does a lot besides plainly executing queries. I'd leave it the way it is for now, since read access is completely abstracted behind these new Protocols anyway.
  • Maybe there is a smart way to forward method calls from ReadableRelation to the cursor, I have not found it yet ;)
  • How should we handle sqlglot? As an extra, or by asking users to install sqlglot when they access the filesystem dataset?

Also one other question: what do you think about using polars for creating arrow tables, or exposing polars?

PS: I think we should eventually also add functions for getting delta and iceberg tables on there, so the cursor and ReadableRelation implementations will deviate at some point.

sh-rp avatar Aug 08 '24 13:08 sh-rp

Per our discussion: maybe we can also add a source function on ReadableDataset and a resource function on ReadableRelation, which would create a DltSource and a DltResource respectively, carrying the schema if used. I'm currently not quite sure how our normalizer handles DataFrames and ArrowTables, but we should somehow automatically disable it in this case.

Additionally, we should somehow enable all our sql destinations that support a ReadableDataset to also work as a source (in that case without a dlt schema) outside of a pipeline context, configured via destination config vars, maybe with some kind of wrapper. They would then all automatically become built-in sources with a native dbapi / cursor implementation. In any case, we should also add sorting, filtering and row-selection support to ReadableRelations, so people do not need to write raw sql when they need this.
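The sorting / filtering / row-selection idea could take the shape of an immutable, chainable builder that assembles the SQL so users never write it by hand. Everything below (`RelationQuery`, the method names `filter`, `order_by`, `limit`) is a hypothetical sketch of that interface, not dlt's API:

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass(frozen=True)
class RelationQuery:
    """Immutable query builder; each method returns a new instance."""

    table: str
    where: List[str] = field(default_factory=list)
    order: Optional[str] = None
    max_rows: Optional[int] = None

    def filter(self, condition: str) -> "RelationQuery":
        return replace(self, where=self.where + [condition])

    def order_by(self, column: str) -> "RelationQuery":
        return replace(self, order=column)

    def limit(self, n: int) -> "RelationQuery":
        return replace(self, max_rows=n)

    def to_sql(self) -> str:
        sql = f"SELECT * FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(self.where)
        if self.order:
            sql += f" ORDER BY {self.order}"
        if self.max_rows is not None:
            sql += f" LIMIT {self.max_rows}"
        return sql
```

Keeping the builder immutable means a partially built relation can be reused and branched without surprises, which fits the fully lazy Dataset design.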

sh-rp avatar Aug 09 '24 07:08 sh-rp

@rudolfix I have an architectural question: as you can see from my latest commit, I am not creating an arrow table schema from our dlt schema if there is no native arrow support on the cursor. For this, the table columns are pushed right down into the cursor, where we will eventually also need the destination capabilities of another destination if we pipe from one dataset into the next pipeline. So I'm thinking we should probably implement this in the ReadableRelation class and only forward the desired arrow schema to the cursor, or even construct the arrow table in the ReadableRelation if there is no native support on the cursor in question. Wdyt?

sh-rp avatar Aug 13 '24 15:08 sh-rp

@rudolfix if you like, you can have another look. I think all your comments are addressed. Open questions for me are still:

  • I don't think we should use the capabilities of the source destination when reading; rather, it should be possible to pass in the capabilities of a given destination pipeline if we know where the data is going.
  • I was not quite sure what you meant by the dataframe document. Do you want to change anything in the place where we emit dataframes (which is done in the standard arrow way now and can be improved in follow-up work)?
  • There were still some open questions around the methods we expose on the readable dataset; I put comments there too.

sh-rp avatar Sep 23 '24 15:09 sh-rp

@rudolfix I think all is addressed in the additional commit I just made. I'm not quite sure what you mean by testing open_connection several times. I am testing this by opening the sql_client context on the filesystem a few times in a row now, but there is no test for parallel access; if that is what you need, let me know.

sh-rp avatar Oct 07 '24 11:10 sh-rp