access data after load / load as dataframes with ibis
Background

ibis (https://github.com/ibis-project/ibis) is a library that translates dataframe expressions into SQL statements and then executes them in the destination. They do a nice job of compiling the final SQL statement with sqlglot (so the resultant SQL is probably quite optimized).

We have a large overlap in destinations and we were looking for a decent dataframe -> sql thing from the very start. It seems that's it: we can easily build a helper that exposes any dlt dataset as a dataframe, share credentials, etc.
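For a flavor of what ibis does, a tiny sketch (schema and names made up):

import ibis

# an unbound table: just a schema, no backend or data attached
t = ibis.table({"id": "int64", "value": "string"}, name="items")
expr = t.group_by("value").agg(n=t.id.count())

# ibis compiles the dataframe expression to dialect-specific SQL via sqlglot
print(ibis.to_sql(expr, dialect="duckdb"))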
Implementation
We can integrate deeply or via a helper scheme. In the helper case, we allow users to get an ibis connection from a dlt destination and/or pipeline. The UX would be similar to the dbt helper.

Deep integration means that we expose the loaded data from the Pipeline, DltSource and DltResource instances, i.e.:
pipeline.run(source, destination="bigquery", dataset_name="data")

# duckdb-like interface
# get rows dbapi style
for row in pipeline.dataset.sql("SELECT * FROM table"):
    ...

# get materialized dataframes
for batch in pipeline.dataset.sql("SELECT * FROM table").arrow(batch_size=10000):
    ...

# get lazy dataframe via ibis
ibis_table = pipeline.dataset.df(table_name)
ibis_connection = pipeline.dataset.df()

# we can expose the `dataset` property of the pipeline via source (and internal
# resources as well); in that case we automatically bind to the right schema
ibis_table = resource.dataset.df()
ibis_connection = source.dataset.df()
Implementation is straightforward for sql-like destinations. We won't support vector databases.
It would be really interesting to support the filesystem destination as above, i.e. by registering the json and parquet files in a temporary duckdb database and then exposing that database to the ibis and sql access methods.
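Something like the following, assuming local parquet files; paths and table names are made up:

import duckdb
import ibis

# build a temporary duckdb database that exposes the parquet files as a view
con = duckdb.connect("scratch.duckdb")
con.execute(
    "CREATE OR REPLACE VIEW my_table AS "
    "SELECT * FROM read_parquet('data/my_table/*.parquet')"
)
con.close()

# `sql` access method: plain dbapi-style queries
con = duckdb.connect("scratch.duckdb")
rows = con.execute("SELECT * FROM my_table LIMIT 10").fetchall()
con.close()

# `df` access method: a lazy ibis table over the same database
ibis_con = ibis.duckdb.connect("scratch.duckdb")
ibis_table = ibis_con.table("my_table")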
**Ibis connection sharing**

We are discussing a connection sharing approach with ibis here: https://github.com/ibis-project/ibis/issues/8877. As mentioned in the comments there, we could build it in a way that we manage the connection and ibis provides backends that accept an open connection and DO NOT need any additional dependencies.
hi @rudolfix, I'm working on Ibis and we were just discussing dlt and a potential integration! I'm curious how you found Ibis per:
we were looking for decent dataframe -> sql thing from the very start
we'd be happy to help move this along, particularly if there are any questions we can answer. in my cursory look at dlt and its destinations last week, basically all except the dedicated vector databases are covered as Ibis backends. let us know if we can help move this forward!
Hi to you both! I recently spent a decent amount of time with dlt + Ibis and I think there's a very clean abstraction to hand-off dlt to Ibis.
dlt side
From the dlt perspective, users pass credentials to create a connection to their Destination (there are several ways to do so). The connection is attached to the Pipeline and is currently available through the .sql_client() method (user guide, SqlClientBase class). For example, the native duckdb connection is available through:

pipeline.sql_client().open_connection()
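For reference, a minimal sketch of that API; the pipeline arguments are made up:

import dlt

pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb")
# the sql client is a context manager; execute_sql runs a query
# against the destination
with pipeline.sql_client() as client:
    rows = client.execute_sql("SELECT 1")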
ibis side
In the upcoming Ibis major release, backends are assigned a self.con attribute inside their .do_connect() method, which is used for the connection.
integration
To hand off the connection from dlt to Ibis, I got this working:
import ibis
import dlt
pipeline = dlt.pipeline(destination="duckdb")  # other arguments elided
ibis.set_backend("duckdb")
ibis_connection = ibis.get_backend() # will return non-connected backend
ibis_connection.con = pipeline.sql_client().open_connection()
ibis_connection.list_tables() # will successfully read data
TODO
- IMO, the integration should live under the dlt repo and be accessible through an API similar to the SQL client.
with pipeline.ibis_client() as client:
    client.list_tables()
- With both dlt and Ibis involved, we need to set explicit behaviors to open/close connections. For instance, how should Ibis behave if the dlt Pipeline closes its connection? I see there is a .native_connection attribute (ref), but I had trouble keeping the connection open when assigning it to ibis_connection.con
- This extension will work for all dlt Destinations that are also an Ibis Backend. We should gracefully handle unsupported hand-offs.
- [not required] On the Ibis side, we could extend the top-level connection API to allow ibis.set_backend() to receive a native db connection (duckdb, postgres, etc.) and automatically assign the proper backend, or add a new method?
@lostmygithubaccount @zilto integrating ibis via ibis_client() is indeed straightforward and we could start there. initially I'd probably go with a helper function (like we have dbt and airflow helpers) that, when provided with an instance of a pipeline or destination, would return a working ibis client.
What about reading parquet files? There's a way to register a parquet file and query it. Are we able to register parquet files with fsspec? What about parquet files that have different schemas (but backward compatible - columns are appended)? Could we still register and query them?
My goal here is to use ibis as the dataframe engine :) and expose it as I imagined in the initial post. so whenever users want to interact with a dataset via dataframes, they get an ibis client; if they want to interact via sql they get a (more or less) dbapi connection. the interface is partially inspired by what duckdb does. what is your take on this? maybe I go too far with hiding what is really under the hood.
@zilto
With dlt and Ibis involved, we need to set explicit behaviors to open/close connections. For instance, how should Ibis behave if the dlt Pipeline closes its connection? I see there is a .native_connection attribute (ref), but I had trouble keeping the connection open if assigning it to ibis_connection.con
here we could do two things:

- pass the credentials to ibis, not the open connection. this needs more work unfortunately, in most cases credentials will match (connection string) but then ibis will manage connections itself. this will also cover all the weird cases with multithreaded code etc. right?
- we could change the sql_client interface to "borrow" the connection to an external object that will be responsible for managing it, and sql_client goes into a closed state. something like Rust does... (rough sketch below)
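A rough sketch of what that second, Rust-like "borrow" could look like; all names here are hypothetical, this is not dlt's actual interface:

# hypothetical: sql_client hands its connection to an external owner and
# marks itself closed until the connection is returned
class BorrowedConnection:
    def __init__(self, con, on_return):
        self._con = con
        self._on_return = on_return

    def __enter__(self):
        return self._con

    def __exit__(self, *exc):
        # hand the connection back so the sql_client can leave its closed state
        self._on_return(self._con)

# hypothetical usage: ibis gets exclusive access while the borrow is active
# with sql_client.borrow_connection() as con:
#     ibis_connection.con = con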
Doing ELT
With Extract, Transform, Load (ETL) as a frame of reference, a dlt pipeline does EL, and the first valuable integration would be adding T with Ibis. A core value prop of Ibis is executing transformations directly on the backend where you loaded the data and potentially returning results in-memory.
To make this possible:

- Ibis needs to connect to the backend (one of your two solutions; the Ibis team will be more helpful here)
- dlt needs to tell Ibis where the data is. For example, while working on Hamilton, I found dlt stores data in a specific duckdb schema/database (ref). Note that this Ibis API is currently changing (sketch below)

^This is where there's immediate value, it just needs a bit of coordination
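For illustration, a sketch of pointing Ibis at the schema dlt loaded into; the duckdb file name and dataset name are assumptions, and the exact keyword on the Ibis side is part of what's currently changing:

import ibis

# connect to the duckdb database file the pipeline wrote to
con = ibis.duckdb.connect("pipeline_name.duckdb")
# dlt loads tables into a schema named after the pipeline's dataset_name
table = con.table("my_table", database="my_dataset")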
Doing ETL
I'm now more familiar with the dlt internals (extract, normalize, load); using loader_file_format="parquet" we effectively have E -> parquet -> L. @rudolfix is this why there was mention of parquet? We could effectively run an Ibis T step over the extracted data, in batch or streaming, between dlt's E and L steps.
Ibis code is primarily about building "expressions" until an "execution operation" (e.g., insert data, return as dataframe). To start defining expressions, Ibis needs a table, which can be an actual table on a backend, a memtable (ref) that you can load from a pyarrow.Table (loader files), or even just a schema with table (ref). For your question about columns, this is cool because it means we could validate Ibis expressions based on dlt schemas, without passing any data through.
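A small sketch of those starting points (data and names made up):

import ibis
import pyarrow as pa

# a memtable built from a pyarrow.Table (e.g. a parquet loader file read back)
arrow_table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
mem = ibis.memtable(arrow_table)

# an unbound table from just a schema: expressions type-check against it
# without any data passing through, which is what would let us validate
# against dlt schemas
t = ibis.table({"id": "int64", "value": "string"}, name="my_table")
expr = t.filter(t.id > 1).select("value")  # nothing executed yet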
The dlt schema evolution / data contract + the Ibis and Substrait relational algebra could provide full lineage, granular "diffs", and visibility over breaking changes.
I have experimented a bit with this here: https://github.com/dlt-hub/dlt/pull/1491. There is no proper way to hand over native connections to ibis backends at the moment. For now I am getting the backends and just setting the .con property, but this does not work for most destinations, so there'd have to be some work on the ibis project side to get this working.
@lostmygithubaccount are there any plans to allow sharing an open connection with ibis? You can see in my code that I am just setting the con property, but a lot of the setup code that would run when executing do_connect will actually not run in this case.
hi @sh-rp, let me try to pull in one of the more experienced engineers on the team -- some initial answers:
- you can sometimes get the connection from the client connection, e.g.:
In [1]: import ibis

In [2]: con = ibis.get_backend()

In [3]: con
Out[3]: <ibis.backends.duckdb.Backend at 0x1081d5750>

In [4]: con.con
Out[4]: <duckdb.duckdb.DuckDBPyConnection at 0x10c8dbab0>
I don't know if using this while using Ibis at the same time is well-defined behavior
then there is an open issue w/ this ask: https://github.com/ibis-project/ibis/issues/8877
@lostmygithubaccount Ah yes, thanks for pointing me to that issue, that is exactly what I'd need. I'll comment there.