Additional backends

Open cigrainger opened this issue 2 years ago • 13 comments

Explorer is primarily an API. The idea for pluggable backends was shamelessly stolen from Nx and dplyr. With Rustler precompiled, we can depend on Polars, but we want additional backends in the future.

So with that said, these are the backends that I think make the most sense to implement. I'm curious to hear if there are others that might make sense. For example, I've mentally written off Spark as being too difficult because I'm unfamiliar with Elixir <> JVM interop, but I'd love to hear if someone has a strategy.

[ ] LazyPolars
[ ] DataFusion/Ballista
[ ] Ecto/SQL

What about something like DuckDB? Does DataFusion have us covered for OLAP?

Sep 06 '21 23:09 cigrainger

@cigrainger big +1 to those data backends. There isn't currently an ecto adapter for kdb+. If one isn't possible I'd also throw that on the list of important backends.

Oct 08 '21 22:10 rupurt

Thanks @rupurt! I have very little experience of kdb+ except for a colleague raving about it. Could you flesh out the use case? Would you be interested in contributing?

Oct 12 '21 08:10 cigrainger

Hi @cigrainger, thanks for this amazing work! IMHO, and just for the discussion: Polars and DataFusion are similar. Both have dataframe APIs on top of Apache Arrow (the data structure), so having both backends is reasonable.

The pure Elixir implementation means that the APIs would be implemented in Elixir, but the data structure could be also Apache Arrow (allowing interoperability), and it could be implemented from zero or using the rust crate.

For me, Ecto is different, as it handles persistence. I would like to have an Explorer.ecto_insert() or Explorer.from_ecto(), which would be similar to to_csv and from_csv apis, and which would be independent from the backend.

DuckDB is also persistence through SQL. Technically it is very similar to Polars, but the objective is to have a single-file database like SQLite. I think that DuckDB could be implemented as an Ecto repo, like exqlite / ecto_sqlite3, and interoperability could be handled by SQL/Ecto, through Parquet files or directly with Arrow (gRPC), but that is another project.

Oct 13 '21 17:10 matreyes

Hi @cigrainger , thanks for all the awesome work!

As for the backend for DataFusion, you can check out this library, which already has Elixir bindings for Apache Arrow, Parquet and DataFusion.

elixir-arrow

It seems to already implement the basics.

Jan 01 '22 08:01 emilioforrer

Thanks for that @emilioforrer! I actually used @treebee's fork of ex_polars when initially building Explorer! I didn't realise they were also working on this. I'll look into it.

Jan 04 '22 01:01 cigrainger

What would we need to do to start working on an explorer_ecto backend? 👀 (btw, is it too early to start that?)

Feb 12 '22 21:02 kimjoaoun

@kimjoaoun I'd like to make some progress on #54 first, which I've started as of last week. It should inform the other approaches.

Feb 13 '22 22:02 cigrainger

One thing to consider is the approach for testing; I worked on a pure Elixir backend for Series this weekend just to get familiar with it, and testing a different backend is a challenge given all the doctests (hard to just implement one thing at a time), a test that relies on the underlying RNG implementation here, a test that is tied to the DataFrame implementation there (I think, that one could just be me) and so on.

I don't pretend to know how to tackle that, but it would definitely be a thing to think through (or at least document if it is already more doable than I am making it out to be) as a precursor to developing additional backends.

Edit: It's not that hard to test one thing at a time with doctests; I'm sure I would never have found this if I hadn't complained publicly first.

Feb 14 '22 01:02 srowley

Thanks @srowley good point.

Just as a PSA I think we'll skip the pure Elixir backend because @philss has been working on precompilation, so the pain point we wanted to address should be moot.

Feb 14 '22 02:02 cigrainger

I also think it may be easier to go with Postgrex/Myxql directly rather than Ecto. I have a hunch that if we use Ecto we will be mostly fighting against its DSL and we really only need a small subset of what Ecto provides.

Feb 14 '22 09:02 josevalim

Hmmm.. my thinking was that the Ecto DSL was exactly what would make it more approachable. For example, a big chunk of dplyr is the way it builds composable queries and translates to SQL: https://dbplyr.tidyverse.org/articles/sql-translation.html#single-table-verbs. So in the same way that we can leverage the polars lazy API and build up queries against in memory dataframes, we can leverage the ecto DSL and build up queries against the db. But I agree there will likely be some fights with the DSL. I just think I'd rather use it than reimplement parts.

Feb 14 '22 09:02 cigrainger

As an example, I think Ecto queries do not allow the field names to be strings. So at least this would need to change. Plus Ecto brings changesets, schemas, transactions, the necessity to define a repository per connection... and I think none of this is actually needed by Explorer? The only part that is really necessary is the AST to SQL layer and that's the smallest problem Ecto (ecto_sql) solves. :) The other part that we would need is managing connections, queries, encoding/decoding, but this is done by the adapters.

Feb 14 '22 11:02 josevalim

That makes a lot of sense. It definitely seems like it would bring a lot of unnecessary baggage. I suppose I'm just a bit daunted by building up an AST from Explorer functions and translating them to SQL. If I'm overestimating that task, then absolutely happy to skip Ecto and just rely on the adapters.

Feb 14 '22 19:02 cigrainger

HDF5 file format would be useful.

https://en.m.wikipedia.org/wiki/Hierarchical_Data_Format

Apr 08 '23 17:04 kevinkirkup

Are there any resources besides the current Polars backend that can help me get started on writing additional backends? I'm pretty familiar with DuckDB and would love to get started on writing one.

I'd also like to create an ODBC backend which would be useful for connecting to many other database engines.

Jun 07 '23 23:06 rupurt

Those would be very welcome! At a very high level the backend just needs to implement the defined behaviours, e.g. Explorer.Backend.DataFrame. And at first it doesn't need to implement the whole thing. Personally, I'd start with one or two IO functions, then start adding simple stuff from there (e.g. select).

Feel free to ping me (the EEF Slack is probably the best place) to discuss!

Jun 08 '23 04:06 cigrainger
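
A rough illustration of the incremental approach suggested in the comment above, written in Python only so the sketch is self-contained and runnable. The names `DataFrameBackend`, `DictBackend`, `from_csv`, `select` and `filter` are hypothetical and do not mirror the actual `Explorer.Backend.DataFrame` callbacks; the point is only that a new backend can start with one IO function and one simple verb, leaving everything else unimplemented.

```python
import csv
import io


class DataFrameBackend:
    """Hypothetical backend contract: every operation is optional at first."""

    def from_csv(self, text):
        raise NotImplementedError("from_csv")

    def select(self, frame, columns):
        raise NotImplementedError("select")

    def filter(self, frame, predicate):
        raise NotImplementedError("filter")


class DictBackend(DataFrameBackend):
    """Toy backend storing a frame as {column_name: [values]}.

    Only one IO function and one verb are implemented; `filter` is
    inherited and still raises NotImplementedError.
    """

    def from_csv(self, text):
        reader = csv.DictReader(io.StringIO(text))
        # Build one list per column; values stay as strings in this sketch.
        frame = {name: [] for name in (reader.fieldnames or [])}
        for row in reader:
            for name in frame:
                frame[name].append(row[name])
        return frame

    def select(self, frame, columns):
        # Keep only the requested columns, in the requested order.
        return {name: frame[name] for name in columns}


backend = DictBackend()
frame = backend.from_csv("city,km\nLima,12.5\nOslo,3.0\n")
selected = backend.select(frame, ["city"])
```

Here `selected` holds only the `city` column, while calling `backend.filter(...)` fails loudly until that operation is added.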

@rupurt note I am working on an ADBC adapter for Explorer+Polars to cover the database connectivity bits (I am often livestreaming it on twitch.tv/josevalim).

Jun 08 '23 07:06 josevalim

@josevalim sweet. Do you have a link to a repo? I could probably use that to get started on the ODBC one. I've written an ecto adapter for Db2 (closed source :{) and I'm pretty familiar with the spec and the current shortcomings with the erlang ODBC driver.

It sounds like the ADBC one would probably make a separate DuckDB backend obsolete.

Jun 08 '23 15:06 rupurt

We are working on github.com/cocoa-xu/adbc and there is a branch. But it is still early stage and very WIP. I think the ADBC is orthogonal to the DuckDB and Polars ones.

Jun 08 '23 15:06 josevalim

DuckDB supports ADBC as of 0.8.

Jun 09 '23 04:06 cigrainger

Nice find. Polars for Python supports it too: https://pola-rs.github.io/polars-book/user-guide/io/database/

Jun 09 '23 06:06 josevalim

Awesome. Thanks @josevalim. The ADBC repo looks like a fantastic resource as a baseline for ODBC.

Jun 09 '23 17:06 rupurt

@josevalim @cigrainger I took a bit of a different path. I ended up creating a DuckDB extension so that it can be used in more contexts. That should mean that once the ADBC backend is ready we can connect to ODBC datasources in Elixir through the extension.

https://github.com/rupurt/odbc-scanner-duckdb-extension

Jul 14 '23 19:07 rupurt

We did land Adbc support on main. :) We also added an API for loading data from an ArrowStream, which could perhaps be a mechanism to integrate Duck and Polars in the future. I think we can close this issue for now. The lazy backend is covered, and we could explore others, but I don't think it is a priority given where the project is. :)

Jul 14 '23 20:07 josevalim

Sweet. My next step is to get this working in Elixir so hopefully what we currently have is enough to get this going.

Jul 14 '23 20:07 rupurt

I'm looking for the equivalent of to_sql(dataframe) in Pandas. Essentially, my goal is to write dataframes to MySQL, Postgres, etc. Is there any way to do this at this time?

Jul 19 '23 18:07 abrunner94

It is not possible currently. :)

Jul 19 '23 18:07 josevalim

Ah bummer! Will it ever be part of Explorer or is anything similar planned?

Jul 19 '23 19:07 abrunner94

It can be a custom backend, if you wanna tackle it. The Lazy backend here is already capable of building a query, then you would need a translation layer to SQL (depending on the underlying SQL database).

Jul 19 '23 19:07 josevalim
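
As a rough illustration of the "build a query lazily, then translate it to SQL" idea from the comment above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `LazyFrame`, `quote_ident` and `to_sql` are hypothetical names, not Explorer's lazy backend, and the `?` placeholder style matches SQLite (Postgres, for example, uses `$1`, which is why the translation layer depends on the underlying database).

```python
import sqlite3
from dataclasses import dataclass

ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass(frozen=True)
class LazyFrame:
    """Records operations instead of running them (hypothetical sketch)."""

    table: str
    columns: tuple = ()
    filters: tuple = ()  # each entry is (column, op, value)

    def select(self, *columns):
        return LazyFrame(self.table, tuple(columns), self.filters)

    def filter(self, column, op, value):
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        return LazyFrame(self.table, self.columns, self.filters + ((column, op, value),))


def quote_ident(name):
    # Column names are plain strings, so quote them as SQL identifiers.
    return '"' + name.replace('"', '""') + '"'


def to_sql(frame):
    """Translate the recorded operations into (sql, params)."""
    cols = ", ".join(quote_ident(c) for c in frame.columns) or "*"
    sql = f"SELECT {cols} FROM {quote_ident(frame.table)}"
    params = []
    if frame.filters:
        clauses = []
        for column, op, value in frame.filters:
            clauses.append(f"{quote_ident(column)} {op} ?")
            params.append(value)  # values travel as parameters, never inlined
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Usage against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "trips" ("city" TEXT, "km" REAL)')
conn.executemany('INSERT INTO "trips" VALUES (?, ?)', [("Lima", 12.5), ("Oslo", 3.0)])

query = LazyFrame("trips").filter("km", ">", 5).select("city")
sql, params = to_sql(query)
rows = conn.execute(sql, params).fetchall()
```

Nothing touches the database until `conn.execute` runs; the chained calls only accumulate a description of the query, which is what makes a per-database translation step possible.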
