datachain icon indicating copy to clipboard operation
datachain copied to clipboard

Decouple DataChain and DatasetQuery

Open dmpetrov opened this issue 1 year ago • 7 comments

It's needed for:

  • a clean API - now these two concepts are too different to fit a single API
  • a parts or DatasetQuery needs some redesign and should not be exposed to users
  • depreciate old, flat signal based API: add_signals() and such

ToDo:

  • [ ] DataChain should have params (and possible output values connected to metrics)
  • [ ] DataChain should be reusable from another chains

dmpetrov avatar Jun 28 '24 21:06 dmpetrov

A couple questions:

  • what should be the role of DatasetQuery
  • what's its expected relationship with DataChain (base class, attribute, something else...)?

rlamy avatar Jul 01 '24 12:07 rlamy

Also, DatasetQuery and DataChain are no longer compatible with one another, and build different kinds of datasets. One could be used as a superset of the other, but with the introduction of signals_schema, that is no longer true.

The decisions here might affect https://github.com/iterative/dvcx/issues/1597 and similar other issues.

skshetry avatar Jul 01 '24 14:07 skshetry

DataChain should become a pipeline. Something like Parameterized Pipeline in DVC but in pythonic way. While DatasetQuery (name is still not good) is a lower level structure for running DB queries.

  • what's its expected relationship with DataChain

It's likely attribute. And datachain might potentially have multiple datasets / datasetqueries.

DatasetQuery and DataChain are no longer compatible with one another,

💯 That's one of the reasons why these needs to be decoupled.

dmpetrov avatar Jul 01 '24 20:07 dmpetrov

DatasetQuery is currently much more than just running DB queries so we would need to remove most of the logic there. My question is why do we even need 2 levels of API. It just complicates things and requires 2 levels of tests etc. If we need something to just run DB queries we have metastore / warehouse classes that are doing that atm.

Why don't we just merge those two i.e move needed stuff from DatasetQuery to Datachain and remove DatasetQuery altogether?

ilongin avatar Jul 03 '24 20:07 ilongin

DatasetQuery is currently much more than just running DB queries so we would need to remove most of the logic there.

DatasetQuery is an internal API, and is not going to be exposed to the users. So, as you have said, we should remove all the logic that does anything other than working with queries/udfs, etc.

Why don't we just merge those two i.e move needed stuff from DatasetQuery to Datachain and remove DatasetQuery altogether?

Dmitry wanted to have an internal abstraction over sqlalchemy that handles all the querying, udfs, chunking, etc. Querying can be done through sqlalchemy already as you have mentioned, but DatasetQuery needs to do more than that, which to me makes sense.


Tbh, I only learnt about it this week, that DatasetQuery is not a user-focused API. I have added some APIs like from_dataframe/from_pandas etc. that is no longer useful, and can be removed. Similarly, to_records() and other IO related APIs can be moved to DataChain or in some other places.

skshetry avatar Jul 04 '24 05:07 skshetry

@skshetry It was user-focused API but I would say it's just deprecated recently in favor of the new Datachain

ilongin avatar Jul 04 '24 08:07 ilongin

Since we're now getting rid of most of the code that expected DatasetQuery to be user-facing, it might be time to tackle this. A simple first step should be to just remove the inheritance relationship and add a _query: DatasetQuery attribute to DataChain (and replace all super calls, etc.) That should make it easier to see which parts should be moved from DatasetQuery and what its exact role (and name!) should be.

rlamy avatar Sep 18 '24 14:09 rlamy