Add ability to read/write to SQL databases.

Dref360 opened this pull request 3 years ago • 9 comments

Fixes #3094

Add ability to read/write to SQLite files and also read from any SQL database supported by SQLAlchemy.

I didn't add SQLAlchemy as a dependency since it is fairly big, so it remains optional.

I also recorded a Loom to showcase the feature.

https://www.loom.com/share/f0e602c2de8a46f58bca4b43333d541f
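
For a quick picture of the feature, here is a minimal sketch of the round trip (using the Dataset.from_sql/Dataset.to_sql names this thread converges on; the exact merged signatures may differ):

    # Hedged sketch: from_sql/to_sql follow the API discussed later in this thread.
    from datasets import Dataset

    ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

    # Write the dataset to a table in a local SQLite file.
    ds.to_sql("my_table", "sqlite:///data.db")

    # Read it back; any database URI supported by SQLAlchemy should work for
    # reading, e.g. "postgresql://user:pass@host/dbname".
    reloaded = Dataset.from_sql("my_table", "sqlite:///data.db")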

Dref360 avatar Sep 03 '22 19:09 Dref360

The documentation is not available anymore as the PR was closed or merged.

Ah, CI runs with pandas=1.3.5, which doesn't return the number of rows inserted.

Dref360 avatar Sep 03 '22 20:09 Dref360

wow this is super cool!

julien-c avatar Sep 05 '22 10:09 julien-c

@lhoestq I'm getting an error in the integration tests; not sure if it's related to my PR. Any help would be appreciated :)

if not self._is_valid_token(token):
>               raise ValueError("Invalid token passed!")
E               ValueError: Invalid token passed!

Dref360 avatar Sep 06 '22 14:09 Dref360

I just relaunched the tests, it should be fixed now

lhoestq avatar Sep 06 '22 14:09 lhoestq

Thanks a lot for working on this!

I have some concerns with the current design:

  • Besides SQLite, the loader should also work with the other engines supported by SQLAlchemy. (A better name for it in the current state would be sqlite :))
  • It should support arbitrary queries/table names - only the latter currently works.
  • Exposing this loader as a packaged builder (load_dataset("sql", ...)) is not a good idea for the following reasons:
    • Considering that the scenario where a table with the same name is present in multiple files is very unlikely, the data files resolution is not needed here. And if we remove it, what should the name of the default split be? "train"?
    • load_dataset("sql", ...) also implies that streaming should work, but that's not the case. And I don't think we can change that, considering how hard it is to make SQLite files streamable.

All this makes me think we shouldn't expose this builder as a packaged module and should instead limit the API to Dataset.from_sql/Dataset.to_sql, with the signatures matching the ones in pandas as much as possible. Regarding this, note that SQLAlchemy connections are not hashable/picklable, which is required for caching, but I think it's OK to only allow URI strings as connections to bypass that (Dask has the same limitation).
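
To make the caching point concrete, here is a small illustration (the sqlite URI is arbitrary): datasets fingerprints operations by pickling/hashing their arguments, which a plain URI string survives but a live connection does not:

    import pickle

    import sqlalchemy

    uri = "sqlite:///data.db"
    pickle.dumps(uri)  # OK: strings are picklable and hashable, so caching works

    engine = sqlalchemy.create_engine(uri)
    with engine.connect() as conn:
        try:
            pickle.dumps(conn)  # fails: a live connection holds unpicklable state
        except Exception as err:
            print(f"not picklable: {err}")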

WDYT?

mariosasko avatar Sep 07 '22 18:09 mariosasko

Hi @mariosasko thank you for your review.

I agree that load_dataset('sql', ...) is a bit weird, and I would be happy to remove it. To be honest, I only added it because I saw that it was the preferred approach in loading.mdx.

I agree that the SELECT should be a parameter as well. I'll add it.

So far, only Dataset.to_sql explicitly supports any SQLAlchemy Connection. I'm pretty sure that Dataset.from_sql would work with a Connection as well, but it would break the typing from the parent class, which is path_or_paths: NestedDataStructureLike[PathLike]. I would prefer not to break this API contract.

I will have time to work on this over the weekend. Please let me know what you think of the following plan:

  • Remove load_dataset('sql', ...) and edit the documentation to use to_sql, from_sql.
  • Tentatively make the Dataset.from_sql typing work with a SQLAlchemy Connection.
  • Add support for custom queries (the default would be SELECT * FROM {table_name}; see the sketch after this list).
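
A sketch of what that query support could look like (hypothetical at this point in the thread): a bare table name reads the whole table, while a full SELECT statement is pushed down to the database:

    from datasets import Dataset

    uri = "sqlite:///reviews.db"

    # Whole table, equivalent to SELECT * FROM reviews:
    full = Dataset.from_sql("reviews", uri)

    # Arbitrary query, filtered on the database side before materializing:
    positives = Dataset.from_sql("SELECT text, label FROM reviews WHERE label = 1", uri)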

Cheers!

Dref360 avatar Sep 08 '22 01:09 Dref360

Perhaps after we merge https://github.com/huggingface/datasets/pull/4957 (Done!), you can subclass AbstractDatasetInputStream instead of AbstractDatasetReader to not break the contract with the connection object. Also, let's avoid having the default value for the query/table (you can set it to None in the builder and raise an error in the builder config's __post_init__ if it's not provided). Other than that, sounds good!
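
A rough sketch of that validation pattern (the field and class names here are assumptions, not the merged code): keep the fields optional at the dataclass level, but fail fast in __post_init__ instead of baking in a default query.

    from dataclasses import dataclass
    from typing import Optional

    import datasets

    @dataclass
    class SqlConfig(datasets.BuilderConfig):
        sql: Optional[str] = None  # a table name or a SELECT query
        con: Optional[str] = None  # a database URI string

        def __post_init__(self):
            if self.sql is None:
                raise ValueError("sql must be specified (a table name or a query)")
            if self.con is None:
                raise ValueError("con must be specified (a database URI string)")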

mariosasko avatar Sep 08 '22 17:09 mariosasko

@Dref360 I've made final changes/refinements to align the SQL API with Pandas/Dask. Let me know what you think.

mariosasko avatar Sep 21 '22 15:09 mariosasko

Thank you so much! I was missing a lot of things, sorry about that. LGTM

Dref360 avatar Sep 21 '22 17:09 Dref360

I think we can merge if the tests pass.

One last thing I would like to get your opinion on - currently, if SQLAlchemy is not installed, the missing dependency error will be thrown inside pandas.read_sql. Do you think we should be the ones throwing this error, e.g. after the imports in packaged_modules/sql/sql.py if SQLALCHEMY_AVAILABLE is False (note that this would mean making sqlalchemy a required dependency for the docs to be able to add SqlConfig to the package reference)?
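
Concretely, the guard being described could look something like this (a sketch; the flag name comes from the comment above, and the merged code may differ):

    # Near the top of packaged_modules/sql/sql.py:
    try:
        import sqlalchemy  # noqa: F401

        SQLALCHEMY_AVAILABLE = True
    except ImportError:
        SQLALCHEMY_AVAILABLE = False

    if not SQLALCHEMY_AVAILABLE:
        raise ImportError("Using the SQL features of datasets requires sqlalchemy: pip install sqlalchemy")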

mariosasko avatar Sep 23 '22 13:09 mariosasko

One last thing I would like to get your opinion on - currently, if SQLAlchemy is not installed, the missing dependency error will be thrown inside pandas.read_sql

Is sqlalchemy always required for pd.read_sql? If so, I think we can raise the error on our side. But sqlalchemy should still be an optional dependency for datasets IMO

lhoestq avatar Sep 23 '22 14:09 lhoestq

@lhoestq

Is sqlalchemy always required for pd.read_sql? If so, I think we can raise the error on our side.

In our case, it's always required as we only support database URIs.

But sqlalchemy should still be an optional dependency for datasets IMO

Yes, it will remain optional for datasets but will be required for building the docs (as is s3fs, for instance).

mariosasko avatar Sep 23 '22 14:09 mariosasko

Ok I see ! Sounds good :)

lhoestq avatar Sep 26 '22 09:09 lhoestq