Add ability to read-write to SQL databases.
Fixes #3094
Add ability to read/write to SQLite files and also read from any SQL database supported by SQLAlchemy.
I didn't add SQLAlchemy as a dependency since it is fairly big; it remains optional.
I also recorded a Loom to showcase the feature.
https://www.loom.com/share/f0e602c2de8a46f58bca4b43333d541f
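For context, the reader/writer ultimately delegates to pandas' SQL helpers. A minimal self-contained sketch of the round-trip using only pandas and the stdlib `sqlite3` (not the new `datasets` API itself, whose exact surface is still under discussion below):

```python
import sqlite3

import pandas as pd

# Round-trip a small table through SQLite - the same path the
# proposed Dataset.to_sql / Dataset.from_sql methods wrap via pandas.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"text": ["foo", "bar"], "label": [0, 1]})
df.to_sql("data", conn, index=False)           # write the table
out = pd.read_sql("SELECT * FROM data", conn)  # read it back
print(out["text"].tolist())
```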
Ah, CI runs with pandas==1.3.5, which doesn't return the number of rows inserted.
wow this is super cool!
@lhoestq I'm getting errors in the integration tests, not sure if they're related to my PR. Any help would be appreciated :)
```
    if not self._is_valid_token(token):
>       raise ValueError("Invalid token passed!")
E   ValueError: Invalid token passed!
```
I just relaunched the tests, it should be fixed now
Thanks a lot for working on this!
I have some concerns with the current design:
- Besides SQLite, the loader should also work with the other engines supported by SQLAlchemy. (A better name for it in the current state would be `sqlite` :))
- It should support arbitrary queries/table names - only the latter currently works.
- Exposing this loader as a packaged builder (`load_dataset("sql", ...)`) is not a good idea, for the following reasons:
  - The scenario where a table with the same name is present in multiple files is very unlikely, so the data files resolution is not needed here. And if we remove that, what should the name of the default split be? "train"?
  - `load_dataset("sql", ...)` also implies that streaming should work, but that's not the case, and I don't think we can change that, considering how hard it is to make SQLite files streamable.

All this makes me think we shouldn't expose this builder as a packaged module and, instead, limit the API to `Dataset.from_sql`/`Dataset.to_sql` (with the signatures matching the ones in pandas as much as possible). Regarding this, note that SQLAlchemy connections are not hashable/picklable, which is required for caching, but I think it's OK to only allow URI strings as connections to bypass that (Dask has the same limitation).
WDYT?
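The hashability point above is easy to verify: a URI string pickles (and can therefore be fingerprinted for caching), while a live DB connection cannot. A small stdlib-only illustration:

```python
import pickle
import sqlite3

# A URI string survives pickling, so it can feed a cache fingerprint.
uri = "sqlite:///my.db"
assert pickle.loads(pickle.dumps(uri)) == uri

# A live connection object cannot be pickled - hence the URI-only restriction.
conn = sqlite3.connect(":memory:")
try:
    pickle.dumps(conn)
    picklable = True
except TypeError:
    picklable = False
print("connection picklable:", picklable)
```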
Hi @mariosasko thank you for your review.
I agree that `load_dataset('sql', ...)` is a bit weird, and I would be happy to remove it. To be honest, I only added it because I saw that it was the preferred approach in `loading.mdx`.
I agree that the SELECT query should be a parameter as well. I'll add it.
So far, only `Dataset.to_sql` explicitly supports any SQLAlchemy Connection. I'm pretty sure that `Dataset.from_sql` would work with a Connection as well, but it would break the typing of the parent class, which is `path_or_paths: NestedDataStructureLike[PathLike]`. I would prefer not to break this API contract.
I will have time to work on this over the weekend. Please let me know what you think if I do the following:
- Remove `load_dataset('sql', ...)` and edit the documentation to use `to_sql`/`from_sql`.
- Tentatively make the `Dataset.from_sql` typing work with a SQLAlchemy Connection.
- Add support for custom queries (the default would be `SELECT * FROM {table_name}`).
Cheers!
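The proposed fallback from a table name to a query could be sketched like this (a hypothetical helper for illustration, not code from the PR):

```python
from typing import Optional


def build_query(table_name: str, query: Optional[str] = None) -> str:
    """Return the user's query, or default to selecting the whole table."""
    # Assumption: table_name is trusted input; production code should quote or
    # validate identifiers rather than interpolate them into SQL directly.
    return query if query is not None else f"SELECT * FROM {table_name}"


print(build_query("reviews"))
print(build_query("reviews", query="SELECT text FROM reviews LIMIT 10"))
```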
Perhaps after we merge https://github.com/huggingface/datasets/pull/4957 (Done!), you can subclass `AbstractDatasetInputStream` instead of `AbstractDatasetReader` to avoid breaking the contract with the connection object. Also, let's avoid having a default value for the query/table (you can set it to `None` in the builder and raise an error in the builder config's `__post_init__` if it's not provided). Other than that, sounds good!
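The suggested `__post_init__` validation could be sketched as follows (class and field names are illustrative stand-ins, not the actual `SqlConfig`):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SqlConfigSketch:
    # Hypothetical stand-in for the builder config: no default for the
    # query/table or connection, and a hard error if either is missing.
    sql: Optional[str] = None  # query or table name
    con: Optional[str] = None  # database URI

    def __post_init__(self):
        if self.sql is None:
            raise ValueError("sql (a query or table name) must be specified")
        if self.con is None:
            raise ValueError("con (a database URI) must be specified")


cfg = SqlConfigSketch(sql="SELECT 1", con="sqlite:///:memory:")
print(cfg.sql)
```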
@Dref360 I've made final changes/refinements to align the SQL API with Pandas/Dask. Let me know what you think.
Thank you so much! I was missing a lot of things, sorry about that. LGTM
I think we can merge if the tests pass.
One last thing I would like your opinion on: currently, if SQLAlchemy is not installed, the missing-dependency error is thrown inside `pandas.read_sql`. Do you think we should throw this error ourselves, e.g. after the imports in `packaged_modules/sql/sql.py` if `SQLALCHEMY_AVAILABLE` is `False`? (Note that this would mean making `sqlalchemy` a required dependency for the docs, to be able to add `SqlConfig` to the package reference.)
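One common way to implement such a guard (illustrative; the actual check in `datasets` may differ) is to probe for the module without importing it, then fail early with an actionable message:

```python
import importlib.util

# Mirror the SQLALCHEMY_AVAILABLE pattern discussed above.
SQLALCHEMY_AVAILABLE = importlib.util.find_spec("sqlalchemy") is not None


def require_sqlalchemy() -> None:
    """Raise a clear error before pandas.read_sql would fail obscurely."""
    if not SQLALCHEMY_AVAILABLE:
        raise ImportError(
            "sqlalchemy is required for the SQL loader: pip install sqlalchemy"
        )


print("sqlalchemy available:", SQLALCHEMY_AVAILABLE)
```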
> One last thing I would like your opinion on - currently, if SQLAlchemy is not installed, the missing-dependency error is thrown inside `pandas.read_sql`
Is `sqlalchemy` always required for `pd.read_sql`? If so, I think we can raise the error on our side.
But `sqlalchemy` should still be an optional dependency for `datasets` IMO
@lhoestq
> Is `sqlalchemy` always required for `pd.read_sql`? If so, I think we can raise the error on our side.
In our case, it's always required as we only support database URIs.
> But `sqlalchemy` should still be an optional dependency for `datasets` IMO
Yes, it will remain optional for `datasets`, but it will be required for building the docs (as is `s3fs`, for instance).
Ok I see ! Sounds good :)