
[ENH] Support VALUES query


Is your feature request related to a problem? Please describe.
Mostly for testing purposes, it would be great if dask-sql supported creating tables of inline data using the VALUES keyword.

Describe the solution you'd like
In many SQL dialects (including DataFusion, Postgres, and DuckDB), it's possible to construct tables from literal values using the VALUES keyword. See https://www.postgresql.org/docs/current/queries-values.html.

For example:

SELECT * FROM (VALUES (1, 2), (1, 3)) as tbl(column1, column2)

In the DataFusion CLI, this evaluates to:

+---------+---------+
| column1 | column2 |
+---------+---------+
| 1       | 2       |
| 1       | 3       |
+---------+---------+

This isn't currently supported in dask-sql. For example:

from dask_sql import Context
c = Context()
result = c.sql(r"""
SELECT * FROM (VALUES (1, 2), (1, 3)) as tbl(column1, column2)
""")
...
NotImplementedError: No relational conversion for node type Values available (yet).

Describe alternatives you've considered
None

Additional context
I'm in the early stages of adding SQL support to VegaFusion, and I'd like to test SQL dialect generation using self-contained queries that include small inline datasets.
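
For context, this is roughly the setup boilerplate that inline VALUES support would remove when writing self-contained test queries. It's only a sketch using the existing Context.create_table API, and the table and column names are made up for illustration:

import pandas as pd
from dask_sql import Context

c = Context()
# Register the test rows as a named table first, since they cannot yet be
# written inline with VALUES.
c.create_table("tbl", pd.DataFrame({"column1": [1, 1], "column2": [2, 3]}))
result = c.sql("SELECT * FROM tbl")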

jonmmease avatar Jan 27 '23 14:01 jonmmease

Thanks for raising this issue. As mentioned, the DataFusion planner supports these kinds of nodes, and it shouldn't be too hard to add an implementation on the dask-sql side. Is your current plan to use these queries only with CPU-backed Dask DataFrames, or with GPU-backed dask-cudf DataFrames as well?

The primary reason I'm asking is that it's easier to default to creating CPU-backed Dask DataFrames, since we don't have a good API today for users to specify GPU tables in inline cases like these.
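
A rough sketch of what the CPU-backed path could look like, assuming the literal rows and column names have already been extracted from the DataFusion Values node (the function name and interface below are illustrative, not dask-sql's actual plugin API):

import pandas as pd
import dask.dataframe as dd

def values_to_ddf(rows, column_names):
    # VALUES lists are small by nature, so materializing them eagerly into a
    # single-partition, CPU-backed Dask DataFrame should be sufficient.
    pdf = pd.DataFrame(list(rows), columns=column_names)
    return dd.from_pandas(pdf, npartitions=1)

# values_to_ddf([(1, 2), (1, 3)], ["column1", "column2"]) would produce the
# two-row table from the example above.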

ayushdg avatar Jan 31 '23 18:01 ayushdg

For my purposes the CPU backend would be preferable, and since these would necessarily be small datasets, my guess is that this is probably appropriate in general.

jonmmease avatar Feb 01 '23 19:02 jonmmease