dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

[ENH] Consider not persisting by default when using create table as select syntax.

Open ayushdg opened this issue 3 years ago • 4 comments

Is your feature request related to a problem? Please describe. Based on discussions in #218 dask-sql no longer persists by default when creating tables from dataframes using the context.create_table syntax. However when using the create_table syntax from sql the behavior can be slightly different.

CREATE TABLE my_table WITH (...)

will not persist the table by default unless the param persist is passed in the query.

On the other hand

CREATE TABLE my_table AS ( SELECT * from table_2)

does persist my_table.

Describe the solution you'd like Given that the persist behavior is slightly different for similar sounding operations CREATE_TABLE in different situations can be confusing and it might make sense to standardize on this.

Describe alternatives you've considered A workaround to the issue above is using the CREATE VIEW as SELECT syntax which does not persist the table. Mentioned in the dask-sql docs here.

Additional context Since the CTAS syntax does not support custom arguments with the WITH keyword similar to CREATE TABLE WITH I'm not sure how easy it would be to implement this feature in the sql grammar.

here's the snippet that actually sets this to true for persisting: https://github.com/dask-contrib/dask-sql/blob/2010dbd6471457dd50da31385b762f0ebbc3c91f/planner/src/main/codegen/includes/create.ftl#L27-L28

ayushdg avatar Oct 20 '21 17:10 ayushdg

Hi @ayushdg

That's really great catch, And you Absolutely got it right!!.

Currently, persist parameter is hardcoded, which can be either true or false for CreateTableAs, User cannot change the parameter on the fly like CreateTable

I am already seeing your concern here, How about re-using the dictionary config logic you mentioned here https://github.com/dask-contrib/dask-sql/issues/241#issuecomment-944504551

we may have key-value pair of ({persist: False}) in the config for each schema, which gives users a chance to change the persist behavior but I think there is a catch 🤔 , this solution works for the overall schema level, but what if the user needs to persist the dataframe for one table and not for another table?

Can we also consider providing the ability of SET statements? which are used in other engines like spark Reference, I think spark also takes the global config while initializing the sparkContext and can also allow users to change the configurations using `SET statements dynamically.

An example set statement which I have in mind currently is as below, we will try to break this down and save it in the config as key-value pair for every schema. ( Though not sure whether to accept the configurations for table level or not, seems verbose 😅 )

SET dask.sql.{schema_name}.{table_name}.persist = true;

Happy to know everyone's thoughts and suggestions :)

rajagurunath avatar Oct 22 '21 17:10 rajagurunath

happy to do this one

jdye64 avatar Nov 02 '21 14:11 jdye64

@ayushdg could this possibly be a candidate for your latest dask configuration work?

jdye64 avatar Feb 15 '22 00:02 jdye64

In some sense yes. There's two aspects to this work:

  1. Updating this logic on the Java side to accept a config option and potentially also allowing users to set this config within sql itself using the set syntax as discussed here: SET dask.sql.{schema_name}.{table_name}.persist = true;

  2. Allowing users to pass this config from the python side.

The latest config updates will make the second aspect work better and allow users to provide the behavior on a per query basis. I'll need to think a bit more about how it can be done on a per table basis.

ayushdg avatar Feb 15 '22 00:02 ayushdg