
[ENH] Provide visibility into persisted tables and simple "restart" option

Open randerzander opened this issue 3 years ago • 2 comments

When running a number of SQL scripts in a session, it's not always clear whether every script "cleans up" after itself (dropping temp tables, unpersisting tables, de-registering UDFs, etc.).

It would be useful to support Dask-SQL features that:

  1. List all objects persisted to the underlying Dask cluster
  2. Unpersist all persisted tables (perhaps with schema-level granularity as well)
  3. Reset the Context to a clear state (all tables, UDFs, schemas dropped, any objects persisted by Dask-SQL are unpersisted)
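A minimal sketch of what the first two helpers might look like. The function names `list_persisted_tables` and `drop_all_tables` are hypothetical, not existing dask-sql APIs; the `c.schema[...].tables` layout is assumed based on the snippet below, and `drop_table` is the existing Context method:

```python
def list_persisted_tables(c):
    """Return (schema, table) pairs for every registered table.

    Assumes the Context keeps schemas in ``c.schema``, a dict whose
    values expose a ``tables`` dict (as in the snippet below).
    """
    return [
        (schema_name, table_name)
        for schema_name, schema in c.schema.items()
        for table_name in schema.tables
    ]


def drop_all_tables(c, schema_name=None):
    """Drop every table, optionally restricted to a single schema."""
    for s_name, t_name in list_persisted_tables(c):
        if schema_name is None or s_name == schema_name:
            c.drop_table(t_name)
```

Something like this could back both a "list persisted objects" feature and a schema-scoped "unpersist all".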

In benchmarking work, we attempted to drop all tables manually with:

for table in list(c.schema["root"].tables.keys()):
    c.drop_table(table)

but (at least in dask-cudf) this didn't appear to remove all references to the persisted DataFrames. Memory use remained high, and only dropped after calling Dask's client.restart().
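One way to chase those lingering references is to combine dropping the tables with an explicit garbage-collection pass on both the driver and the workers before falling back to a full restart. The `hard_reset` helper below is hypothetical; `Client.run` and `Client.restart` are real dask.distributed APIs, but whether the collection passes actually free the memory here is exactly what this issue is questioning:

```python
import gc


def hard_reset(c, client):
    """Best-effort release of memory held by persisted tables.

    Hypothetical helper: drops every registered table, then forces
    garbage collection locally and on each worker. Assumes the
    Context stores schemas in ``c.schema`` as in the snippet above.
    """
    for schema in c.schema.values():
        for table_name in list(schema.tables):
            c.drop_table(table_name)
    gc.collect()            # release driver-side references
    client.run(gc.collect)  # release worker-side references
    # Last resort: client.restart() kills and restarts all workers,
    # which (as observed above) reliably frees the persisted data.
```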

For item 3, the user can always create a new Context, but it's not clear whether this incurs the overhead of restarting the underlying JVM (cc @jdye64 for comment).

randerzander · Feb 14 '22 15:02

@randerzander resetting the Context would not cause another JVM instance to be started. In fact, the JVM instance is created before any Context exists: it is started when import dask_sql is first run.

For reference ....

>>> import dask_sql
Starting JVM from path /home/jdyer/anaconda3/envs/dask-sql/lib/server/libjvm.so...
...having started JVM
>>> c1 = dask_sql.Context()
>>> c2 = dask_sql.Context()

jdye64 · Feb 14 '22 15:02

the JVM instance is created before any Context exists; it is started when import dask_sql is first run

thanks for confirming

randerzander · Feb 14 '22 15:02