dask-sql
[ENH] Provide visibility into persisted tables and simple "restart" option
Sometimes, when running a number of SQL scripts in a session, it's not clear whether all of the scripts "clean up" after themselves (dropping temp tables, unpersisting tables, de-registering UDFs, etc.).
It would be useful to support Dask-SQL features that:
- List all objects persisted to the underlying Dask cluster
- Unpersist all persisted tables (perhaps also with schema-level granularity)
- Reset the Context to a clean state: all tables, UDFs, and schemas dropped, and any objects persisted by Dask-SQL unpersisted (see the sketch after this list)
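A rough sketch of what the listing and reset features could look like, building only on the existing `Context.schema` mapping and `distributed.Client` calls. The helper names `list_persisted` and `reset_context` are hypothetical, not part of dask-sql's API, and the `schema_name` keyword to `drop_table` is an assumption modeled on `create_table`:

```python
from distributed import Client

def list_persisted(client: Client) -> dict:
    """Hypothetical: map each worker address to the task keys it holds in memory."""
    return client.has_what()

def reset_context(c) -> None:
    """Hypothetical: drop every table in every schema of a dask_sql.Context.

    De-registering UDFs and unpersisting anything dask-sql itself persisted
    would need matching support inside the library.
    """
    for schema_name in list(c.schema.keys()):
        for table in list(c.schema[schema_name].tables.keys()):
            c.drop_table(table, schema_name=schema_name)
```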
In benchmarking work, we attempted to drop all tables manually with:
```python
for table in list(c.schema["root"].tables.keys()):
    c.drop_table(table)
```
but (at least with dask-cudf) this didn't appear to remove all references to the persisted DataFrames: memory use remained high and only dropped after calling Dask's `client.restart()`.
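For reference, below is the fuller cleanup we would expect to work, as a sketch that assumes the Context's cluster is the active `distributed` client. Whether a worker-side garbage-collection pass is enough to release device memory under dask-cudf (e.g. with an RMM pool in use) is exactly what did not seem to hold here:

```python
import gc
from distributed import get_client

client = get_client()  # assumes a distributed scheduler is already running

for table in list(c.schema["root"].tables.keys()):
    c.drop_table(table)

# drop_table removes dask-sql's references, but partitions stay on the
# workers until Python garbage-collects them; force a pass everywhere.
client.run(gc.collect)

# Any keys still listed here are held by something else (an intermediate
# result, a UDF closure, ...) and would explain the lingering memory.
print(client.has_what())
```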
For the third item, the user can always create a new Context, but it's not clear whether this incurs the overhead of restarting the underlying JVM (cc @jdye64 for comment).
@randerzander resetting the Context would not cause another JVM instance to be started. In fact, the JVM instance is created before any Context exists; it is started when `import dask_sql` is first run.
For reference:
```python
>>> import dask_sql
Starting JVM from path /home/jdyer/anaconda3/envs/dask-sql/lib/server/libjvm.so...
...having started JVM
>>> c1 = dask_sql.Context()
>>> c2 = dask_sql.Context()
```
> the JVM instance is created before any Context exists; it is started when `import dask_sql` is first run
Thanks for confirming.