[BUG] When a Jupyter cell's output is a DataFrame, the DataFrame is always evaluated
Describe the bug
When a Jupyter cell's output is a DataFrame, the DataFrame is eagerly evaluated. This may not technically be a bug, since the behavior is documented (last paragraph in this section), but there is no way to disable it, and evaluation can be very expensive and slow.
This is also inconsistent with running a notebook in the Databricks web UI, which apparently does not register a custom formatter for DataFrames.
To Reproduce
Steps to reproduce the behavior:
- Create or open a Jupyter notebook.
- Create a DataFrame, but don't assign it to a variable, so that it is the cell's result (see the sketch below).
- Execute the cell. The DataFrame is evaluated and its results are displayed.
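A minimal sketch of such a cell (names here are illustrative; in a Databricks notebook `spark` is typically already defined):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The bare expression below is the cell's result. The extension's HTML
# formatter eagerly evaluates it to render a preview, even though nothing
# here requests an action like .show() or .collect().
spark.range(10_000_000).selectExpr("id", "id % 7 AS bucket").groupBy("bucket").count()
```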
System information:
- Output of the Help: About command (Cmd-Shift-P):
Version: 1.97.2 (Universal)
Commit: e54c774e0add60467559eb0d1e229c6452cf8447
Date: 2025-02-12T23:20:35.343Z (6 days ago)
Electron: 32.2.7
ElectronBuildId: 10982180
Chromium: 128.0.6613.186
Node.js: 20.18.1
V8: 12.8.374.38-electron.0
OS: Darwin arm64 24.3.0
- Databricks Extension Version: 2.6.0
Databricks Extension Logs: N/A
Additional context
Workaround, to run at the top of the notebook:

```python
from IPython.core.getipython import get_ipython

# Unregister the extension's HTML formatters for Spark DataFrames so that a
# DataFrame that is a cell's result is no longer eagerly evaluated.
html_formatter = get_ipython().display_formatter.formatters["text/html"]
try:
    html_formatter.pop('pyspark.sql.DataFrame')
except KeyError:
    pass
try:
    html_formatter.pop('pyspark.sql.connect.dataframe.DataFrame')
except KeyError:
    pass
```
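With the formatters removed, a bare DataFrame at the end of a cell falls back to its plain repr. When a preview is actually wanted, it can still be requested explicitly; a small sketch, assuming a DataFrame `df`:

```python
# Explicit, bounded evaluation only on request:
df.show(10)              # text table of the first 10 rows
df.limit(10).toPandas()  # small pandas DataFrame; renders as HTML as a cell result
```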
Hi, thanks for reporting this. Right now we do register HTML formatters for DataFrames without any convenient way to disable them. We only have a DATABRICKS_DF_DISPLAY_LIMIT env var that controls how many rows are shown.
It might make sense to add a new setting for the DataFrame formatters that would let you disable them.
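For reference, a sketch of capping the preview size with that variable. It is an assumption here that the extension consults the variable after kernel startup; if it is only read at startup, export it in the environment that launches VS Code instead:

```python
import os

# Assumption: DATABRICKS_DF_DISPLAY_LIMIT is consulted when the formatter
# runs. If it is only read once at startup, set it before launching VS Code
# (e.g. `export DATABRICKS_DF_DISPLAY_LIMIT=10` in your shell).
os.environ["DATABRICKS_DF_DISPLAY_LIMIT"] = "10"
```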
IMO it would be best not to enable this by default. It runs counter to the idea that DataFrames are lazy by default unless spark.sql.repl.eagerEval.enabled is set.
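For contrast, Spark's own eager preview is opt-in. A minimal sketch (the builder line is illustrative; `spark` is usually provided in Databricks notebooks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Opt in to Spark's own eager HTML preview for bare DataFrame expressions.
spark.conf.set("spark.sql.repl.eagerEval.enabled", "true")
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", "20")  # cap the preview size

df = spark.range(5)
df  # as a cell result, this now renders up to 20 rows; without the config it stays lazy
```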
I tried @lendle's workaround, and so far I have mixed feelings. The default behavior makes it really easy to build up an exploratory analysis using %sql cells. Maybe it would be good to mimic the Databricks web UI, where DataFrames are lazy in Python but still previewed in %sql cells.