
[BUG] When a jupyter cell's output is a dataframe, the dataframe is always evaluated

Open lendle opened this issue 10 months ago • 3 comments

Describe the bug When a Jupyter cell's output is a DataFrame, the DataFrame is always evaluated. This may not technically be a bug, since it is documented (last paragraph in this section), but there is no way to disable the behavior, and it can be very expensive and slow.

This is inconsistent with running a notebook in the Databricks web UI, which apparently does not register a custom formatter for DataFrames.
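The mechanism can be illustrated with a small sketch. `LazyFrame` below is a hypothetical stand-in for a lazy Spark DataFrame, but the formatter API is IPython's real one: once an HTML printer is registered for a type, merely displaying an instance forces the "expensive" work.

```python
from IPython.core.formatters import HTMLFormatter

class LazyFrame:
    """Hypothetical stand-in for a lazy Spark DataFrame."""
    evaluations = 0

    def collect(self):
        # Simulate an expensive Spark action
        LazyFrame.evaluations += 1
        return [("row",)]

fmt = HTMLFormatter()
# Mimic what the extension does: register an HTML printer for the type.
# Rendering now forces evaluation, even though the object itself is lazy.
fmt.for_type(LazyFrame, lambda df: f"<table>{len(df.collect())} rows</table>")

html = fmt(LazyFrame())  # this is what Jupyter does to render a cell's output
```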

To Reproduce Steps to reproduce the behavior:

  1. Create or open a Jupyter notebook
  2. Create a DataFrame, but don't assign it to a variable, so that it is the result of the cell.
  3. Execute the cell. The DataFrame is evaluated and its results are displayed.

Screenshots If applicable, add screenshots to help explain your problem.

System information:

  1. Paste the output of the Help: About command (CMD-Shift-P).
Version: 1.97.2 (Universal)
Commit: e54c774e0add60467559eb0d1e229c6452cf8447
Date: 2025-02-12T23:20:35.343Z (6 days ago)
Electron: 32.2.7
ElectronBuildId: 10982180
Chromium: 128.0.6613.186
Node.js: 20.18.1
V8: 12.8.374.38-electron.0
OS: Darwin arm64 24.3.0
  2. Databricks Extension Version 2.6.0

Databricks Extension Logs N/A

Additional context Workaround:

from IPython.core.getipython import get_ipython

# Remove the extension's registered HTML formatters for Spark and Spark
# Connect DataFrames, so they fall back to the default lazy text repr.
html_formatter = get_ipython().display_formatter.formatters["text/html"]
try:
    html_formatter.pop("pyspark.sql.DataFrame")
except KeyError:
    pass

try:
    html_formatter.pop("pyspark.sql.connect.dataframe.DataFrame")
except KeyError:
    pass

Run this at the top of the notebook.
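A slightly more compact variant of the same workaround (a sketch; it relies on IPython's `BaseFormatter.pop` accepting a default value, which avoids the try/except blocks):

```python
from IPython.core.getipython import get_ipython

ip = get_ipython()
if ip is not None:  # only meaningful inside a running IPython/Jupyter kernel
    html_formatter = ip.display_formatter.formatters["text/html"]
    for name in ("pyspark.sql.DataFrame",
                 "pyspark.sql.connect.dataframe.DataFrame"):
        # pop() with a default does not raise KeyError if nothing is registered
        html_formatter.pop(name, None)
```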

lendle avatar Feb 19 '25 21:02 lendle

Hi, thanks for reporting it. Right now we do register HTML formatters for DataFrames without any convenient way to disable them. We only have a DATABRICKS_DF_DISPLAY_LIMIT env var that you can use to control how many rows to show.
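For example (a sketch; the exact semantics of the variable are assumed here, and it presumably must be set before the extension registers its formatters, e.g. in your shell profile or VS Code launch environment rather than mid-session):

```python
import os

# Cap how many rows the extension's HTML formatter renders (assumed semantics).
# Set this in the environment the kernel starts from, not after the fact.
os.environ["DATABRICKS_DF_DISPLAY_LIMIT"] = "10"
```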

It might make sense to add a new setting for the dataframe formatters so that they can be disabled.

ilia-db avatar Feb 20 '25 13:02 ilia-db

IMO it would be best not to enable this by default. It runs counter to the expectation that DataFrames are lazy unless spark.sql.repl.eagerEval.enabled is set.
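For reference, Spark's own opt-in eager rendering is controlled by the config keys below (a configuration sketch; `spark` is assumed to be the active SparkSession):

```python
# With eagerEval off (the default), a bare DataFrame at the end of a cell
# prints only its schema; with it on, Spark itself renders an HTML preview
# of up to maxNumRows rows.
spark.conf.set("spark.sql.repl.eagerEval.enabled", "true")
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", "20")
```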

lendle avatar May 16 '25 05:05 lendle

I tried @lendle's workaround, and so far I have mixed feelings. The default behavior makes it really easy to build up an exploratory analysis using %sql cells. Maybe it would be good to mimic the Databricks web UI, where DataFrames are lazy in Python but still previewed in %sql cells.

michael-richman-lgads avatar May 27 '25 13:05 michael-richman-lgads