positron icon indicating copy to clipboard operation
positron copied to clipboard

Executing code while there is a Pandas DataFrame open causes onDidSchemaUpdate to fire for the DataExplorerClientInstance

Open softwarenerd opened this issue 1 year ago • 1 comments

Positron Version:

Positron Version: 2024.07.0 (Universal) build 95 Code - OSS Version: 1.91.0 Commit: 5bc99e659e3b53a09349a08c205b3ac267b79b46 Date: 2024-07-24T04:48:40.681Z Electron: 29.4.0 Chromium: 122.0.6261.156 Node.js: 20.9.0 V8: 12.2.281.27-electron.0 OS: Darwin arm64 23.5.0

Steps to reproduce the issue:

  1. Open a Pandas DataFrame. flights.parquet will do fine.
import pandas as pd
flights = pd.read_parquet('flights.parquet')
  1. Execute code in the Python runtime that has nothing to do with the DataFrame such as a = 100 and b = 100. You will notice that, regardless of what you execute, the DataExplorerClientInstance will fire a onDidSchemaUpdate event. Here's a movie of this happening (note that I added a console.log call in the DataExplorerCache onDidSchemaUpdate event handler to log the event being fired):

https://github.com/user-attachments/assets/d650e405-0e67-4904-bab9-dd264b51c0a4

Pressing Enter in the Python runtime does the same thing:

https://github.com/user-attachments/assets/1abb7d5d-5fcc-4047-bcc4-2b02e3d85f5a

What did you expect to happen?

I expected no events to be fired on the DataExplorerClientInstance.

Were there any error messages in the Output panel or Developer Tools console?

No.

softwarenerd avatar Jul 24 '24 18:07 softwarenerd

I think this is a consequence of our "large object change detection" policy because we cannot cheaply rule out the possibility of in-place mutations (data changes or schema changes) of pandas DataFrame objects. You can see a unit test here that asserts that on every code execution, data explorers are notified to invalidate their schemas (which implicitly also invalidates the data cache), so the next time they are activated by the user, they will need to refresh themselves

https://github.com/posit-dev/positron/blob/main/extensions/positron-python/python_files/positron/positron_ipykernel/tests/test_data_explorer.py#L1557

The "budget" for how many "cells" of data we are willing to copy/snapshot to do change detection is set here

https://github.com/posit-dev/positron/blob/main/extensions/positron-python/python_files/positron/positron_ipykernel/variables.py#L68

There was some refactoring where these are now supposed to pass through the variables update event as "unevaluated"

https://github.com/posit-dev/positron/blob/main/positron/comms/variables-frontend-openrpc.json#L24

The "unevaluated" variables (which were too big to snapshot / do change detection) should not cause the variables pane to flash.

So in any case, in the scenario where we have more than 10M cells worth of data loaded into memory, any data structure that run over that 10M budget will have schema updates fired to avoid the buggy scenarios where the user modifies a DataFrame in place (by adding or removing columns, or something else) and we don't want an open data explorer to fall out of sync.

wesm avatar Jul 24 '24 23:07 wesm

There isn't a lot user-visible to verify here but I'm leaving in the Verification state for now

wesm avatar Jul 30 '24 17:07 wesm

Verified Fixed

Positron Version(s) : 2024.07.0-125
OS Version          : OSX

Test scenario(s)

Marking verified post discussion with Product Manager

Link(s) to TestRail test cases run or created: N/A

testlabauto avatar Aug 02 '24 14:08 testlabauto