Executing code while there is a Pandas DataFrame open causes onDidSchemaUpdate to fire for the DataExplorerClientInstance
Positron Version:
Positron Version: 2024.07.0 (Universal) build 95 Code - OSS Version: 1.91.0 Commit: 5bc99e659e3b53a09349a08c205b3ac267b79b46 Date: 2024-07-24T04:48:40.681Z Electron: 29.4.0 Chromium: 122.0.6261.156 Node.js: 20.9.0 V8: 12.2.281.27-electron.0 OS: Darwin arm64 23.5.0
Steps to reproduce the issue:
- Open a Pandas
DataFrame.flights.parquetwill do fine.
import pandas as pd
flights = pd.read_parquet('flights.parquet')
- Execute code in the Python runtime that has nothing to do with the
DataFramesuch asa = 100andb = 100. You will notice that, regardless of what you execute, theDataExplorerClientInstancewill fire aonDidSchemaUpdateevent. Here's a movie of this happening (note that I added aconsole.logcall in theDataExplorerCacheonDidSchemaUpdateevent handler to log the event being fired):
https://github.com/user-attachments/assets/d650e405-0e67-4904-bab9-dd264b51c0a4
Pressing Enter in the Python runtime does the same thing:
https://github.com/user-attachments/assets/1abb7d5d-5fcc-4047-bcc4-2b02e3d85f5a
What did you expect to happen?
I expected no events to be fired on the DataExplorerClientInstance.
Were there any error messages in the Output panel or Developer Tools console?
No.
I think this is a consequence of our "large object change detection" policy because we cannot cheaply rule out the possibility of in-place mutations (data changes or schema changes) of pandas DataFrame objects. You can see a unit test here that asserts that on every code execution, data explorers are notified to invalidate their schemas (which implicitly also invalidates the data cache), so the next time they are activated by the user, they will need to refresh themselves
https://github.com/posit-dev/positron/blob/main/extensions/positron-python/python_files/positron/positron_ipykernel/tests/test_data_explorer.py#L1557
The "budget" for how many "cells" of data we are willing to copy/snapshot to do change detection is set here
https://github.com/posit-dev/positron/blob/main/extensions/positron-python/python_files/positron/positron_ipykernel/variables.py#L68
There was some refactoring where these are now supposed to pass through the variables update event as "unevaluated"
https://github.com/posit-dev/positron/blob/main/positron/comms/variables-frontend-openrpc.json#L24
The "unevaluated" variables (which were too big to snapshot / do change detection) should not cause the variables pane to flash.
So in any case, in the scenario where we have more than 10M cells worth of data loaded into memory, any data structure that run over that 10M budget will have schema updates fired to avoid the buggy scenarios where the user modifies a DataFrame in place (by adding or removing columns, or something else) and we don't want an open data explorer to fall out of sync.
There isn't a lot user-visible to verify here but I'm leaving in the Verification state for now
Verified Fixed
Positron Version(s) : 2024.07.0-125
OS Version : OSX
Test scenario(s)
Marking verified post discussion with Product Manager
Link(s) to TestRail test cases run or created: N/A