Changing data in a Pandas DataFrame doesn't fire the onDidDataUpdate event, but instead fires the onDidSchemaUpdate event...
Positron Version: 2024.07.0 (Universal) build 95
Code - OSS Version: 1.91.0
Commit: 5bc99e659e3b53a09349a08c205b3ac267b79b46
Date: 2024-07-24T04:48:40.681Z
Electron: 29.4.0
Chromium: 122.0.6261.156
Node.js: 20.9.0
V8: 12.2.281.27-electron.0
OS: Darwin arm64 23.5.0
Steps to reproduce the issue:
- Open a Pandas DataFrame. `flights.parquet` will do fine:

```python
import pandas as pd
flights = pd.read_parquet('flights.parquet')
```
- Change a cell in the DataFrame. For example:

```python
flights.iat[0, 0] = 2014
```
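As a quick sanity check (independent of Positron), an in-place `iat` edit like the one above changes a value but leaves the pandas schema, i.e. the column names and dtypes, untouched:

```python
import pandas as pd

# Small stand-in frame; the real repro uses flights.parquet.
df = pd.DataFrame({"year": [2013, 2013], "carrier": ["AA", "UA"]})
schema_before = list(zip(df.columns, df.dtypes))

df.iat[0, 0] = 2014  # same kind of in-place cell edit as in the repro

# The value changed but the schema did not.
assert list(zip(df.columns, df.dtypes)) == schema_before
```

So there is nothing on the pandas side that should look like a schema change here.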
You will notice that the DataExplorerClientInstance will fire an onDidSchemaUpdate event even though the schema wasn't changed. Here's a movie of this happening (note that I added a console.log call in the DataExplorerCache onDidSchemaUpdate event handler to log the event being fired):
https://github.com/user-attachments/assets/9d7d5858-cf96-4bef-9e4e-b57755f7d173
What did you expect to happen?
I expected the DataExplorerClientInstance to fire an onDidDataUpdate event.
Were there any error messages in the Output panel or Developer Tools console?
No.
As discussed in https://github.com/posit-dev/positron/issues/4144#issuecomment-2249056982, there is a certain "comparison budget" for detecting changes in data frames beyond which the backend will always fire schema update events for any data frame that has an open data explorer, because ruling out in place changes cannot be done reliably at reasonable cost in those cases.
However, here with a single data frame with ~300K rows and 19 columns, this "over budget" logic shouldn't be kicking in, so this seems buggy to me. I'll have a closer look.
I'm going to add some logic to cache the schema for "small" schemas (fewer than 100 columns; we can tweak the threshold based on how much we want to pay for this, since computing a 100-column schema takes only a few milliseconds). This will let us detect schema changes in the event of in-place updates for the "normal" case, and we can fall back on the "refuse to guess" behavior for large schemas.
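A minimal Python sketch of that caching idea, using hypothetical helper names and the 100-column threshold mentioned above (the actual Positron backend implementation may differ):

```python
import pandas as pd

# Illustrative threshold from the comment above; tweakable.
SMALL_SCHEMA_MAX_COLS = 100

def snapshot_schema(df):
    """Cheaply capture column names and dtypes for a "small" data frame.

    Returns None for wide frames, where we refuse to guess.
    """
    if df.shape[1] > SMALL_SCHEMA_MAX_COLS:
        return None
    return [(name, str(dtype)) for name, dtype in df.dtypes.items()]

def schema_changed(df, cached_snapshot):
    """Compare the current schema against a cached snapshot."""
    if cached_snapshot is None:
        return True  # no cache: conservatively report a schema change
    return snapshot_schema(df) != cached_snapshot
```

With something like this in place, an in-place cell edit compares equal to the cached snapshot, so the backend can fire a data update instead of a schema update.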
This now fires a data update when the dataset has fewer than 10 million total values; for very large datasets it's safer to clear caches and repaint. We can work incrementally to expand the cases where we can avoid a schema update.
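A rough sketch of that decision logic (function name and exact comparison are assumptions for illustration, not the actual patch):

```python
# Hypothetical event chooser mirroring the fix described above:
# small datasets get a cheap data update; very large ones force a
# cache-clearing schema update ("refuse to guess").
SMALL_DATASET_MAX_VALUES = 10_000_000

def choose_update_event(n_rows, n_cols, schema_did_change):
    if schema_did_change:
        return "schema_update"
    if n_rows * n_cols >= SMALL_DATASET_MAX_VALUES:
        # Too many values to rule out schema changes cheaply.
        return "schema_update"
    return "data_update"
```

The ~300K-row, 19-column frame from the report (about 5.7 million values) falls under the threshold, so it gets a data update.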
Verified Fixed
Positron Version(s): 2024.07.0-125
OS Version: OSX
Test scenario(s): Marking as verified after discussion with the Product Manager.
Link(s) to TestRail test cases run or created: N/A