positron icon indicating copy to clipboard operation
positron copied to clipboard

Changing data in a Pandas DataFrame doesn't fire the onDidDataUpdate event, but instead fires the onDidSchemaUpdate event...

Open softwarenerd opened this issue 1 year ago • 2 comments

Positron Version:

Positron Version: 2024.07.0 (Universal) build 95 Code - OSS Version: 1.91.0 Commit: 5bc99e659e3b53a09349a08c205b3ac267b79b46 Date: 2024-07-24T04:48:40.681Z Electron: 29.4.0 Chromium: 122.0.6261.156 Node.js: 20.9.0 V8: 12.2.281.27-electron.0 OS: Darwin arm64 23.5.0

Steps to reproduce the issue:

  1. Open a Pandas DataFrame. flights.parquet will do fine.
import pandas as pd
flights = pd.read_parquet('flights.parquet')
  1. Change an cell in the DataFrame. For example:
flights.iat[0, 0] = 2014

You will notice that the DataExplorerClientInstance will fire an onDidSchemaUpdate event even though the schema wasn't changed. Here's a movie of this happening (note that I added a console.log call in the DataExplorerCache onDidSchemaUpdate event handler to log the event being fired):

https://github.com/user-attachments/assets/9d7d5858-cf96-4bef-9e4e-b57755f7d173

What did you expect to happen?

I expected the DataExplorerClientInstance to fire an onDidDataUpdate event.

Were there any error messages in the Output panel or Developer Tools console?

No.

softwarenerd avatar Jul 24 '24 18:07 softwarenerd

As discussed in https://github.com/posit-dev/positron/issues/4144#issuecomment-2249056982, there is a certain "comparison budget" for detecting changes in data frames beyond which the backend will always fire schema update events for any data frame that has an open data explorer, because ruling out in place changes cannot be done reliably at reasonable cost in those cases.

However, here with a single data frame with ~300K rows and 19 columns, this "over budget" logic shouldn't be kicking in so this seems buggy to me. I'll have a closer look

wesm avatar Jul 24 '24 23:07 wesm

I'm going to add some logic to cache the schema for "small" schemas (< 100 columns, we can tweak the threshold based on how much we want to pay for this. 100 columns is a few milliseconds of computing time) which will enable us to detect schema changes in the event of in-place updates for the "normal" case, and we can fall back on the "refuse to guess" case for large schemas.

wesm avatar Jul 28 '24 23:07 wesm

This just fires a data update now when the dataset has fewer than 10 million total values, for very large datasets it's safer to clear caches and repaint. We can work incrementally to expand use cases where we can avoid a schema update

wesm avatar Jul 30 '24 17:07 wesm

Verified Fixed

Positron Version(s) : 2024.07.0-125
OS Version          : OSX

Test scenario(s)

Marking verified post discussion with Product Manager

Link(s) to TestRail test cases run or created: N/A

testlabauto avatar Aug 02 '24 14:08 testlabauto