positron
Epic: Data Explorer Summary Panel statistics
When the Summary Panel is expanded, it will dynamically calculate and then reveal additional summary statistics for that specific column. The backend does this lazily, because computing the statistics up front would be costly for long or wide datasets.
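A minimal sketch of the lazy behavior described above: statistics are computed only when a column is first expanded, then cached. The names `SummaryStats`, `LazySummaryStats`, and `computeStats` are hypothetical and do not come from the Positron codebase; `computeStats` simply stands in for whatever backend request produces the numbers.

```typescript
// Hypothetical shape of one column's summary statistics (label -> display text).
interface SummaryStats {
    [label: string]: string;
}

// Computes stats for a column at most once, and only when that column
// is expanded in the Summary Panel for the first time.
class LazySummaryStats {
    private cache: { [columnIndex: number]: SummaryStats } = {};
    private computeStats: (columnIndex: number) => SummaryStats;

    // `computeStats` stands in for the potentially expensive backend call.
    constructor(computeStats: (columnIndex: number) => SummaryStats) {
        this.computeStats = computeStats;
    }

    // Call this when the user expands a column.
    onExpand(columnIndex: number): SummaryStats {
        let stats: SummaryStats | undefined = this.cache[columnIndex];
        if (stats === undefined) {
            stats = this.computeStats(columnIndex);
            this.cache[columnIndex] = stats;
        }
        return stats;
    }
}

// Usage: count how often the stand-in "backend" is actually hit.
let backendCalls = 0;
const lazyStats = new LazySummaryStats((columnIndex) => {
    backendCalls += 1;
    return { Column: String(columnIndex) };
});
lazyStats.onExpand(3);
lazyStats.onExpand(3); // served from the cache, no second backend call
```

Columns that are never expanded never trigger a computation, which is the point of keeping this lazy for long or wide datasets.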
Summary stats will be right aligned at the decimal place:
NA: 15
Median: 14
Mean: 15.7
SD: 2.1
Min: 1.2
Max: 20.3
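One way to get the decimal alignment shown above is to pad the integer part on the left and the fractional part on the right. This is only a sketch: `alignAtDecimal` and its helpers are hypothetical names, and it assumes the values have already been formatted as strings with `.` as the decimal separator.

```typescript
// Pads `text` with spaces on the left until it is `width` characters long.
function padLeft(text: string, width: number): string {
    let out = text;
    while (out.length < width) {
        out = " " + out;
    }
    return out;
}

// Pads `text` with spaces on the right until it is `width` characters long.
function padRight(text: string, width: number): string {
    let out = text;
    while (out.length < width) {
        out = out + " ";
    }
    return out;
}

// Aligns already-formatted numeric strings at the decimal point.
function alignAtDecimal(values: string[]): string[] {
    // Split each value into its integer part and its fractional part
    // (the fractional part keeps the leading ".").
    const parts = values.map((value) => {
        const dot = value.indexOf(".");
        return dot === -1
            ? { intPart: value, fracPart: "" }
            : { intPart: value.slice(0, dot), fracPart: value.slice(dot) };
    });

    // Find the widest integer and fractional parts.
    let intWidth = 0;
    let fracWidth = 0;
    for (const part of parts) {
        intWidth = Math.max(intWidth, part.intPart.length);
        fracWidth = Math.max(fracWidth, part.fracPart.length);
    }

    return parts.map(
        (part) => padLeft(part.intPart, intWidth) + padRight(part.fracPart, fracWidth)
    );
}

// The six values from the example above.
const aligned = alignAtDecimal(["15", "14", "15.7", "2.1", "1.2", "20.3"]);
// aligned is ["15  ", "14  ", "15.7", " 2.1", " 1.2", "20.3"]
```

Every returned string has the same width, so rendering them in a monospace font lines the decimal points up.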
Completed:
- [x] Numeric in R
- [x] Numeric in Python
- [x] String in R
- [x] String in Python
- [x] Date in R
- [x] Date in Python
- [x] Datetime in R
- [x] Datetime in Python
- [x] Boolean in R
- [x] Boolean in Python
- [ ] Unknown in R
- [ ] Unknown in Python
Parent Categorical: https://github.com/posit-dev/positron/issues/3417
- [ ] Factor/Categorical in R
- [ ] Factor/Categorical in Python
Number
- Median
- Mean
- Standard Deviation (SD)
- Min
- Max
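For reference, the Number statistics above can be sketched as follows. `summarizeNumbers` and `NumericSummary` are hypothetical names, missing values are modeled as `null` or `NaN`, and the sketch assumes the sample standard deviation (dividing by n - 1), which is what R's `sd()` and pandas' `std()` use by default. The actual backends may differ.

```typescript
// Hypothetical result shape for a numeric column.
interface NumericSummary {
    naCount: number;
    median: number;
    mean: number;
    sd: number;
    min: number;
    max: number;
}

function sum(xs: number[]): number {
    return xs.reduce((acc, x) => acc + x, 0);
}

function summarizeNumbers(values: Array<number | null>): NumericSummary {
    // Separate missing values (null or NaN) from valid ones.
    const valid: number[] = [];
    let naCount = 0;
    for (const v of values) {
        if (v === null || isNaN(v)) {
            naCount += 1;
        } else {
            valid.push(v);
        }
    }

    const n = valid.length;
    if (n === 0) {
        // Nothing to summarize: every statistic is undefined.
        return { naCount: naCount, median: NaN, mean: NaN, sd: NaN, min: NaN, max: NaN };
    }

    // Median: the middle value, or the average of the two middle values.
    const sorted = valid.slice().sort((a, b) => a - b);
    const half = Math.floor(n / 2);
    const middle = n % 2 === 1 ? sorted.slice(half, half + 1) : sorted.slice(half - 1, half + 1);
    const median = sum(middle) / middle.length;

    const mean = sum(valid) / n;

    // Sample standard deviation (n - 1 in the denominator); assumption, see above.
    const squaredDeviations = valid.map((x) => (x - mean) * (x - mean));
    const sd = n > 1 ? Math.sqrt(sum(squaredDeviations) / (n - 1)) : NaN;

    const min = valid.reduce((a, b) => Math.min(a, b), Infinity);
    const max = valid.reduce((a, b) => Math.max(a, b), -Infinity);

    return { naCount: naCount, median: median, mean: mean, sd: sd, min: min, max: max };
}

// Usage: one missing value, eight valid ones.
const numericSummary = summarizeNumbers([2, 4, 4, 4, 5, 5, 7, 9, null]);
// naCount 1, median 4.5, mean 5, min 2, max 9
```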
Boolean
- TRUE N (%)
- FALSE N (%)
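The `N (%)` display for Boolean columns could be produced along these lines. `summarizeBooleans` is a hypothetical name, and the sketch assumes the percentage is taken over all rows, including missing ones; the issue does not say which denominator is intended.

```typescript
// Formats one line such as "TRUE: 3 (60.0%)".
function formatCount(label: string, count: number, total: number): string {
    // Guard against an empty column to avoid dividing by zero.
    const percent = total === 0 ? 0 : (100 * count) / total;
    return `${label}: ${count} (${percent.toFixed(1)}%)`;
}

// Counts TRUE and FALSE values; null models a missing value.
function summarizeBooleans(values: Array<boolean | null>): string[] {
    let trueCount = 0;
    let falseCount = 0;
    for (const v of values) {
        if (v === true) {
            trueCount += 1;
        } else if (v === false) {
            falseCount += 1;
        }
    }
    // Assumption: percentages are relative to all rows, missing included.
    const total = values.length;
    return [
        formatCount("TRUE", trueCount, total),
        formatCount("FALSE", falseCount, total),
    ];
}

const booleanLines = summarizeBooleans([true, true, false, null, true]);
// booleanLines is ["TRUE: 3 (60.0%)", "FALSE: 1 (20.0%)"]
```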
String
- Empty: N (equivalent to a "" string, i.e. implicitly missing)
- Unique: N (number of unique strings)
String sub-category: Categorical/Factor
- Levels: ordered or not ordered, plus the number of levels
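The two plain String statistics listed above (Empty and Unique) can be sketched like this. `summarizeStrings` is a hypothetical name, `null` models an explicitly missing value, and the sketch assumes the empty string counts as one of the unique values; the issue does not specify that detail.

```typescript
// Hypothetical result shape for a string column.
interface StringSummary {
    empty: number;  // number of "" values (implicitly missing)
    unique: number; // number of distinct non-null strings
}

function summarizeStrings(values: Array<string | null>): StringSummary {
    // A prototype-free object used as a dictionary of strings seen so far.
    const seen: { [value: string]: true } = Object.create(null);
    let empty = 0;
    let unique = 0;

    for (const v of values) {
        if (v === null) {
            continue; // explicit missing values are not counted here
        }
        if (v === "") {
            empty += 1;
        }
        if (seen[v] === undefined) {
            seen[v] = true;
            unique += 1;
        }
    }
    return { empty: empty, unique: unique };
}

const stringSummary = summarizeStrings(["a", "", "b", "a", null, ""]);
// empty 2, unique 3 ("a", "", and "b")
```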
Date or Datetime or time
- Number of unique
- Mean
- Median
- Min
- Max
- If time, timezone
Array -- holding off for now
- Number of unique
Struct -- holding off for now
- Number of unique
Unknown -- holding off for now
- Number of unique
/**
 * Possible values for TypeDisplay in ColumnSchema
 */
export enum ColumnSchemaTypeDisplay {
    Number = 'number',
    Boolean = 'boolean',
    String = 'string',
    Date = 'date',
    Datetime = 'datetime',
    Time = 'time',
    Array = 'array',
    Struct = 'struct',
    Unknown = 'unknown'
}
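To illustrate how the type display values could drive which statistics are shown, here is a small lookup table built from the lists earlier in this issue. It is a sketch, not the actual implementation: `TypeDisplay`, `STATS_BY_TYPE`, and `statsFor` are hypothetical names, and a string-literal union mirrors the enum's values so the snippet stays self-contained.

```typescript
// Mirrors the string values of ColumnSchemaTypeDisplay.
type TypeDisplay =
    | 'number' | 'boolean' | 'string'
    | 'date' | 'datetime' | 'time'
    | 'array' | 'struct' | 'unknown';

// Statistics proposed in this issue for date-like columns.
const TEMPORAL_STATS = ['Unique', 'Mean', 'Median', 'Min', 'Max'];

// Statistic labels per type display, taken from the lists in this issue.
const STATS_BY_TYPE: { [K in TypeDisplay]: string[] } = {
    number: ['Median', 'Mean', 'SD', 'Min', 'Max'],
    boolean: ['TRUE', 'FALSE'],
    string: ['Empty', 'Unique'],
    date: TEMPORAL_STATS,
    datetime: TEMPORAL_STATS,
    // The list says "If time, timezone"; read literally here (assumption).
    time: TEMPORAL_STATS.concat(['Timezone']),
    // Held off for now, so no statistics are listed.
    array: [],
    struct: [],
    unknown: [],
};

// Returns the statistic labels to show for a column of the given type.
function statsFor(typeDisplay: TypeDisplay): string[] {
    return STATS_BY_TYPE[typeDisplay];
}

const numberStats = statsFor('number'); // ['Median', 'Mean', 'SD', 'Min', 'Max']
```

Because the table is typed over every member of `TypeDisplay`, adding a new type display value without deciding on its statistics becomes a compile-time error.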
https://github.com/posit-dev/positron/blob/5143bd25007edccad12c8db7c69745b43593b38b/positron/comms/data_explorer-backend-openrpc.json#L333C7-L347C8
@softwarenerd -- I've converted the headers above to the type_display enum values.
I'm working on improvements in the backend protocol to better support these statistics right now.
I'm not sure it makes sense to compute number of unique values for arrays and structs for now -- there are varying degrees of ease of computing this in different backends, so I'll punt on that for now and we can address it later once we can investigate how to compute that consistently.
Sounds good! It would also be interesting to hear from users which metrics they'd like. I've indicated above that we're holding off on arrays, structs, and unknowns for now.
We can close this once #3021 is merged and validated.
@jthomasmock do we have a good test dataset that exercises all of the types, and thus the column summary statistics (including precision, null, empty, various types, etc.)?
We'd want QA to exhaustively cover these statistics and check their validity against that dataset.
I can work on this.
There are some example tests at: https://github.com/r-lib/pillar/blob/main/tests/testthat/test-format_decimal.R
@jthomasmock is there still work to do for Beta on this now that #3021 is merged and validated? (we do need tests but we can close this without them)
@jmcphers I think we are still missing date/datetime stats in: Positron Version: 2024.05.0 (Universal) build 1307
I could pick up the backend side for those. I'm assuming @wesm is not working on it yet, right?
I'm working on the float formatting as we speak, so feel free to pick this up
The checkboxes above cover the stats that were missing as of 2024-05-29: Boolean, date, datetime, factor/categorical, and unknown.
Factor/Categorical summary statistics should not delegate to the summary stats for the display type (e.g. ColumnDisplayType.String, or for pandas whatever the type of the categories is). What other items need to be addressed to resolve this issue?
I think we can close this out, since we've implemented the required items in R/Python, and open new issues if there are data types users would like to see.