Epic: Data Explorer Summary Panel statistics

Open jthomasmock opened this issue 1 year ago • 13 comments

When a column's entry in the Summary Panel is expanded, the Data Explorer will dynamically calculate and then reveal additional summary statistics for that specific column. The calculation is lazy in the backend, since computing these statistics up front for every column would be costly for long/wide datasets.
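A minimal sketch of the lazy, per-column idea, assuming a cache keyed by column index. `LazyColumnStats`, `ColumnStats`, and the `compute` callback are hypothetical names for illustration, not Positron's actual implementation or protocol:

```typescript
// Hypothetical sketch: compute a column's summary stats only on first expansion.
type ColumnStats = { [statName: string]: string };

class LazyColumnStats {
	private cache: { [columnIndex: number]: ColumnStats } = {};

	// `compute` stands in for the (possibly expensive) backend request.
	constructor(private compute: (columnIndex: number) => ColumnStats) {}

	// Called when the user expands a column in the Summary Panel.
	expand(columnIndex: number): ColumnStats {
		let stats = this.cache[columnIndex];
		if (stats === undefined) {
			// First expansion of this column: do the work once and remember it.
			stats = this.compute(columnIndex);
			this.cache[columnIndex] = stats;
		}
		return stats;
	}

	// Called when the underlying data changes, so stale stats are dropped.
	invalidate(): void {
		this.cache = {};
	}
}
```

Nothing is computed for columns that are never expanded, which is the point for wide datasets.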

Summary stats will be right-aligned at the decimal point:

NA:           15
Median:       14
Mean:         15.7
SD:            2.1
Min:           1.2
Max:          20.3
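One way the alignment above could be produced, as a rough sketch: pad each already-formatted value on the left so the integer parts end in the same column. `alignAtDecimal` is a hypothetical helper, not something taken from the Positron codebase:

```typescript
// Hypothetical helper: pad formatted numbers on the left so that their
// decimal points (or last integer digits) line up in a monospace font.
function alignAtDecimal(values: string[]): string[] {
	// The integer part is everything before the ".", or the whole string.
	const parts = values.map((value) => {
		const dot = value.indexOf(".");
		return { value, integerLength: dot === -1 ? value.length : dot };
	});
	const widest = Math.max(0, ...parts.map((part) => part.integerLength));

	return parts.map((part) => {
		let padding = "";
		for (let n = part.integerLength; n < widest; n++) {
			padding += " ";
		}
		return padding + part.value;
	});
}
```

For the values in the example, `alignAtDecimal(["15", "14", "15.7", "2.1", "1.2", "20.3"])` pads `2.1` and `1.2` with one leading space and leaves the rest unchanged.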

Completed:

  • [x] Numeric in R
  • [x] Numeric in Python
  • [x] String in R
  • [x] String in Python
  • [x] Date in R
  • [x] Date in Python
  • [x] Datetime in R
  • [x] Datetime in Python
  • [x] Boolean in R
  • [x] Boolean in Python
  • [ ] Unknown in R
  • [ ] Unknown in Python

Parent Categorical: https://github.com/posit-dev/positron/issues/3417

  • [ ] Factor/Categorical in R
  • [ ] Factor/Categorical in Python

Number

  • Median
  • Mean
  • Standard Deviation (SD)
  • Min
  • Max
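For reference, a small sketch of how these numeric stats could be computed. It assumes `null` and `NaN` both count as missing and that SD is the sample standard deviation (n - 1 denominator, the default in R's `sd()` and pandas' `std()`); the thread does not pin either choice down, and `summarizeNumbers` is a hypothetical name:

```typescript
interface NumberSummary {
	naCount: number;
	median: number | null;
	mean: number | null;
	sd: number | null;
	min: number | null;
	max: number | null;
}

function summarizeNumbers(values: Array<number | null>): NumberSummary {
	// Treat both null and NaN as missing (an assumption of this sketch).
	const present: number[] = [];
	for (const value of values) {
		if (value !== null && !isNaN(value)) {
			present.push(value);
		}
	}
	const naCount = values.length - present.length;
	const n = present.length;
	if (n === 0) {
		return { naCount, median: null, mean: null, sd: null, min: null, max: null };
	}

	const sorted = present.slice().sort((a, b) => a - b);
	const mid = Math.floor(n / 2);
	// Odd count: middle value. Even count: average of the two middle values.
	const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

	const mean = present.reduce((sum, value) => sum + value, 0) / n;

	// Sample standard deviation; not defined for a single value.
	const squaredDiffs = present.reduce(
		(sum, value) => sum + (value - mean) * (value - mean),
		0
	);
	const sd = n > 1 ? Math.sqrt(squaredDiffs / (n - 1)) : null;

	return { naCount, median, mean, sd, min: sorted[0], max: sorted[n - 1] };
}
```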

Boolean

  • TRUE N (%)
  • FALSE N (%)
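A sketch of the `N (%)` display for booleans. The issue does not say whether the percentage is taken over all rows or only over non-missing rows; this example assumes all rows, and the function names are hypothetical:

```typescript
interface BooleanSummary {
	trueCount: number;
	falseCount: number;
	naCount: number;
}

function summarizeBooleans(values: Array<boolean | null>): BooleanSummary {
	const summary: BooleanSummary = { trueCount: 0, falseCount: 0, naCount: 0 };
	for (const value of values) {
		if (value === null) {
			summary.naCount++;
		} else if (value) {
			summary.trueCount++;
		} else {
			summary.falseCount++;
		}
	}
	return summary;
}

// Render a count as "N (P%)". The denominator is the total row count,
// including missing values; that choice is an assumption of this sketch.
function formatCountWithPercent(count: number, total: number): string {
	const percent = total === 0 ? 0 : (100 * count) / total;
	return `${count} (${percent.toFixed(1)}%)`;
}
```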

String

  • Empty: N (the number of "" strings, i.e. implicit missing values)
  • Unique (Number of unique strings)
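A sketch of the two string stats, assuming `null` marks an explicitly missing value and that the empty string counts as one of the unique values (the issue does not specify this). `summarizeStrings` is a hypothetical name:

```typescript
interface StringSummary {
	naCount: number;
	empty: number;
	unique: number;
}

function summarizeStrings(values: Array<string | null>): StringSummary {
	// Explicitly missing values (null) are counted separately from "".
	const present: string[] = [];
	for (const value of values) {
		if (value !== null) {
			present.push(value);
		}
	}
	const empty = present.filter((value) => value === "").length;

	// Count distinct strings by sorting and counting where a new run starts.
	const sorted = present.slice().sort();
	let unique = 0;
	for (let i = 0; i < sorted.length; i++) {
		if (i === 0 || sorted[i] !== sorted[i - 1]) {
			unique++;
		}
	}

	return { naCount: values.length - present.length, empty, unique };
}
```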

String sub-category: Categorical/Factor

  • Levels: ordered or not ordered, plus the number of levels

Date or Datetime or time

  • Number of unique
  • Mean
  • Median
  • Min
  • Max
  • If time, timezone
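A sketch of the date/datetime stats, computed on epoch milliseconds so that mean and median are well defined. Timezone reporting is left out, `null` and invalid dates are assumed to be missing, and `summarizeDates` is a hypothetical name:

```typescript
interface DateSummary {
	unique: number;
	mean: Date | null;
	median: Date | null;
	min: Date | null;
	max: Date | null;
}

function summarizeDates(values: Array<Date | null>): DateSummary {
	// Work on epoch milliseconds so dates can be sorted and averaged as numbers.
	const times: number[] = [];
	for (const value of values) {
		if (value !== null && !isNaN(value.getTime())) {
			times.push(value.getTime());
		}
	}
	times.sort((a, b) => a - b);

	const n = times.length;
	if (n === 0) {
		return { unique: 0, mean: null, median: null, min: null, max: null };
	}

	let unique = 0;
	let sum = 0;
	for (let i = 0; i < n; i++) {
		sum += times[i];
		// In sorted order, a new distinct value starts wherever neighbors differ.
		if (i === 0 || times[i] !== times[i - 1]) {
			unique++;
		}
	}

	const mid = Math.floor(n / 2);
	const medianTime = n % 2 === 1 ? times[mid] : (times[mid - 1] + times[mid]) / 2;

	return {
		unique,
		mean: new Date(sum / n),
		median: new Date(medianTime),
		min: new Date(times[0]),
		max: new Date(times[n - 1]),
	};
}
```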

Array -- holding off for now

  • Number of unique

Struct -- holding off for now

  • Number of unique

Unknown -- holding off for now

  • Number of unique

jthomasmock avatar Jan 30 '24 04:01 jthomasmock

/**
 * Possible values for TypeDisplay in ColumnSchema
 */
export enum ColumnSchemaTypeDisplay {
	Number = 'number',
	Boolean = 'boolean',
	String = 'string',
	Date = 'date',
	Datetime = 'datetime',
	Time = 'time',
	Array = 'array',
	Struct = 'struct',
	Unknown = 'unknown'
}

softwarenerd avatar Feb 27 '24 19:02 softwarenerd

https://github.com/posit-dev/positron/blob/5143bd25007edccad12c8db7c69745b43593b38b/positron/comms/data_explorer-backend-openrpc.json#L333C7-L347C8

jthomasmock avatar Feb 27 '24 19:02 jthomasmock

@softwarenerd -- I've converted the headers above to type_display enum.

jthomasmock avatar Feb 27 '24 20:02 jthomasmock

I'm working on improvements in the backend protocol to better support these statistics right now.

I'm not sure it makes sense to compute the number of unique values for arrays and structs yet. How easy this is to compute varies across backends, so I'll punt on it for now and we can address it later once we've investigated how to compute it consistently.

wesm avatar Apr 02 '24 16:04 wesm

Sounds good! I also think it'd be interesting to hear from users about what types of metrics they'd like. I've indicated above that we're holding off on arrays, structs, and unknowns for now.

jthomasmock avatar Apr 02 '24 17:04 jthomasmock

We can close this once #3021 is merged and validated.

jthomasmock avatar May 09 '24 19:05 jthomasmock

@jthomasmock do we have a good test dataset that exercises all of the types, and thus the column summary statistics (including precision, null, empty, various types, etc.)?

We'd want QA to exhaustively cover these statistics to check their validity for the dataset.

petetronic avatar May 16 '24 19:05 petetronic

> @jthomasmock do we have a good test data that exercises all of the types and thus the column summary statistics? (including precision, null, empty, various types, etc)
>
> we'd want QA to exhaustively cover these statistics to check their validity for the data set.

I can work on this.

There are some example tests at: https://github.com/r-lib/pillar/blob/main/tests/testthat/test-format_decimal.R

jthomasmock avatar May 17 '24 01:05 jthomasmock

@jthomasmock is there still work to do for Beta on this now that #3021 is merged and validated? (we do need tests but we can close this without them)

jmcphers avatar May 29 '24 18:05 jmcphers

@jmcphers I think we are still missing date/datetime stats in: Positron Version: 2024.05.0 (Universal) build 1307

image

jthomasmock avatar May 29 '24 19:05 jthomasmock

I could pick up the backend side for those. I'm assuming @wesm is not working on it yet, right?

dfalbel avatar May 29 '24 20:05 dfalbel

I'm working on the float formatting as we speak, so feel free to pick this up

wesm avatar May 29 '24 20:05 wesm

The unchecked checkboxes above are the stats still missing as of 2024-05-29: Boolean, date, datetime, factor/categorical, and unknown.

jthomasmock avatar May 29 '24 21:05 jthomasmock

Factor/Categorical summary statistics should not delegate to the summary stats for the display type (e.g. ColumnDisplayType.String, or for pandas whatever the type of the categories is). What other items need to be addressed to resolve this issue?

wesm avatar Mar 21 '25 20:03 wesm

I think we can close this out since we've implemented the required items in R/Python and open new issues if there are data types users would like to see.

jthomasmock avatar Apr 30 '25 01:04 jthomasmock