Ordinal Data Type

Open jinimukh opened this issue 3 years ago • 4 comments

Overview

This PR addresses #240 by adding support for the ordinal data type. Currently, the only way to set a column's data type to ordinal is to call df.set_data_type({"col_name": "ordinal"}). Optionally, if the entries do not have a natural ordering (numeric or alphabetical), a custom ordering can be specified with df.set_data_type({"col_name": "ordinal"}, order={"col_name": [ordered_lst]}). To visualize ordinal data types, we use boxplots, but because boxplots show a bivariate distribution, they only show up when enhancing a selected visualization.
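To illustrate what a custom order changes, here is a minimal plain-Python sketch. It is not the Lux implementation: the helper name `sort_by_order` and the sample values are hypothetical, and it only shows how an explicit order list differs from the default alphabetical ordering.

```python
def sort_by_order(values, order):
    """Sort values by their position in an explicit order list."""
    # Map each label to its rank in the user-supplied ordering.
    rank = {label: i for i, label in enumerate(order)}
    return sorted(values, key=lambda v: rank[v])


sizes = ["large", "small", "medium", "small"]

# Default ordering: alphabetical, which puts "large" first.
alphabetical = sorted(sizes)

# Custom ordering: follows the list passed as `order`.
custom = sort_by_order(sizes, ["small", "medium", "large"])
```

A label that is missing from the order list raises a KeyError in this sketch; how Lux handles that case is not stated in this PR.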

Changes

  • univariate.py: allow ordinal data types to be treated as nominal data types to create bar graphs in Occurrences tab
  • frame.py: allow the set_data_type function to take in optional order argument to specify orders on ordinal data
  • BoxPlot.py: currently only supports Altair BoxPlots
  • Compiler.py: allow the mark to be box when n_dim == 1 and n_msr == 1 and dimension_type == "ordinal"
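The Compiler.py condition above can be sketched as a small standalone function. This is an assumption-laden illustration, not the code in Compiler.py: the function name `choose_mark` and the `None` fallback are hypothetical, and only the condition itself comes from the description.

```python
def choose_mark(n_dim, n_msr, dimension_type):
    """Return "box" for one ordinal dimension plus one measure, else None."""
    if n_dim == 1 and n_msr == 1 and dimension_type == "ordinal":
        return "box"
    # Every other combination is left to the existing mark rules.
    return None


box_case = choose_mark(1, 1, "ordinal")
nominal_case = choose_mark(1, 1, "nominal")
```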

Example Output

Screen Shot 2021-04-14 at 1 51 27 PM

jinimukh avatar Apr 15 '21 05:04 jinimukh

Codecov Report

Merging #360 (7820f1e) into master (1dbbcb9) will decrease coverage by 0.62%. The diff coverage is 50.00%.

:exclamation: Current head 7820f1e differs from pull request most recent head 19a14d8. Consider uploading reports for the commit 19a14d8 to get more accurate results.

@@            Coverage Diff             @@
##           master     #360      +/-   ##
==========================================
- Coverage   84.46%   83.84%   -0.63%     
==========================================
  Files          51       52       +1     
  Lines        3902     3961      +59     
==========================================
+ Hits         3296     3321      +25     
- Misses        606      640      +34     
Impacted Files                                 Coverage Δ
lux/action/univariate.py                       90.38% <ø> (ø)
lux/core/series.py                             53.84% <ø> (ø)
lux/interestingness/interestingness.py         87.95% <ø> (ø)
lux/vislib/matplotlib/MatplotlibRenderer.py    84.61% <0.00%> (-2.69%) :arrow_down:
lux/vislib/altair/BoxPlot.py                   21.87% <21.87%> (ø)
lux/vislib/altair/AltairRenderer.py            94.59% <33.33%> (-2.59%) :arrow_down:
lux/action/enhance.py                          96.87% <66.66%> (-3.13%) :arrow_down:
lux/vislib/altair/BarChart.py                  82.66% <75.00%> (-2.19%) :arrow_down:
lux/core/frame.py                              81.75% <81.81%> (+0.02%) :arrow_up:
lux/executor/Executor.py                       79.48% <100.00%> (ø)
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 1dbbcb9...19a14d8.

codecov[bot] avatar Apr 15 '21 08:04 codecov[bot]

Thanks @jinimukh!! Can we file a follow-up issue to delegate boxplot calculations to the Pandas and SQL Executor? This will help with performance by bringing the rendering cost down from that of a scatterplot (every point) to that of a boxplot (several summary statistics + outliers).
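As a rough sketch of what such a delegated computation might return, the snippet below reduces a list of values to the summary statistics plus outliers that a boxplot needs. It uses only the Python standard library. The function name `boxplot_summary`, the 1.5 × IQR whisker rule, and the inclusive quantile method are assumptions made for illustration, not what the Lux executors or Altair actually do.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Reduce values to quartiles, whisker ends, and outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Points inside the fences determine where the whiskers end.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Only this small dictionary would need to reach the renderer for each category, instead of every raw row.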

dorisjlee avatar Apr 26 '21 02:04 dorisjlee

I'm wondering if ordinal data types have to be a subset of nominal data. Apart from the documentation and the action logic (enhance and univariate), is there anything in the code that treats ordinal as a subset of nominal? For example, can we capture scenarios where an ordinal data type could be a subset of the temporal data type, such as {Summer, Winter, Fall} or {Q1, Q2, …}? It would be helpful to add an example for this.

dorisjlee avatar Apr 26 '21 03:04 dorisjlee

Here are some examples that I was playing around with:

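import pandas as pd
import lux  # Lux extends pandas DataFrames (adds set_data_type)

df = pd.read_csv("https://raw.githubusercontent.com/lux-org/lux-datasets/master/data/aug_test.csv")
# Drop rows where either ordinal column is missing
df = df.dropna(subset=["education_level", "company_size"])

# Ordinal column with an explicit custom order
df.set_data_type({"education_level": "ordinal"},
                 order={"education_level": ["Primary School", "High School", "Masters", "Graduate", "Phd"]})
df["education_level"]


# Size buckets do not sort correctly as strings, so the order is spelled out
df.set_data_type({"company_size": "ordinal"},
                 order={"company_size": [
                     "<10", "10/49", "50-99", "100-500",
                     "500-999", "1000-4999", "5000-9999", "10000+"
                 ]})
df["company_size"]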

I was initially a bit confused about why the boxplot was not shown for the number-of-records case in univariate (until we set the intent); then I realized that a boxplot doesn't make sense for a single ordinal attribute. I wonder if it makes sense to have a bivariate ordinal data type tab, i.e., ordinal with respect to all measure values, so that the boxplot could be shown in the initial view. Otherwise, it would appear that setting the intent doesn't change anything.
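A rough sketch of what such a tab would enumerate, assuming measures carry the "quantitative" data type: pair every ordinal column with every measure column, one candidate boxplot per pair. The function name `ordinal_measure_pairs` and the column-to-type mapping are hypothetical and only illustrate the idea.

```python
def ordinal_measure_pairs(data_types):
    """Pair each ordinal column with each quantitative column."""
    ordinals = [col for col, t in data_types.items() if t == "ordinal"]
    measures = [col for col, t in data_types.items() if t == "quantitative"]
    # Each (ordinal, measure) pair is one candidate boxplot.
    return [(o, m) for o in ordinals for m in measures]


# Hypothetical column types, loosely based on the columns used above.
pairs = ordinal_measure_pairs({
    "education_level": "ordinal",
    "company_size": "ordinal",
    "training_hours": "quantitative",
    "city": "nominal",
})
```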

dorisjlee avatar Apr 26 '21 03:04 dorisjlee