[ENH] adorn functions

Open thatlittleboy opened this issue 3 years ago • 6 comments

Brief Description

There are a few adorn_* functions from R's janitor that are not yet ported over to pyjanitor. Janitor docs here.

I'm specifically looking at:

  • adorn_totals: adds a totals row, a totals column, or both
  • adorn_percentages: converts the cell values into percentages, calculated along either axis or over the entire dataframe. In the R formulation, these are floats between 0 and 1, not the 0-100 percentages.
  • adorn_pct_formatting: formats the 0 to 1 values into the 0 to 100 percentage values, with rounding/formatting options
  • adorn_ns: adds the raw counts back into the cell values (meant to be run after adorn_percentages), so each cell has both percentage & count info, like "56 (24.3%)" for example.

I imagine these might be particularly useful for those doing data reporting. These should go into the functions module.
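
For anyone not familiar with the R functions, here is a rough sketch of what the percentages, formatting and counts steps amount to in plain pandas. The `counts` table and its labels are made up for illustration, and none of this is existing pyjanitor API.

```python
import pandas as pd

# Hypothetical counts table; the labels are made up for illustration.
counts = pd.DataFrame({"yes": [30, 10], "no": [20, 40]}, index=["a", "b"])

# adorn_percentages equivalent: row-wise fractions between 0 and 1.
fractions = counts.div(counts.sum(axis=1), axis=0)

# adorn_pct_formatting equivalent: scale to 0-100 and format to one decimal place.
formatted = fractions.apply(lambda col: col.map(lambda v: f"{v * 100:.1f}%"))

# adorn_ns equivalent: put the raw count next to each formatted percentage.
with_counts = pd.DataFrame(
    {
        col: [f"{pct} ({n})" for pct, n in zip(formatted[col], counts[col])]
        for col in counts.columns
    },
    index=counts.index,
)
print(with_counts)
```

Note that the last step needs the original `counts`, which is exactly the piece of state a chained pyjanitor method would not have access to after the percentages step.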

Example API

In pyjanitor, I don't think having four separate functions works (how would we enforce that adorn_ns comes after adorn_percentages? Where would we get the counts required for adorn_ns? etc.).

Perhaps we could just do an adorn_totals and an adorn_percentages (which encapsulates the behaviour of adorn_pct_formatting and adorn_ns as well, controlled via function parameters).

adorn_totals

This function should mirror the R function almost 1-1.

>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_totals(
...     subset=None,  # or a list of index/col names; preferably also ranges like `slice("col_a", "col_d")`, since `.loc` supports them
...     axis="col",  # index/0/row or column/1/col or both
...     fill_value="-",  # str
...     name="Total",  # str
... )
         a  b
0      6.0  x
1      NaN  y
2      2.5  z
Total  8.5  -

A few points where I disagree(?) with the R implementation:

  • I'm thinking that NaN values will be treated as 0 here by default, so totals won't be affected by the presence of NaN -> sum(1, NaN, 2.5) = 3.5. The R janitor function has an na.rm parameter for this, but I somehow feel this isn't necessary.
  • The where parameter, as defined by the R implementation, dictates whether to add a Totals "row" or "col", as opposed to which of "row"/"col" to sum over. In the latter case, where="row" would add a new column containing the Totals across the rows (which to me is more natural). I'm calling this parameter axis here btw.
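
To make the two points above concrete, here is a minimal sketch of the proposed semantics. `adorn_totals_sketch` is a hypothetical name, `subset` is left out for brevity, and this is only an illustration under those assumptions, not the eventual pyjanitor implementation.

```python
import numpy as np
import pandas as pd


def adorn_totals_sketch(df, axis="col", fill_value="-", name="Total"):
    """Hypothetical sketch (not pyjanitor API): return a copy of df with totals added."""
    out = df.copy()
    if axis in ("col", "both"):
        # Sum down each numeric column; DataFrame.sum skips NaN by default.
        col_totals = out.select_dtypes(include="number").sum(axis=0)
        # Non-numeric columns have no total, so they get fill_value instead.
        row = {c: col_totals.get(c, fill_value) for c in out.columns}
        out = pd.concat([out, pd.DataFrame([row], index=[name])])
    if axis in ("row", "both"):
        # Sum across each row and store the result in a new column.
        out[name] = out.select_dtypes(include="number").sum(axis=1)
    return out


df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")})
result = adorn_totals_sketch(df, axis="col")
print(result)
```

With `axis="col"` the sum runs down each column and lands in a new row; with `axis="row"` it runs across each row and lands in a new column, so an all-NaN row totals 0.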

adorn_percentages

TBD. Let me have a little think about this over the weekend, I decided against my own implementation idea while writing out the example API.. ><

Original idea
>>> df = pd.DataFrame({"a": [6, np.nan, 2.5], "b": list("xyz")}); df
     a  b
0  6.0  x
1  NaN  y
2  2.5  z
>>> df.adorn_percentages(
...     subset=None,  # similar to `adorn_totals`
...     axis='col',  # similar to `adorn_totals`
...     adorn_count=True,
...     count_position='front',  # ignored if adorn_count=False
...     count_format=0,  # ignored if adorn_count=False
...     percentage_format=2,
... )
            a  b
0  6 (70.59%)  x
1         nan  y
2  3 (29.41%)  z

Parameters:

  • count_position: where the count goes relative to the percentage; "front" gives "56 (23.4%)", "back" gives "23.4% (56)"
  • count_format / percentage_format: if an int, the number of decimal places to round to; otherwise a string format specification like ':,.2f' or whatever.
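
A rough sketch of how those parameters could be interpreted for a single cell. The helper names `format_number` and `format_cell` are hypothetical and exist only for this illustration.

```python
def format_number(value, spec):
    """Hypothetical helper: an int means decimal places, a str is a format spec."""
    if isinstance(spec, int):
        return f"{value:.{spec}f}"
    # Accept specs written with a leading colon, e.g. ':,.2f'.
    return format(value, spec.lstrip(":"))


def format_cell(count, fraction, count_position="front",
                count_format=0, percentage_format=2):
    """Hypothetical helper: combine a count and a 0-1 fraction into one string."""
    count_text = format_number(count, count_format)
    pct_text = format_number(fraction * 100, percentage_format) + "%"
    if count_position == "front":
        return f"{count_text} ({pct_text})"
    return f"{pct_text} ({count_text})"


print(format_cell(56, 0.234, count_position="front", percentage_format=1))
print(format_cell(56, 0.234, count_position="back", percentage_format=1))
print(format_number(1234.5, ":,.2f"))
```

One thing this surfaces: Python's format rounds half to even, so a count of 2.5 with `count_format=0` comes out as "2" rather than the "3" shown in the example output above. The rounding rule would need to be decided explicitly.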

I'm not that sold on this API yet. It doesn't look too clean / friendly to use. After all, it is an amalgamation of 3 different behaviours in 1 function 😅. Would be happy to hear comments / suggestions to improve, if any.

thatlittleboy • Jan 18 '22 16:01

@thatlittleboy your thoughts on encapsulation to enforce order sound like the right thing to do.

I'll admit I'm not so well-versed in the adorn_* family of functions in janitor, so I'll hold off on commenting on their specific behaviour. That said, I am in favour of adding janitor functionality to pyjanitor, and I'm also in favour of your way of thinking about how to organize the functions in a sane fashion too. :smile:

ericmjl • Jan 22 '22 10:01

Great, thanks for the affirmation @ericmjl . I'll have a think about the desired API and propose something in a PR when I'm ready. :)

thatlittleboy • Jan 26 '22 06:01