pingouin Pandas-friendly column names

This PR addresses #208 (part of the v0.6.0 Roadmap layed out in #279). It removes characters that restrict column access to the bracket format (df["p_val"]) rather than dot(?) method (df.p_val). It includes the following column-name changes:

Replaced dashes with underscores in column names (e.g., p-val --> p_val)
Replaced parentheses with underscores in column names (e.g., mean(A) --> mean_A)
Removed percent signs and brackets in column names (e.g., CI95% --> CI95, CI[97.5%] --> CI97.5)

Here is a specific list of most (if not all) of the changes:

p-adjust --> p_adjust
p-approx --> p_approx
p-corr --> p_corr
p-exact --> p_exact
p-mid --> p_mid
p-spher --> p_sphere
p-tukey --> p_tukey
p-unc --> p_unc
p-val --> p_val
U-val --> U_val
W-spher --> W_spher
W-val --> W_val
p-GG-corr --> p_GG_corr
cohen-d --> cohen_d
eta-square --> eta_square
odds-ratio --> odds_ratio
CI95% --> CI95
CI90% --> CI90
CI[2.5%] --> CI2.5
CI[97.5%] --> CI97.5
mean(A) --> mean_A
mean(B) --> mean_B
std(A) --> std_A
std(B) --> std_B

Future column-name considerations

In doing this, I noted some other column-name considerations for future PRs that feel less immediate:

Standardizing the use of "val". Modules differentially name p-value columns as p, pval, or p_val. This should be standardized across modules.
Related to (1), the norm for p-values might correspond with other "values", like the current column names T, F, W, r, etc. I think optimally this would either be pval and Wval/wval or p and W/w. Perhaps an argument for keeping "val" is that .T blocks the pandas transpose method.
Standardize some effect size namings across modules. Cohen's d is differentially referred to as cohen or cohen_d, and eta-squared is differentially referred to as eta_square, eta-squared, or n2.
Standardize the degrees of freedom column names. These are differentially named DF, dof, ddof.
Standardize abbreviation capitalizations. Bayes factors are BF, confidence intervals are CI, Wilcoxon stats are W, then there's dof and r.
Standardize Sentence case vs camelCase vs snake_case. Column capitalization, like Parametric vs alternative are inconsistent. I think snake_case is the pandas norm, but consistency seems most important.
I think including the numerical values in confidence interval column names is not necessary. Eg, CI95 could just be CI, and, more importantly, CI97.5 and CI2.5 could just be CI_upper and CI_lower, respectively.

Notes

The jupyter notebooks should run fine with the new code, but I did not re-run them, so the output of those tutorials still have the old column names. Let me know if you want me to rerun them before merging.
I updated the docstrings as needed, but did not build them locally to test. Let me know if you want me to build and inspect before merging.

Oct 07 '24 20:10 remrama

@remrama this is absolutely fantastic. Thank you for doing such a thorough investigation. My thoughts on what you described in the "Future column-name considerations":

Overall, I think we should merge the current PR and then implement most, if not all of the changes that you describe, and then release a new 0.6.0 version. The other items in the roadmap feel less important and could be included as part of the 6.x minor version updates.
Agreed for:
- using *val throughout
- use cohen_d and eta_squared (and replace np2 with partial_eta_squared?)
- use ddof to be consistent with pandas and scipy
- use snake_case throughout
- use ci, ci_upper, ci_lower throughout
- perhaps even worth adding a new documentation page with Pingouin's standard naming conventions...?

Laslty, yes to both your questions in the "Notes" section. :) I'll do a final review once you have re-generated the notebook. I had a brief look and I think it's good to go. I'm assuming you just did a search and replace of the relevant names in VSCode or similar editor, right?

Oct 08 '24 06:10 raphaelvallat

Great, thanks @raphaelvallat. I think the current PR is now ready for a quick review. I updated the notebooks and tested the docs locally (passed build and visual inspection). I'm a bit confused as to why the Python tests aren't running after a PR (Lint is, but not others). I ran tests locally and they passed, but I can't speak for alternate systems and Python versions.

And yes, correct, I used search/replace within VSCode. It was mostly iterations of regex to find all the things that needed replacing, and then more regex to simplify replacements. There were only a few snags, like the CI column namings and catching all the docstrings.

Happy to push for another PR that adds to the column-name conventions after this one :)

Oct 08 '24 14:10 remrama

Fix the Github Action CI:

Replace

name: Python tests

on:
  push:
    branches: [master, develop]
  pull_request:
    branches: [master, develop]

with:

name: Python tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

(Yes, I should have caught this before... 😬 )

Oct 08 '24 15:10 raphaelvallat

I'll wait for the unit tests and then do a final review and approval. Hopefully we don't get failing unit tests from previous PRs 🤦 Feel free to re-enable the CI in a separate PR, which we'll merge before this one

Oct 08 '24 15:10 raphaelvallat

Codecov Report

Attention: Patch coverage is 98.38710% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (main@c93ec4b). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/pingouin/correlation.py	80.00%	1 Missing :warning:

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #443   +/-   ##
=======================================
  Coverage        ?   98.54%           
=======================================
  Files           ?       19           
  Lines           ?     3360           
  Branches        ?      492           
=======================================
  Hits            ?     3311           
  Misses          ?       26           
  Partials        ?       23

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Oct 09 '24 14:10 codecov[bot]

@raphaelvallat all checks passed here, you should be good to review/merge. We should be good to go for the immediate column-name fixes, then we can set up a separate PR for more convention-specific namings.

I had a last-minute thought about this PR: It really doesn't impact at all how the user interfaces with Pingouin (only the output). But there is one exception with the compute_effsize function, where a few parameters won't work anymore. Passing in eta-square and odds-ratio now will not work, user needs to pass eta_square or odds_ratio. Maybe just leave it, but I was thinking we could add a simple "-".replace("_") and a FutureWarning. The only reason these were changed is because they end up as column names in some functions, so the other option would be to just name the columns something like effsize and not carry the actual stat name over.

Oct 09 '24 14:10 remrama

@raphaelvallat this PR (and #446) have been approved for a bit, I think they can be merged? Lmk if I am missing something I need to do. Otherwise I just want to make sure they don't fall behind and end up with conflicts.

Dec 06 '24 22:12 remrama

Thanks for the reminder! Merging now!

Dec 08 '24 12:12 raphaelvallat