Pandas-friendly column names
This PR addresses #208 (part of the v0.6.0 Roadmap layed out in #279). It removes characters that restrict column access to the bracket format (df["p_val"]) rather than dot(?) method (df.p_val). It includes the following column-name changes:
- Replaced dashes with underscores in column names (e.g.,
p-val-->p_val) - Replaced parentheses with underscores in column names (e.g.,
mean(A)-->mean_A) - Removed percent signs and brackets in column names (e.g.,
CI95%-->CI95,CI[97.5%]-->CI97.5)
Here is a specific list of most (if not all) of the changes:
p-adjust-->p_adjustp-approx-->p_approxp-corr-->p_corrp-exact-->p_exactp-mid-->p_midp-spher-->p_spherep-tukey-->p_tukeyp-unc-->p_uncp-val-->p_valU-val-->U_valW-spher-->W_spherW-val-->W_valp-GG-corr-->p_GG_corrcohen-d-->cohen_deta-square-->eta_squareodds-ratio-->odds_ratioCI95%-->CI95CI90%-->CI90CI[2.5%]-->CI2.5CI[97.5%]-->CI97.5mean(A)-->mean_Amean(B)-->mean_Bstd(A)-->std_Astd(B)-->std_B
Future column-name considerations
In doing this, I noted some other column-name considerations for future PRs that feel less immediate:
- Standardizing the use of "val". Modules differentially name p-value columns as
p,pval, orp_val. This should be standardized across modules. - Related to (1), the norm for p-values might correspond with other "values", like the current column names
T,F,W,r, etc. I think optimally this would either bepvalandWval/wvalorpandW/w. Perhaps an argument for keeping "val" is that.Tblocks the pandas transpose method. - Standardize some effect size namings across modules. Cohen's d is differentially referred to as
cohenorcohen_d, and eta-squared is differentially referred to aseta_square,eta-squared, orn2. - Standardize the degrees of freedom column names. These are differentially named
DF,dof,ddof. - Standardize abbreviation capitalizations. Bayes factors are
BF, confidence intervals areCI, Wilcoxon stats areW, then there'sdofandr. - Standardize Sentence case vs camelCase vs snake_case. Column capitalization, like
Parametricvsalternativeare inconsistent. I think snake_case is the pandas norm, but consistency seems most important. - I think including the numerical values in confidence interval column names is not necessary. Eg,
CI95could just beCI, and, more importantly,CI97.5andCI2.5could just beCI_upperandCI_lower, respectively.
Notes
- The jupyter notebooks should run fine with the new code, but I did not re-run them, so the output of those tutorials still have the old column names. Let me know if you want me to rerun them before merging.
- I updated the docstrings as needed, but did not build them locally to test. Let me know if you want me to build and inspect before merging.
@remrama this is absolutely fantastic. Thank you for doing such a thorough investigation. My thoughts on what you described in the "Future column-name considerations":
- Overall, I think we should merge the current PR and then implement most, if not all of the changes that you describe, and then release a new 0.6.0 version. The other items in the roadmap feel less important and could be included as part of the 6.x minor version updates.
- Agreed for:
- using
*valthroughout - use
cohen_dandeta_squared(and replacenp2withpartial_eta_squared?) - use
ddofto be consistent with pandas and scipy - use snake_case throughout
- use
ci,ci_upper,ci_lowerthroughout - perhaps even worth adding a new documentation page with Pingouin's standard naming conventions...?
- using
Laslty, yes to both your questions in the "Notes" section. :) I'll do a final review once you have re-generated the notebook. I had a brief look and I think it's good to go. I'm assuming you just did a search and replace of the relevant names in VSCode or similar editor, right?
Great, thanks @raphaelvallat. I think the current PR is now ready for a quick review. I updated the notebooks and tested the docs locally (passed build and visual inspection). I'm a bit confused as to why the Python tests aren't running after a PR (Lint is, but not others). I ran tests locally and they passed, but I can't speak for alternate systems and Python versions.
And yes, correct, I used search/replace within VSCode. It was mostly iterations of regex to find all the things that needed replacing, and then more regex to simplify replacements. There were only a few snags, like the CI column namings and catching all the docstrings.
Happy to push for another PR that adds to the column-name conventions after this one :)
Fix the Github Action CI:
Replace
name: Python tests
on:
push:
branches: [master, develop]
pull_request:
branches: [master, develop]
with:
name: Python tests
on:
push:
branches: [main]
pull_request:
branches: [main]
(Yes, I should have caught this before... 😬 )
I'll wait for the unit tests and then do a final review and approval. Hopefully we don't get failing unit tests from previous PRs 🤦 Feel free to re-enable the CI in a separate PR, which we'll merge before this one
Codecov Report
Attention: Patch coverage is 98.38710% with 1 line in your changes missing coverage. Please review.
Please upload report for BASE (
main@c93ec4b). Learn more about missing BASE report.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pingouin/correlation.py | 80.00% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #443 +/- ##
=======================================
Coverage ? 98.54%
=======================================
Files ? 19
Lines ? 3360
Branches ? 492
=======================================
Hits ? 3311
Misses ? 26
Partials ? 23
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@raphaelvallat all checks passed here, you should be good to review/merge. We should be good to go for the immediate column-name fixes, then we can set up a separate PR for more convention-specific namings.
I had a last-minute thought about this PR: It really doesn't impact at all how the user interfaces with Pingouin (only the output). But there is one exception with the compute_effsize function, where a few parameters won't work anymore. Passing in eta-square and odds-ratio now will not work, user needs to pass eta_square or odds_ratio. Maybe just leave it, but I was thinking we could add a simple "-".replace("_") and a FutureWarning. The only reason these were changed is because they end up as column names in some functions, so the other option would be to just name the columns something like effsize and not carry the actual stat name over.
@raphaelvallat this PR (and #446) have been approved for a bit, I think they can be merged? Lmk if I am missing something I need to do. Otherwise I just want to make sure they don't fall behind and end up with conflicts.
Thanks for the reminder! Merging now!