software-review cat2cat: Handling an Inconsistently Coded Categorical Variable in a Panel Dataset

Submitting Author Name: Maciej Nasinski
Submitting Author Github Handle: @polkas
Other Package Authors Github handles: (comma separated, delete if none)
Repository: https://github.com/Polkas/cat2cat
Submission type: Pre-submission
Language: en

Paste the full DESCRIPTION file inside a code block below:

```
Package: cat2cat
Title: Handling an Inconsistently Coded Categorical Variable in a Panel Dataset
Version: 0.4.5.9000
Authors@R: person("Maciej", "Nasinski", email = "[email protected]", role = c("aut", "cre"))
Maintainer: Maciej Nasinski <[email protected]>
Description: 
  Unifying of an inconsistently coded categorical variable between two different time points in accordance with a mapping table.
  The main rule is to replicate the observation if it could be assign to a few categories.
  Then using simple frequencies or modern statistical methods to approximate probabilities of being assign to each of them.
  This novel procedure was invented and implemented in the paper by (Nasinski, Majchrowska and Broniatowska (2020) <doi:10.24425/cejeme.2020.134747>).
Depends: R (>= 3.6)
License: GPL (>= 2)
URL: https://github.com/Polkas/cat2cat, https://polkas.github.io/cat2cat/
BugReports: https://github.com/Polkas/cat2cat/issues
Encoding: UTF-8
Imports:
    MASS
Suggests:
    caret,
    randomForest,
    knitr,
    rmarkdown,
    pacman,
    testthat (>= 3.0.0),
    magrittr,
    dplyr
LazyData: true
VignetteBuilder: knitr
RoxygenNote: 7.2.1
Config/testthat/edition: 3
```

Scope

Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

Data Lifecycle Packages
- [ ] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis

Statistical Packages
- [ ] Bayesian and Monte Carlo Routines
- [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
- [x] Machine Learning
- [x] Regression and Supervised Learning
- [ ] Exploratory Data Analysis (EDA) and Summary Statistics
- [ ] Spatial Analyses
- [x] Time Series Analyses

Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

The main objective is to unify inconsistently coded categorical variables in a panel/longitudinal dataset. Supervised methods can be used within the cat2cat procedure. The output of the cat2cat function can then be used, for example, in a weighted linear regression or to assess counts over time.
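
The replication idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the cat2cat API: the function names `replicate_with_frequency_weights` and `weighted_counts`, the toy codes, and the choice to take frequencies from the period that already uses the new encoding are all assumptions made for the example.

```python
from collections import Counter


def replicate_with_frequency_weights(old_codes, mapping, new_period_codes):
    """Replicate each old-period observation across its candidate new codes.

    old_codes: category codes observed in the old period.
    mapping: dict from an old code to the list of candidate new codes
        (a stand-in for the mapping table).
    new_period_codes: codes observed in the new period; their simple
        frequencies are used to approximate assignment probabilities
        (an assumption of this sketch).
    Returns a list of (observation index, new code, weight) rows.
    """
    freq = Counter(new_period_codes)
    rows = []
    for idx, old in enumerate(old_codes):
        candidates = mapping[old]
        counts = [freq[c] for c in candidates]
        total = sum(counts)
        for cand, cnt in zip(candidates, counts):
            # Fall back to equal weights when no candidate was observed.
            weight = cnt / total if total > 0 else 1 / len(candidates)
            rows.append((idx, cand, weight))
    return rows


def weighted_counts(rows):
    """Sum the weights per new code, e.g. to compare counts over time."""
    totals = {}
    for _, code, weight in rows:
        totals[code] = totals.get(code, 0.0) + weight
    return totals


# Hypothetical toy data: old code "A" was split into "A1" and "A2".
mapping = {"A": ["A1", "A2"], "B": ["B1"]}
old_period = ["A", "B", "A"]
new_period = ["A1", "A1", "A1", "A2", "B1"]

rows = replicate_with_frequency_weights(old_period, mapping, new_period)
counts = weighted_counts(rows)
```

Because the weights of each replicated observation sum to one, the weighted counts add up to the original number of observations, and the same weights could be passed to a weighted regression.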

If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package?

I plan to apply it once I know whether I can submit the package.

Who is the target audience and what are scientific applications of this package?

Any scientific field where a panel/longitudinal dataset can be used. Examples of panel datasets with such inconsistently coded categorical variables are ones linked with the International Standard Classification of Occupations (ISCO) and the International Classification of Diseases (ICD).

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

To the best of my knowledge, there is no alternative to my solution other than aggregating the datasets (with some simplifications) or removing the variable.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
Any other questions or issues we should be aware of?:

Dec 03 '22 17:12 Polkas

Thanks for the pre-submission enquiry @Polkas !

The editorial team is discussing and we'll get back to you shortly.

Dec 05 '22 10:12 annakrystalli

Dear @Polkas,

The editorial team has concluded that the package definitely fits in our "stats" scope.

Before proceeding and closing this pre-sub enquiry, we also need to clarify which category it would fit. The stats-devguide states that a category is appropriate where at least half of all its standards can be applied. We suggest you try to narrow it down to one category only.

We feel it does not best fit the "time series" category and initially seems most likely to be "Machine Learning". We suggest you spend a little time reading through the standards and consider which you think is most appropriate.

Following that, the best way to confirm would be to go through the formal process of documenting compliance with the stats standards, which needs to be done prior to submission anyway. You can call `@ropensci-review-bot check srr` in this issue to confirm documentation has been completed successfully. You can find more details in our documentation.

Just ping me here to confirm that's done and which category you have narrowed it down to, or if you need any help.

Thanks again for your enquiry!

Dec 06 '22 09:12 annakrystalli

Dear @Polkas,

Today starts my rotation as EiC, meaning the role of @annakrystalli is now mine. Did you have the chance to follow up on the comment above?

Feb 01 '23 14:02 maurolepore

:wave: @Polkas! I'm now the current editor in chief. Any update? :smile_cat:

May 02 '23 10:05 maelle

@Polkas friendly reminder, did you get a chance to work on the comments from https://github.com/ropensci/software-review/issues/562#issuecomment-1339049291?

May 23 '23 07:05 maelle

Hey, thank you for the update. I have already assessed which category and scope could apply to my package, and I found that the base requirements can be met. For now I cannot commit to any changes to the package, as I have submitted my paper to the SoftwareX journal and am waiting for their decision and comments.

May 23 '23 14:05 Polkas

@Polkas any update? :smile_cat:

Jul 21 '23 10:07 maelle

Hey, my paper was just published. I will start working on a new feature branch for a possible rOpenSci submission and will follow up here. Have a great day.

Sep 17 '23 13:09 Polkas

@Polkas I am currently serving as the EIC and am checking in on some older submissions. First, congrats on the publication! You mentioned that you might pursue another submission to rOpenSci. Have you decided to move forward with that?

Dec 15 '23 12:12 jhollist

Hey @jhollist, thank you for your response. I have dedicated effort to align with the expected standards. However, it appears that the current focus of rOpenSci may have shifted away from packages similar to mine.

I understand that rOpenSci is now prioritizing support for packages that facilitate reproducible research and manage the data lifecycle for scientists. I have thoroughly reviewed the current package categories and, unfortunately, it seems my package may not align with any of these categories.

If my understanding is correct and my package indeed falls outside the scope of rOpenSci's current focus, please feel free to close this issue.

Dec 17 '23 20:12 Polkas

@Polkas your package is a better fit for our Statistical Software. Based on the conversation above (https://github.com/ropensci/software-review/issues/562#issuecomment-1339049291), take a close look at https://stats-devguide.ropensci.org/pkgdev.html#scope and see if you think any of those fit. In the prior conversations here and amongst the editors, Machine Learning seemed to be the best fit. If you would like to proceed, take a close look at the Stats devguide. If you have specific questions after that, you can ping me again here. Thanks!

Dec 18 '23 15:12 jhollist

Hi @Polkas ! I'm checking in on submissions that have been sitting for a while. It sounds like the feedback has been that this package would be better suited for the rOpenSci Statistical Software submission. The process is similar, but there are a few differences. I'll once again plug the statistical submission guide:

https://stats-devguide.ropensci.org/

Let me know if you have any questions.

Feb 28 '24 16:02 ldecicco-USGS
