cat2cat: Handling an Inconsistently Coded Categorical Variable in a Panel Dataset

Open Polkas opened this issue 1 year ago • 12 comments

Submitting Author Name: Maciej Nasinski
Submitting Author Github Handle: @polkas
Other Package Authors Github handles: (comma separated, delete if none)
Repository: https://github.com/Polkas/cat2cat
Submission type: Pre-submission
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: cat2cat
Title: Handling an Inconsistently Coded Categorical Variable in a Panel Dataset
Version: 0.4.5.9000
Authors@R: person("Maciej", "Nasinski", email = "[email protected]", role = c("aut", "cre"))
Maintainer: Maciej Nasinski <[email protected]>
Description: 
  Unifying an inconsistently coded categorical variable between two different time points in accordance with a mapping table.
  The main rule is to replicate the observation if it could be assigned to a few categories.
  Then simple frequencies or modern statistical methods are used to approximate the probabilities of being assigned to each of them.
  This novel procedure was invented and implemented in the paper by (Nasinski, Majchrowska and Broniatowska (2020) <doi:10.24425/cejeme.2020.134747>).
Depends: R (>= 3.6)
License: GPL (>= 2)
URL: https://github.com/Polkas/cat2cat, https://polkas.github.io/cat2cat/
BugReports: https://github.com/Polkas/cat2cat/issues
Encoding: UTF-8
Imports:
    MASS
Suggests:
    caret,
    randomForest,
    knitr,
    rmarkdown,
    pacman,
    testthat (>= 3.0.0),
    magrittr,
    dplyr
LazyData: true
VignetteBuilder: knitr
RoxygenNote: 7.2.1
Config/testthat/edition: 3

Scope

  • Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

    Data Lifecycle Packages

    • [ ] data retrieval
    • [ ] data extraction
    • [ ] data munging
    • [ ] data deposition
    • [ ] data validation and testing
    • [ ] workflow automation
    • [ ] version control
    • [ ] citation management and bibliometrics
    • [ ] scientific software wrappers
    • [ ] field and lab reproducibility tools
    • [ ] database software bindings
    • [ ] geospatial data
    • [ ] text analysis

    Statistical Packages

    • [ ] Bayesian and Monte Carlo Routines
    • [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
    • [x] Machine Learning
    • [x] Regression and Supervised Learning
    • [ ] Exploratory Data Analysis (EDA) and Summary Statistics
    • [ ] Spatial Analyses
    • [x] Time Series Analyses
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

The main objective is to unify inconsistently coded categorical variables in a panel/longitudinal dataset. Supervised methods can be used within the cat2cat procedure. The output of the cat2cat function can be used, for example, in a weighted linear regression or to assess counts over time.

I plan to apply it once I know whether I can submit the package.
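
The replicate-and-weight rule from the package description can be illustrated with a minimal sketch. It is written in plain Python for illustration only and does not use the cat2cat API; the mapping table, the category codes, the sample data, and the helper `replicate_with_weights` are all hypothetical. Only the simple-frequency variant is shown, with weights taken from how often each candidate category occurs in the period that already uses the new coding.

```python
from collections import Counter

# Hypothetical mapping table: old code -> candidate new codes.
mapping = {"A": ["A1", "A2"], "B": ["B1"]}

# Hypothetical data: the old period uses old codes, the new period uses new codes.
old_period = [
    {"id": 1, "code": "A"},
    {"id": 2, "code": "B"},
    {"id": 3, "code": "A"},
]
new_period_codes = ["A1", "A1", "A1", "A2", "B1"]


def replicate_with_weights(rows, mapping, target_codes):
    """Replicate each row once per candidate new code and attach a
    frequency-based weight (an assumed, simplified version of the rule)."""
    freq = Counter(target_codes)
    out = []
    for row in rows:
        candidates = mapping[row["code"]]
        total = sum(freq[c] for c in candidates)
        for c in candidates:
            # Fall back to equal weights if no candidate occurs in the target period.
            w = freq[c] / total if total > 0 else 1 / len(candidates)
            out.append({**row, "new_code": c, "weight": w})
    return out


unified = replicate_with_weights(old_period, mapping, new_period_codes)

# Weighted counts per new category; the weights of each original
# observation sum to 1, so the total equals the number of observations.
weighted_counts = Counter()
for r in unified:
    weighted_counts[r["new_code"]] += r["weight"]

for r in unified:
    print(r)
print(dict(weighted_counts))
```

Because the weights of every replicated observation sum to one, the weighted counts keep the original sample size, which is what makes the replicated data usable in weighted analyses such as a weighted regression or counts over time.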

  • Who is the target audience and what are scientific applications of this package?

Any scientific field in which panel/longitudinal datasets are used. Examples of panel datasets with such inconsistently coded categorical variables are those linked to the International Standard Classification of Occupations (ISCO) and the International Classification of Diseases (ICD).

To the best of my knowledge, there is no alternative to my solution other than aggregating the datasets (with some simplifications) or removing the variable.

Polkas, Dec 03 '22 17:12