
leakr: detect and diagnose data leakage in machine learning workflows

Open cherylisabella opened this issue 1 month ago • 11 comments

Submitting Author Name: Cheryl Isabella
Submitting Author Github Handle: @cherylisabella
Repository: https://github.com/cherylisabella/leakr
Version submitted: 0.1.0
Submission type: Standard
Editor: TBD
Reviewers: TBD

Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: leakr
Type: Package
Title: Data Leakage Detection Tools for Machine Learning
Version: 0.1.0
Authors@R: person(given = c("Cheryl", "Isabella"), family = "Lim", role = c("aut", "cre"), email =
           "[email protected]")
Description: Provides utilities to detect common data leakage patterns including train/test
           contamination, temporal leakage, and data duplication, enhancing model reliability and
           reproducibility in machine learning workflows. Generates diagnostic reports and visual
           summaries to support data validation. Methods based on best practices from Hastie,
           Tibshirani, and Friedman (2009, ISBN:978-0387848570).
Imports: ggplot2, arrow, data.table, digest, htmltools, openxlsx, readxl, stringr, workflows, jsonlite
Suggests: testthat (>= 3.0.0), caret, mlr3, tidymodels, knitr, rmarkdown
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
LazyData: false
RoxygenNote: 7.3.3
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • [ ] data retrieval
    • [ ] data extraction
    • [ ] data munging
    • [ ] data deposition
    • [x] data validation and testing
    • [x] workflow automation
    • [ ] version control
    • [ ] citation management and bibliometrics
    • [ ] scientific software wrappers
    • [ ] field and lab reproducibility tools
    • [ ] database software bindings
    • [ ] geospatial data
    • [ ] translation

Statistical Packages

  • [ ] Bayesian and Monte Carlo Routines

  • [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning

  • [x] Machine Learning

  • [ ] Regression and Supervised Learning

  • [x] Exploratory Data Analysis (EDA) and Summary Statistics

  • [ ] Spatial Analyses

  • [ ] Time Series Analyses

  • [ ] Probability Distributions

  • Explain how and why the package falls under these categories (briefly, 1-2 sentences): leakr provides utilities to automatically detect common data‑leakage patterns (train/test contamination, target leakage, duplicate/near‑duplicate rows, temporal/data‑split leakage) in tabular data workflows. It generates diagnostic reports and visualizations to help users identify, evaluate, and correct leakage before model training.

  • Who is the target audience and what are scientific applications of this package? The target audience is data scientists, statisticians, machine‑learning practitioners, and researchers who build predictive models on tabular data. The package is especially useful for ensuring model validity, preventing overfitting due to data leakage, maintaining reproducibility, and auditing machine learning workflows in fields such as the social sciences, epidemiology, and economics, or in any data‑driven research that relies on predictive modeling.

  • Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? No. leakr addresses a very real problem in reproducible machine learning and data science workflows: data leakage. By providing a standardised toolkit to audit datasets and detect leakage early, it increases the reliability and transparency of analyses, which aligns well with rOpenSci’s mission of promoting reproducible, open data science. leakr is general (not tied to a single domain) and hence useful to researchers across fields such as social science, epidemiology, economics, and ML.

  • (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research? N/A.

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

  • Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • [x] Do you intend for this package to go on CRAN?

  • [ ] Do you intend for this package to go on Bioconductor?

  • [ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • [ ] The package is novel and will be of interest to the broad readership of the journal.
  • [ ] The manuscript describing the package is no longer than 3000 words.
  • [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

  • [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

cherylisabella avatar Dec 07 '25 17:12 cherylisabella

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot avatar Dec 07 '25 17:12 ropensci-review-bot

The submission template is missing the following values: author1, repourl

ropensci-review-bot avatar Dec 07 '25 17:12 ropensci-review-bot

:rocket:

Error: Issue template has no 'repourl'

:wave:

ropensci-review-bot avatar Dec 07 '25 17:12 ropensci-review-bot

@ropensci-review-bot check package

cherylisabella avatar Dec 08 '25 10:12 cherylisabella

I'm sorry @cherylisabella, I'm afraid I can't do that. That's something only editors, author1, author-others and reviewers-list are allowed to do.

ropensci-review-bot avatar Dec 08 '25 10:12 ropensci-review-bot

@ropensci-review-bot check package

mpadge avatar Dec 08 '25 11:12 mpadge

Thanks, about to send the query.

ropensci-review-bot avatar Dec 08 '25 11:12 ropensci-review-bot

:rocket:

Editor check started

:wave:

ropensci-review-bot avatar Dec 08 '25 11:12 ropensci-review-bot

Checks for leakr (v0.1.0)

git hash: 51d11185

  • :heavy_check_mark: Package is already on CRAN.
  • :heavy_multiplication_x: does not have a 'codemeta.json' file.
  • :heavy_check_mark: has a 'contributing' file.
  • :heavy_check_mark: uses 'roxygen2'.
  • :heavy_multiplication_x: 'DESCRIPTION' does not have a URL field.
  • :heavy_multiplication_x: 'DESCRIPTION' does not have a BugReports field.
  • :heavy_check_mark: Package has at least one HTML vignette
  • :heavy_multiplication_x: These functions do not have examples: [compile_report, format_detector_name, grapes-or-or-grapes, leakr_create_snapshot, leakr_export_data, leakr_from_caret, leakr_from_mlr3, leakr_from_tidymodels, leakr_import, leakr_list_snapshots, leakr_load_snapshot, leakr_plot, leakr_quick_import, new_temporal_detector, new_train_test_detector, plot.detector_result, plot.udld_report, register_detector, run_detector].
  • :heavy_multiplication_x: Continuous integration checks unavailable (no URL in 'DESCRIPTION').
  • :heavy_multiplication_x: Package coverage is 10.9% (should be at least 75%).
  • :heavy_check_mark: R CMD check found no errors.
  • :heavy_check_mark: R CMD check found no warnings.
  • :eyes: Some goodpractice linters failed.
  • :eyes: Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with :eyes: may be optionally addressed.)
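
Two of the failing checks above (missing URL and BugReports fields) are fixed by adding two lines to DESCRIPTION. The following is a minimal base R sketch of that edit, not the author's actual fix. The helper name `add_github_links` is hypothetical, and the BugReports value assumes the repository uses the default GitHub issue tracker at `<repo>/issues`.

```r
# Sketch: add URL and BugReports fields to a DESCRIPTION file with base R.
# `add_github_links` is a hypothetical helper, not part of leakr.
add_github_links <- function(desc_path, repo_url) {
  desc <- read.dcf(desc_path)  # one-row character matrix of fields
  # Drop any existing values so the fields are not duplicated.
  keep <- setdiff(colnames(desc), c("URL", "BugReports"))
  desc <- desc[, keep, drop = FALSE]
  desc <- cbind(desc,
                URL = repo_url,
                BugReports = paste0(repo_url, "/issues"))  # assumed tracker URL
  write.dcf(desc, file = desc_path)
  invisible(desc)
}

# Demonstration on a throwaway file rather than a real package.
tmp <- tempfile("DESCRIPTION")
writeLines(c("Package: leakr", "Version: 0.1.0"), tmp)
add_github_links(tmp, "https://github.com/cherylisabella/leakr")
fields <- read.dcf(tmp)
```

Note that `write.dcf()` may re-wrap long fields such as Description, so editing the two lines by hand is just as reasonable. Once a URL is present, the continuous integration check listed above should also be able to locate the repository.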

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

| type | package | ncalls |
|---|---|---|
| internal | base | 1086 |
| internal | utils | 165 |
| internal | leakr | 124 |
| internal | stats | 21 |
| internal | graphics | 7 |
| internal | methods | 3 |
| internal | grDevices | 2 |
| internal | tools | 1 |
| imports | ggplot2 | 26 |
| imports | jsonlite | 7 |
| imports | digest | 5 |
| imports | readxl | 2 |
| imports | workflows | 2 |
| imports | arrow | 1 |
| imports | data.table | 1 |
| imports | htmltools | NA |
| imports | openxlsx | NA |
| imports | stringr | NA |
| suggests | testthat | NA |
| suggests | caret | NA |
| suggests | mlr3 | NA |
| suggests | tidymodels | NA |
| suggests | knitr | NA |
| suggests | rmarkdown | NA |
| linking_to | NA | NA |

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

list (286), return (91), length (40), data.frame (39), for (37), nrow (33), c (32), is.na (29), if (28), names (28), sum (26), drop (21), sapply (18), split (18), format (17), sprintf (17), as.character (16), vapply (13), class (12), table (12), unique (12), which (11), max (10), as.numeric (9), min (9), character (8), seq_len (8), apply (7), det (7), numeric (7), file.path (6), ifelse (6), levels (6), mean (6), ncol (6), structure (6), Sys.time (6), abs (5), lapply (5), switch (5), unlist (5), as.Date (4), assign (4), logical (4), sample (4), source (4), sqrt (4), as.factor (3), as.matrix (3), dim (3), factor (3), intersect (3), paste (3), pretty (3), rbind (3), with (3), all (2), args (2), do.call (2), emptyenv (2), expand.grid (2), file (2), file.exists (2), file.info (2), grepl (2), inherits (2), is.factor (2), is.numeric (2), lengths (2), new.env (2), paste0 (2), range (2), rep (2), rowSums (2), setdiff (2), tempdir (2), tolower (2), which.max (2), as.POSIXct (1), asNamespace (1), attr (1), basename (1), col (1), colSums (1), diff (1), dirname (1), floor (1), get (1), is.character (1), is.finite (1), is.list (1), list.dirs (1), log2 (1), ls (1), mode (1), R.version.string (1), readLines (1), readRDS (1), round (1), scale (1), seq_along (1), sort (1), sub (1), suppressWarnings (1), union (1), vector (1)

utils

data (153), stack (3), head (2), modifyList (2), de (1), packageVersion (1), read.csv (1), read.delim (1), timestamp (1)

leakr

leakr_import (3), new_temporal_detector (3), analyse_cluster_duplicates (2), analyse_temporal_target_relationship (2), calculate_cluster_similarity (2), calculate_feature_importance (2), calculate_mutual_information (2), calculate_pairwise_similarity (2), calculate_row_similarity (2), clean_column_names (2), cluster_similar_pairs (2), compile_report (2), detect_aggregation_leakage (2), detect_cluster_based_duplicates (2), detect_correlation_leakage (2), detect_exact_duplicates (2), detect_feature_importance_leakage (2), detect_id_duplicates (2), detect_near_duplicates (2), detect_perfect_separation (2), detect_subset_duplicates (2), detect_temporal_target_leakage (2), determine_correlation_severity (2), determine_duplication_severity (2), determine_risk_level (2), export_data_internal (2), find_subset_relationships (2), find_time_columns (2), format_evidence (2), generate_recommendations (2), list_registered_detectors (2), new_detector (2), new_train_test_detector (2), perform_duplicate_clustering (2), prepare_audit_data (2), process_duplication (2), run_detector (2), run_detector.temporal_detector (2), analyse_target_distribution (1), create_detector (1), detect_and_convert_dates_enhanced (1), detect_duplication (1), detect_file_format (1), detect_target_leakage (1), detect_train_test_contamination (1), empty_snapshot_info (1), format_detector_name (1), format_lines (1), generate_diagnostic_plots (1), generate_evidence_section (1), generate_executive_summary_text (1), generate_issues_section (1), generate_recommendations_section (1), get_detector (1), get_detector_info (1), import_csv (1), import_excel (1), import_json (1), import_parquet (1), import_rds (1), import_tsv (1), is_subset_row (1), leakr_audit (1), leakr_create_snapshot (1), leakr_export_data (1), leakr_from_caret (1), leakr_from_mlr3 (1), leakr_from_tidymodels (1), leakr_list_snapshots (1), leakr_load_snapshot (1), leakr_plot (1), leakr_quick_import (1), leakr_summarise (1), plot.detector_result (1), 
plot.udld_report (1), preprocess_imported_data (1), print.leakr_detector (1), print.leakr_report (1), register_detector (1), run_detector.default (1), run_detectors (1), stratified_sample (1), test_aggregation_pattern (1), test_perfect_separation (1)

ggplot2

ggplot (6), aes (5), theme (4), element_text (3), geom_bar (3), labs (3), theme_minimal (2)

stats

df (6), lm (3), sd (3), complete.cases (2), kmeans (2), median (2), chisq.test (1), cor.test (1), quantile (1)

graphics

lines (3), title (3), text (1)

jsonlite

fromJSON (4), toJSON (3)

digest

digest (5)

methods

is (3)

grDevices

palette (2)

readxl

excel_sheets (1), read_excel (1)

workflows

pull_workflow_preprocessor (1), pull_workflow_spec (1)

arrow

read_parquet (1)

data.table

fread (1)

tools

file_ext (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.
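
The note above concerns htmltools, openxlsx, and stringr, which are declared in Imports but show no tallied calls. A rough way to double-check this locally is to search the R sources for `pkg::` references. The sketch below is an assumption-laden text search, not how pkgstats tags calls: the helper name `unused_imports` is hypothetical, and it will miss packages brought in through `importFrom()` directives in NAMESPACE.

```r
# Sketch: list declared Imports that never appear as `pkg::` in R/ sources.
# `unused_imports` is a hypothetical helper. A plain text search cannot see
# NAMESPACE imports, so treat a hit as a prompt to check, not a verdict.
unused_imports <- function(pkg_dir) {
  desc <- read.dcf(file.path(pkg_dir, "DESCRIPTION"), fields = "Imports")
  if (is.na(desc[1, "Imports"])) return(character())
  imports <- trimws(strsplit(desc[1, "Imports"], ",")[[1]])
  imports <- sub("\\s*\\(.*\\)$", "", imports)  # drop version constraints
  r_files <- list.files(file.path(pkg_dir, "R"), pattern = "\\.[Rr]$",
                        full.names = TRUE)
  code <- unlist(lapply(r_files, readLines, warn = FALSE))
  used <- vapply(imports, function(p) {
    any(grepl(paste0(p, "::"), code, fixed = TRUE))
  }, logical(1))
  imports[!used]
}

# Demonstration on a throwaway package skeleton.
pkg <- file.path(tempdir(), "demo_pkg")
dir.create(file.path(pkg, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Package: demo",
             "Imports: digest, stringr, jsonlite (>= 1.8.0)"),
           file.path(pkg, "DESCRIPTION"))
writeLines(c("f <- function(x) digest::digest(x)",
             "g <- function(x) jsonlite::toJSON(x)"),
           file.path(pkg, "R", "demo.R"))
unused <- unused_imports(pkg)
```

Packages flagged this way can either be removed from Imports or moved to Suggests if they are only needed for optional features.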


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties

The package has:

  • code in R (100% in 14 files) and
  • 1 authors
  • 3 vignettes
  • no internal data file
  • 10 imported packages
  • 50 exported functions (median 21 lines of code)
  • 111 non-exported functions in R (median 33 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function.

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

| measure | value | percentile | noteworthy |
|---|---|---|---|
| files_R | 11 | 60.1 | |
| files_inst | 4 | 96.3 | |
| files_vignettes | 3 | 89.3 | |
| files_tests | 6 | 77.8 | |
| loc_R | 1521 | 75.5 | |
| loc_inst | 297 | 57.9 | |
| loc_vignettes | 713 | 83.9 | |
| loc_tests | 141 | 42.4 | |
| num_vignettes | 3 | 90.9 | |
| n_fns_r | 161 | 84.4 | |
| n_fns_r_exported | 50 | 87.1 | |
| n_fns_r_not_exported | 111 | 83.4 | |
| n_fns_per_file_r | 6 | 77.5 | |
| num_params_per_fn | 2 | 9.4 | |
| loc_per_fn_r | 29 | 75.5 | |
| loc_per_fn_r_exp | 22 | 50.7 | |
| loc_per_fn_r_not_exp | 33 | 81.3 | |
| rel_whitespace_R | 14 | 69.1 | |
| rel_whitespace_inst | 19 | 58.3 | |
| rel_whitespace_vignettes | 36 | 87.2 | |
| rel_whitespace_tests | 22 | 42.0 | |
| doclines_per_fn_exp | 12 | 5.4 | |
| doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
| fn_call_network_size | 82 | 74.1 | |


3. goodpractice and other checks

Details of goodpractice checks


3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following notes:

  1. checking for hidden files and directories ... NOTE Found the following hidden files and directories: .github These were most likely included in error. See section ‘Package structure’ in the ‘Writing R Extensions’ manual.
  2. checking DESCRIPTION meta-information ... NOTE License stub is invalid DCF.

R CMD check generated the following check_fails:

  1. description_url
  2. description_bugreports
  3. rcmdcheck_hidden_files_and_directories
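
The two R CMD check notes above have conventional remedies: a `MIT + file LICENSE` package is expected to ship a LICENSE stub in DCF form with `YEAR` and `COPYRIGHT HOLDER` fields, and hidden directories such as `.github` are normally excluded from the build via `.Rbuildignore`. The sketch below shows one way to verify both with base R. The helper names `license_stub_ok` and `ignore_in_build` are hypothetical, and the year and holder values in the demonstration are placeholders.

```r
# Sketch: check that a LICENSE stub parses as DCF and has the two fields
# used by the MIT template. `license_stub_ok` is a hypothetical helper.
license_stub_ok <- function(path) {
  stub <- tryCatch(read.dcf(path), error = function(e) NULL)
  !is.null(stub) && all(c("YEAR", "COPYRIGHT HOLDER") %in% colnames(stub))
}

# Sketch: append a pattern to .Rbuildignore unless it is already listed.
# `ignore_in_build` is a hypothetical helper.
ignore_in_build <- function(pkg_dir, pattern = "^\\.github$") {
  path <- file.path(pkg_dir, ".Rbuildignore")
  current <- if (file.exists(path)) readLines(path, warn = FALSE) else character()
  if (!pattern %in% current) writeLines(c(current, pattern), path)
  invisible(path)
}

# Demonstration on throwaway files (placeholder year and holder).
demo_dir <- file.path(tempdir(), "license_demo")
dir.create(demo_dir, showWarnings = FALSE)
good <- file.path(demo_dir, "LICENSE_good")
bad <- file.path(demo_dir, "LICENSE_bad")
writeLines(c("YEAR: 2025", "COPYRIGHT HOLDER: leakr authors"), good)
writeLines(c("MIT License", "", "Copyright (c) 2025 leakr authors"), bad)
good_ok <- license_stub_ok(good)
bad_ok <- license_stub_ok(bad)

ignore_in_build(demo_dir)
ignore_in_build(demo_dir)  # second call leaves the file unchanged
ignore_lines <- readLines(file.path(demo_dir, ".Rbuildignore"))
```

The full MIT text, if wanted in the repository, is usually kept in a separate `LICENSE.md` that is itself listed in `.Rbuildignore`.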

Test coverage with covr

Package coverage: 10.9

The following files are not completely covered by tests:

| file | coverage |
|---|---|
| R/core.R | 0% |
| R/io.R | 0% |
| R/leakR.R | 0% |
| R/pkg-detector.R | 0% |
| R/plot.R | 0% |
| R/report.R | 0% |
| R/viz.R | 0% |
| R/zzz.R | 0% |

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

| function | cyclocomplexity |
|---|---|
| preprocess_imported_data | 31 |
| detect_and_convert_dates_enhanced | 30 |
| run_detector.train_test_detector | 23 |
| run_detectors | 22 |
| prepare_audit_data | 21 |
| run_detector.temporal_detector | 20 |

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks

:heavy_multiplication_x: The following function name is duplicated in other packages:

    • %||% from infix, hset, formatters, fuj, arkhe, iNZightTools, arcgisutils, examly, powerbrmsINLA, rlang
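
For context, `%||%` is the conventional null-coalescing operator, which is why so many packages define it. leakr's own definition is not shown in this thread, so the sketch below uses the usual form as an assumption; the `opts` list and its fields are hypothetical.

```r
# Conventional null-coalescing operator: return `y` only when `x` is NULL.
# This is the usual definition, assumed here; leakr's version may differ.
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical options list for illustration.
opts <- list(threshold = 0.9)
threshold <- opts$threshold %||% 0.5   # value present, so it is kept
max_rows <- opts$max_rows %||% 1000    # element missing, default is used
```

Base R provides the same operator from version 4.4.0 onwards, so one option is to keep the helper internal rather than exporting it.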

Package Versions

| package | version |
|---|---|
| pkgstats | 0.2.0.93 |
| pkgcheck | 0.1.2.241 |

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

ropensci-review-bot avatar Dec 08 '25 11:12 ropensci-review-bot

@cherylisabella That shows a bit of work for you to address before we proceed. It'll likely help you to use https://github.com/ropensci-review-tools/pkgcheck-action to generate the same pkgcheck report in your own repos on each push. Once you're getting the all-clear there, feel free to ask the bot to check package here to confirm. Thanks!

mpadge avatar Dec 08 '25 12:12 mpadge

@mpadge thank you again! I'll get started on those now :)

cherylisabella avatar Dec 08 '25 13:12 cherylisabella

Please see details for closing at https://github.com/ropensci/software-review/issues/733#issuecomment-3665652325

jhollist avatar Dec 17 '25 14:12 jhollist