leakr: detect and diagnose data leakage in machine learning workflows

Submitting Author Name: Cheryl Isabella
Submitting Author Github Handle: @cherylisabella
Repository: https://github.com/cherylisabella/leakr
Version submitted: 0.1.0
Submission type: Standard
Editor: TBD
Reviewers: TBD

Archive: TBD
Version accepted: TBD
Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: leakr
Type: Package
Title: Data Leakage Detection Tools for Machine Learning
Version: 0.1.0
Authors@R: person(given = c("Cheryl", "Isabella"), family = "Lim", role = c("aut", "cre"), email =
           "[email protected]")
Description: Provides utilities to detect common data leakage patterns including train/test
           contamination, temporal leakage, and data duplication, enhancing model reliability and
           reproducibility in machine learning workflows. Generates diagnostic reports and visual
           summaries to support data validation. Methods based on best practices from Hastie,
           Tibshirani, and Friedman (2009, ISBN:978-0387848570).
Imports: ggplot2, arrow, data.table, digest, htmltools, openxlsx, readxl, stringr, workflows, jsonlite
Suggests: testthat (>= 3.0.0), caret, mlr3, tidymodels, knitr, rmarkdown
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
LazyData: false
RoxygenNote: 7.3.3
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [x] data validation and testing
- [x] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] translation

Statistical Packages

- [ ] Bayesian and Monte Carlo Routines
- [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
- [x] Machine Learning
- [ ] Regression and Supervised Learning
- [x] Exploratory Data Analysis (EDA) and Summary Statistics
- [ ] Spatial Analyses
- [ ] Time Series Analyses
- [ ] Probability Distributions

Explain how and why the package falls under these categories (briefly, 1-2 sentences): leakr provides utilities to automatically detect common data-leakage patterns (train/test contamination, target leakage, duplicate/near-duplicate rows, temporal/data-split leakage) in tabular data workflows. It generates diagnostic reports and visualizations to help users identify, evaluate, and correct leakage before model training.
Who is the target audience and what are scientific applications of this package? Data scientists, statisticians, machine-learning practitioners, and researchers working with predictive models on tabular data. It is especially useful for anyone who wants to ensure model validity, prevent overfitting caused by data leakage, maintain reproducibility, and audit machine learning workflows in fields such as the social sciences, epidemiology, economics, or any other data-driven research that involves predictive modeling.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? No. leakr addresses a very real problem in reproducible machine learning and data science workflows: data leakage. By providing a standardised toolkit to audit datasets and detect leakage early, it increases the reliability and transparency of analyses, which aligns well with rOpenSci's mission of promoting reproducible, open data science. leakr is general (not tied to a single domain) and hence useful to researchers across fields such as social science, epidemiology, economics, and ML.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research? N/A.
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

- [x] I have read the rOpenSci packaging guide.
- [x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.

This package:

- [x] does not violate the Terms of Service of any service it interacts with.
- [x] has a CRAN and OSI accepted license.
- [x] contains a README with instructions for installing the development version.
- [x] includes documentation with examples for all functions, created with roxygen2.
- [x] contains a vignette with examples of its essential functions and uses.
- [x] has a test suite.
- [x] has continuous integration, including reporting of test coverage.

Publication options

- [x] Do you intend for this package to go on CRAN?
- [ ] Do you intend for this package to go on Bioconductor?
- [ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
(Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
(Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
(Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

- [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Dec 07 '25 17:12 cherylisabella

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

Dec 07 '25 17:12 ropensci-review-bot

The submission template is missing the following values: author1, repourl

Dec 07 '25 17:12 ropensci-review-bot

:rocket:

Error: Issue template has no 'repourl'

:wave:

Dec 07 '25 17:12 ropensci-review-bot

@ropensci-review-bot check package

Dec 08 '25 10:12 cherylisabella

I'm sorry @cherylisabella, I'm afraid I can't do that. That's something only editors, author1, author-others and reviewers-list are allowed to do.

Dec 08 '25 10:12 ropensci-review-bot

@ropensci-review-bot check package

Dec 08 '25 11:12 mpadge

Thanks, about to send the query.

Dec 08 '25 11:12 ropensci-review-bot

:rocket:

Editor check started

:wave:

Dec 08 '25 11:12 ropensci-review-bot

Checks for leakr (v0.1.0)

git hash: 51d11185

:heavy_check_mark: Package is already on CRAN.
:heavy_multiplication_x: does not have a 'codemeta.json' file.
:heavy_check_mark: has a 'contributing' file.
:heavy_check_mark: uses 'roxygen2'.
:heavy_multiplication_x: 'DESCRIPTION' does not have a URL field.
:heavy_multiplication_x: 'DESCRIPTION' does not have a BugReports field.
:heavy_check_mark: Package has at least one HTML vignette
:heavy_multiplication_x: These functions do not have examples: [compile_report, format_detector_name, grapes-or-or-grapes, leakr_create_snapshot, leakr_export_data, leakr_from_caret, leakr_from_mlr3, leakr_from_tidymodels, leakr_import, leakr_list_snapshots, leakr_load_snapshot, leakr_plot, leakr_quick_import, new_temporal_detector, new_train_test_detector, plot.detector_result, plot.udld_report, register_detector, run_detector].
:heavy_multiplication_x: Continuous integration checks unavailable (no URL in 'DESCRIPTION').
:heavy_multiplication_x: Package coverage is 10.9% (should be at least 75%).
:heavy_check_mark: R CMD check found no errors.
:heavy_check_mark: R CMD check found no warnings.
:eyes: Some goodpractice linters failed.
:eyes: Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with :eyes: may be optionally addressed.)

Package License: MIT + file LICENSE

1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type	package	ncalls
internal	base	1086
internal	utils	165
internal	leakr	124
internal	stats	21
internal	graphics	7
internal	methods	3
internal	grDevices	2
internal	tools	1
imports	ggplot2	26
imports	jsonlite	7
imports	digest	5
imports	readxl	2
imports	workflows	2
imports	arrow	1
imports	data.table	1
imports	htmltools	NA
imports	openxlsx	NA
imports	stringr	NA
suggests	testthat	NA
suggests	caret	NA
suggests	mlr3	NA
suggests	tidymodels	NA
suggests	knitr	NA
suggests	rmarkdown	NA
linking_to	NA	NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

list (286), return (91), length (40), data.frame (39), for (37), nrow (33), c (32), is.na (29), if (28), names (28), sum (26), drop (21), sapply (18), split (18), format (17), sprintf (17), as.character (16), vapply (13), class (12), table (12), unique (12), which (11), max (10), as.numeric (9), min (9), character (8), seq_len (8), apply (7), det (7), numeric (7), file.path (6), ifelse (6), levels (6), mean (6), ncol (6), structure (6), Sys.time (6), abs (5), lapply (5), switch (5), unlist (5), as.Date (4), assign (4), logical (4), sample (4), source (4), sqrt (4), as.factor (3), as.matrix (3), dim (3), factor (3), intersect (3), paste (3), pretty (3), rbind (3), with (3), all (2), args (2), do.call (2), emptyenv (2), expand.grid (2), file (2), file.exists (2), file.info (2), grepl (2), inherits (2), is.factor (2), is.numeric (2), lengths (2), new.env (2), paste0 (2), range (2), rep (2), rowSums (2), setdiff (2), tempdir (2), tolower (2), which.max (2), as.POSIXct (1), asNamespace (1), attr (1), basename (1), col (1), colSums (1), diff (1), dirname (1), floor (1), get (1), is.character (1), is.finite (1), is.list (1), list.dirs (1), log2 (1), ls (1), mode (1), R.version.string (1), readLines (1), readRDS (1), round (1), scale (1), seq_along (1), sort (1), sub (1), suppressWarnings (1), union (1), vector (1)

utils

data (153), stack (3), head (2), modifyList (2), de (1), packageVersion (1), read.csv (1), read.delim (1), timestamp (1)

leakr

leakr_import (3), new_temporal_detector (3), analyse_cluster_duplicates (2), analyse_temporal_target_relationship (2), calculate_cluster_similarity (2), calculate_feature_importance (2), calculate_mutual_information (2), calculate_pairwise_similarity (2), calculate_row_similarity (2), clean_column_names (2), cluster_similar_pairs (2), compile_report (2), detect_aggregation_leakage (2), detect_cluster_based_duplicates (2), detect_correlation_leakage (2), detect_exact_duplicates (2), detect_feature_importance_leakage (2), detect_id_duplicates (2), detect_near_duplicates (2), detect_perfect_separation (2), detect_subset_duplicates (2), detect_temporal_target_leakage (2), determine_correlation_severity (2), determine_duplication_severity (2), determine_risk_level (2), export_data_internal (2), find_subset_relationships (2), find_time_columns (2), format_evidence (2), generate_recommendations (2), list_registered_detectors (2), new_detector (2), new_train_test_detector (2), perform_duplicate_clustering (2), prepare_audit_data (2), process_duplication (2), run_detector (2), run_detector.temporal_detector (2), analyse_target_distribution (1), create_detector (1), detect_and_convert_dates_enhanced (1), detect_duplication (1), detect_file_format (1), detect_target_leakage (1), detect_train_test_contamination (1), empty_snapshot_info (1), format_detector_name (1), format_lines (1), generate_diagnostic_plots (1), generate_evidence_section (1), generate_executive_summary_text (1), generate_issues_section (1), generate_recommendations_section (1), get_detector (1), get_detector_info (1), import_csv (1), import_excel (1), import_json (1), import_parquet (1), import_rds (1), import_tsv (1), is_subset_row (1), leakr_audit (1), leakr_create_snapshot (1), leakr_export_data (1), leakr_from_caret (1), leakr_from_mlr3 (1), leakr_from_tidymodels (1), leakr_list_snapshots (1), leakr_load_snapshot (1), leakr_plot (1), leakr_quick_import (1), leakr_summarise (1), plot.detector_result (1), plot.udld_report (1), preprocess_imported_data (1), print.leakr_detector (1), print.leakr_report (1), register_detector (1), run_detector.default (1), run_detectors (1), stratified_sample (1), test_aggregation_pattern (1), test_perfect_separation (1)

ggplot2

ggplot (6), aes (5), theme (4), element_text (3), geom_bar (3), labs (3), theme_minimal (2)

stats

df (6), lm (3), sd (3), complete.cases (2), kmeans (2), median (2), chisq.test (1), cor.test (1), quantile (1)

graphics

lines (3), title (3), text (1)

jsonlite

fromJSON (4), toJSON (3)

digest

digest (5)

methods

is (3)

grDevices

palette (2)

readxl

excel_sheets (1), read_excel (1)

workflows

pull_workflow_preprocessor (1), pull_workflow_spec (1)

arrow

read_parquet (1)

data.table

fread (1)

tools

file_ext (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.
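
One rough way to follow up on this note is to search the package sources for namespaced calls to each import. The sketch below is illustrative only: the `src` lines are made-up stand-ins for the contents of the package's R files, and a plain text search of this kind will miss packages that are used through `@importFrom` or similar mechanisms.

```r
# Three imports flagged with NA above, plus ggplot2 for comparison.
imports <- c("ggplot2", "htmltools", "openxlsx", "stringr")

# Made-up stand-in for source lines. In a real check these might come from
# unlist(lapply(list.files("R", full.names = TRUE), readLines)).
src <- c(
  "p <- ggplot2::ggplot(df, ggplot2::aes(x, y))",
  "name <- tolower(trimws(name))"
)

# TRUE when at least one source line contains a `pkg::` call.
used <- vapply(
  imports,
  function(pkg) any(grepl(paste0(pkg, "::"), src, fixed = TRUE)),
  logical(1)
)

# Imports with no namespaced call in the scanned lines.
unused <- names(used)[!used]
print(unused)
```

Imports that turn out to be unused can be dropped from DESCRIPTION, or moved to Suggests and guarded with `requireNamespace()` if they are only needed for optional features.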

2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

code in R (100% in 14 files) and
1 authors
3 vignettes
no internal data file
10 imported packages
50 exported functions (median 21 lines of code)
111 non-exported functions in R (median 33 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages

The following terminology is used:

loc = "Lines of Code"
fn = "function"
exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function.

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure	value	percentile	noteworthy
files_R	11	60.1
files_inst	4	96.3
files_vignettes	3	89.3
files_tests	6	77.8
loc_R	1521	75.5
loc_inst	297	57.9
loc_vignettes	713	83.9
loc_tests	141	42.4
num_vignettes	3	90.9
n_fns_r	161	84.4
n_fns_r_exported	50	87.1
n_fns_r_not_exported	111	83.4
n_fns_per_file_r	6	77.5
num_params_per_fn	2	9.4
loc_per_fn_r	29	75.5
loc_per_fn_r_exp	22	50.7
loc_per_fn_r_not_exp	33	81.3
rel_whitespace_R	14	69.1
rel_whitespace_inst	19	58.3
rel_whitespace_vignettes	36	87.2
rel_whitespace_tests	22	42.0
doclines_per_fn_exp	12	5.4
doclines_per_fn_not_exp	0	0.0	TRUE
fn_call_network_size	82	74.1
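
The one value flagged as noteworthy in this table is doclines_per_fn_not_exp = 0: the non-exported functions carry no documentation lines. One common convention, sketched below, is to give internal helpers roxygen comments tagged `@noRd`, so they are documented in the source without generating help pages. `normalise_names()` is a hypothetical helper invented for this illustration, not a function from leakr, and the sketch does not verify how pkgstats counts such comments.

```r
#' Normalise column names (internal helper)
#'
#' @param x Character vector of column names.
#' @return Lower-case names in which each run of non-alphanumeric
#'   characters is replaced by a single underscore.
#' @noRd
normalise_names <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)   # collapse separators into "_"
  gsub("^_+|_+$", "", x)            # strip leading/trailing underscores
}

cleaned <- normalise_names(c(" Patient ID ", "Visit-Date"))
print(cleaned)
```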

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

3. `goodpractice` and other checks

Details of goodpractice checks (click to open)

3b. `goodpractice` results

`R CMD check` with rcmdcheck

R CMD check generated the following notes:

checking for hidden files and directories ... NOTE
Found the following hidden files and directories:
  .github
These were most likely included in error. See section ‘Package structure’ in the ‘Writing R Extensions’ manual.

checking DESCRIPTION meta-information ... NOTE
License stub is invalid DCF.
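
The "License stub is invalid DCF" note typically appears when the LICENSE file named in `License: MIT + file LICENSE` is not a short `field: value` stub. The sketch below writes a two-field stub to a temporary file and confirms that it parses with `read.dcf()`. The year and copyright holder shown are placeholders assumed for illustration, not values taken from the leakr repository.

```r
# Write a minimal MIT licence stub to a temporary file.
# YEAR and COPYRIGHT HOLDER values are placeholders (assumptions).
stub_path <- tempfile("LICENSE")
writeLines(
  c("YEAR: 2025",
    "COPYRIGHT HOLDER: leakr authors"),
  stub_path
)

# A valid stub parses as DCF; a file holding the full licence text
# generally does not.
stub <- read.dcf(stub_path)
print(stub)
```

If the full MIT text should also ship with the repository, a common arrangement is to keep it in a separate LICENSE.md that is excluded from the build.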

R CMD check generated the following check_fails:

description_url
description_bugreports
rcmdcheck_hidden_files_and_directories
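
The three check_fails above usually come down to two extra DESCRIPTION fields and one `.Rbuildignore` entry. The sketch below works on a throwaway two-field DESCRIPTION in a temporary directory. The repository URL is the one given in this submission; the `/issues` address is an assumption based on the usual GitHub layout. In practice the two fields are normally added by hand rather than by rewriting the file.

```r
repo <- "https://github.com/cherylisabella/leakr"

# Throwaway DESCRIPTION for illustration; not the package's real file.
desc_path <- file.path(tempdir(), "DESCRIPTION")
writeLines(c("Package: leakr", "Version: 0.1.0"), desc_path)

# Append the two missing fields and write the file back as DCF.
desc <- read.dcf(desc_path)
desc <- cbind(desc, URL = repo, BugReports = paste0(repo, "/issues"))
write.dcf(desc, desc_path)

updated <- read.dcf(desc_path)
print(updated[1, c("URL", "BugReports")])

# .Rbuildignore holds one regular expression per line. This pattern
# excludes the top-level .github directory from the built package.
ignore_pattern <- "^\\.github$"
ignored <- grepl(ignore_pattern, c(".github", "R", "github.R"), perl = TRUE)
print(ignored)
```

With a URL field in place, the "Continuous integration checks unavailable" item in the summary should also be able to resolve, since it is reported as depending on that field.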

Test coverage with covr

Package coverage: 10.9

The following files are not completely covered by tests:

file	coverage
R/core.R	0%
R/io.R	0%
R/leakR.R	0%
R/pkg-detector.R	0%
R/plot.R	0%
R/report.R	0%
R/viz.R	0%
R/zzz.R	0%
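
Raising coverage from 10.9% towards the 75% threshold mostly means adding small, deterministic tests for each file listed above. The sketch below shows the general shape of such a test for a train/test overlap check. `find_train_test_overlap()` is a hypothetical stand-in written for this example and does not reproduce leakr's detectors; the testthat call is guarded in case the package is not installed.

```r
# Hypothetical stand-in for a contamination detector (not leakr's code):
# returns the indices of test rows that also appear in the training set.
find_train_test_overlap <- function(train, test) {
  # Build one key per row by pasting all columns together.
  row_key <- function(df) do.call(paste, c(df, sep = "\r"))
  which(row_key(test) %in% row_key(train))
}

train <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
test  <- data.frame(x = c(3, 4),    y = c("c", "d"))

overlap <- find_train_test_overlap(train, test)
print(overlap)

# The same expectation written as a testthat test, as it might appear
# in a file under tests/testthat/.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("rows shared between train and test are flagged", {
    testthat::expect_equal(overlap, 1L)
    testthat::expect_length(find_train_test_overlap(train, train[0, ]), 0)
  })
}
```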

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function	cyclocomplexity
preprocess_imported_data	31
detect_and_convert_dates_enhanced	30
run_detector.train_test_detector	23
run_detectors	22
prepare_audit_data	21
run_detector.temporal_detector	20
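
High cyclomatic complexity often comes from long chains of `if`/`else` branches. One common way to flatten them is a lookup table of handlers, sketched below for file import. This is an illustration of the technique only: `readers` and `read_any()` are hypothetical names, and the code is not taken from `preprocess_imported_data()` or any other leakr function.

```r
# One handler per supported extension; adding a format means adding a
# list entry rather than another branch.
readers <- list(
  csv = function(path) utils::read.csv(path),
  tsv = function(path) utils::read.delim(path),
  rds = function(path) readRDS(path)
)

read_any <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (!ext %in% names(readers)) {
    stop("Unsupported file extension: '", ext, "'", call. = FALSE)
  }
  readers[[ext]](path)
}

# Round-trip a small data frame through a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(id = 1:3, y = c(0, 1, 0)), tmp, row.names = FALSE)
imported <- read_any(tmp)
print(imported)
```

Each handler can then be tested on its own, which also helps with the coverage figure reported above.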

Static code analyses with lintr

lintr found no issues with this package!

4. Other Checks

Details of other checks (click to open)

:heavy_multiplication_x: The following function name is duplicated in other packages:

- %||% from infix, hset, formatters, fuj, arkhe, iNZightTools, arcgisutils, examly, powerbrmsINLA, rlang
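
`%||%` also appears in the list of functions without examples (as grapes-or-or-grapes). The sketch below shows one way to document the operator with a roxygen `@examples` block. The definition is the conventional null-default form and is an assumption; leakr's own implementation may differ.

```r
#' Default value for NULL
#'
#' @param x Value to return when it is not NULL.
#' @param y Fallback returned when `x` is NULL.
#' @return `x`, or `y` if `x` is NULL.
#' @examples
#' NULL %||% "fallback"
#' "value" %||% "fallback"
#' @export
`%||%` <- function(x, y) if (is.null(x)) y else x

# Typical use: fill in defaults for missing list entries.
opts <- list(threshold = 0.9)
threshold <- opts$threshold %||% 0.5
method <- opts$method %||% "exact"
print(list(threshold = threshold, method = method))
```

Since this check is marked as optional, another route is to keep the operator internal (no `@export`), which may clear both this item and the missing-example item. R 4.4.0 and later also provide `%||%` in base.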

Package Versions

package	version
pkgstats	0.2.0.93
pkgcheck	0.1.2.241

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

Dec 08 '25 11:12 ropensci-review-bot

@cherylisabella That shows a bit of work for you to address before we proceed. It'll likely help you to use https://github.com/ropensci-review-tools/pkgcheck-action to generate the same pkgcheck report in your own repos on each push. Once you're getting the all-clear there, feel free to ask the bot to check package here to confirm. Thanks!

Dec 08 '25 12:12 mpadge

@mpadge thank you again! I'll get started on those now :)

Dec 08 '25 13:12 cherylisabella

Please see details for closing at https://github.com/ropensci/software-review/issues/733#issuecomment-3665652325

Dec 17 '25 14:12 jhollist
