leakr: detect and diagnose data leakage in machine learning workflows
Submitting Author Name: Cheryl Isabella
Submitting Author Github Handle: @cherylisabella
Repository: https://github.com/cherylisabella/leakr
Version submitted: 0.1.0
Submission type: Standard
Editor: TBD
Reviewers: TBD
Archive: TBD
Version accepted: TBD
Language: en
- Paste the full DESCRIPTION file inside a code block below:
Package: leakr
Type: Package
Title: Data Leakage Detection Tools for Machine Learning
Version: 0.1.0
Authors@R: person(given = c("Cheryl", "Isabella"), family = "Lim", role = c("aut", "cre"), email =
"[email protected]")
Description: Provides utilities to detect common data leakage patterns including train/test
contamination, temporal leakage, and data duplication, enhancing model reliability and
reproducibility in machine learning workflows. Generates diagnostic reports and visual
summaries to support data validation. Methods based on best practices from Hastie,
Tibshirani, and Friedman (2009, ISBN:978-0387848570).
Imports: ggplot2, arrow, data.table, digest, htmltools, openxlsx, readxl, stringr, workflows, jsonlite
Suggests: testthat (>= 3.0.0), caret, mlr3, tidymodels, knitr, rmarkdown
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
LazyData: false
RoxygenNote: 7.3.3
VignetteBuilder: knitr
Scope
- Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [x] data validation and testing
- [x] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] translation
Statistical Packages
- [ ] Bayesian and Monte Carlo Routines
- [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
- [x] Machine Learning
- [ ] Regression and Supervised Learning
- [x] Exploratory Data Analysis (EDA) and Summary Statistics
- [ ] Spatial Analyses
- [ ] Time Series Analyses
- [ ] Probability Distributions
- Explain how and why the package falls under these categories (briefly, 1-2 sentences):
leakr provides utilities to automatically detect common data‑leakage patterns (train/test contamination, target leakage, duplicate/near‑duplicate rows, temporal/data‑split leakage) in tabular data workflows. It generates diagnostic reports and visualizations to help users identify, evaluate, and correct leakage before model training.
- Who is the target audience and what are scientific applications of this package?
Data scientists, statisticians, machine‑learning practitioners and researchers working with predictive models on tabular data. It is especially useful for anyone interested in ensuring model validity, preventing overfitting due to data leakage, maintaining reproducibility, and auditing machine learning workflows in fields such as the social sciences, epidemiology, economics, or any data‑driven research that involves predictive modeling.
- Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
No. leakr addresses a very real problem in reproducible machine learning and data science workflows: data leakage. By providing a standardised toolkit to audit datasets and detect leakage early, it increases the reliability and transparency of analyses, which aligns well with rOpenSci’s mission of promoting reproducible, open data science. leakr is general (not tied to a single domain) and hence useful to researchers across fields such as social science, epidemiology, economics, and ML.
- (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
N/A.
- If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
- Explain reasons for any pkgcheck items which your package is unable to pass.
Technical checks
Confirm each of the following by checking the box.
- [x] I have read the rOpenSci packaging guide.
- [x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.
This package:
- [x] does not violate the Terms of Service of any service it interacts with.
- [x] has a CRAN and OSI accepted license.
- [x] contains a README with instructions for installing the development version.
- [x] includes documentation with examples for all functions, created with roxygen2.
- [x] contains a vignette with examples of its essential functions and uses.
- [x] has a test suite.
- [x] has continuous integration, including reporting of test coverage.
Publication options
- [x] Do you intend for this package to go on CRAN?
- [ ] Do you intend for this package to go on Bioconductor?
- [ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)
Code of conduct
- [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.
The submission template is missing the following values: author1, repourl
:rocket:
Error: Issue template has no 'repourl'
:wave:
@ropensci-review-bot check package
I'm sorry @cherylisabella, I'm afraid I can't do that. That's something only editors, author1, author-others and reviewers-list are allowed to do.
@ropensci-review-bot check package
Thanks, about to send the query.
:rocket:
Editor check started
:wave:
Checks for leakr (v0.1.0)
git hash: 51d11185
- :heavy_check_mark: Package is already on CRAN.
- :heavy_multiplication_x: does not have a 'codemeta.json' file.
- :heavy_check_mark: has a 'contributing' file.
- :heavy_check_mark: uses 'roxygen2'.
- :heavy_multiplication_x: 'DESCRIPTION' does not have a URL field.
- :heavy_multiplication_x: 'DESCRIPTION' does not have a BugReports field.
- :heavy_check_mark: Package has at least one HTML vignette
- :heavy_multiplication_x: These functions do not have examples: [compile_report, format_detector_name, grapes-or-or-grapes, leakr_create_snapshot, leakr_export_data, leakr_from_caret, leakr_from_mlr3, leakr_from_tidymodels, leakr_import, leakr_list_snapshots, leakr_load_snapshot, leakr_plot, leakr_quick_import, new_temporal_detector, new_train_test_detector, plot.detector_result, plot.udld_report, register_detector, run_detector].
- :heavy_multiplication_x: Continuous integration checks unavailable (no URL in 'DESCRIPTION').
- :heavy_multiplication_x: Package coverage is 10.9% (should be at least 75%).
- :heavy_check_mark: R CMD check found no errors.
- :heavy_check_mark: R CMD check found no warnings.
- :eyes: Some goodpractice linters failed.
- :eyes: Function names are duplicated in other packages
Important: All failing checks above must be addressed prior to proceeding
(Checks marked with :eyes: may be optionally addressed.)
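Three of the failing items above (no URL field, no BugReports field, and continuous integration checks unavailable because there is no URL) come down to two missing fields in DESCRIPTION. A minimal sketch of the additions follows. The URL is the repository given in the submission; the BugReports value assumes issues are tracked at the conventional GitHub issues path for that repository, which is an assumption and not something stated in the package.

```
URL: https://github.com/cherylisabella/leakr
BugReports: https://github.com/cherylisabella/leakr/issues
```

With a URL field present, the continuous integration check should at least be able to locate the repository; whether it then passes depends on the workflows configured there.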
Package License: MIT + file LICENSE
1. Package Dependencies
Details of Package Dependency Usage (click to open)
The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.
| type | package | ncalls |
|---|---|---|
| internal | base | 1086 |
| internal | utils | 165 |
| internal | leakr | 124 |
| internal | stats | 21 |
| internal | graphics | 7 |
| internal | methods | 3 |
| internal | grDevices | 2 |
| internal | tools | 1 |
| imports | ggplot2 | 26 |
| imports | jsonlite | 7 |
| imports | digest | 5 |
| imports | readxl | 2 |
| imports | workflows | 2 |
| imports | arrow | 1 |
| imports | data.table | 1 |
| imports | htmltools | NA |
| imports | openxlsx | NA |
| imports | stringr | NA |
| suggests | testthat | NA |
| suggests | caret | NA |
| suggests | mlr3 | NA |
| suggests | tidymodels | NA |
| suggests | knitr | NA |
| suggests | rmarkdown | NA |
| linking_to | NA | NA |
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.
base
list (286), return (91), length (40), data.frame (39), for (37), nrow (33), c (32), is.na (29), if (28), names (28), sum (26), drop (21), sapply (18), split (18), format (17), sprintf (17), as.character (16), vapply (13), class (12), table (12), unique (12), which (11), max (10), as.numeric (9), min (9), character (8), seq_len (8), apply (7), det (7), numeric (7), file.path (6), ifelse (6), levels (6), mean (6), ncol (6), structure (6), Sys.time (6), abs (5), lapply (5), switch (5), unlist (5), as.Date (4), assign (4), logical (4), sample (4), source (4), sqrt (4), as.factor (3), as.matrix (3), dim (3), factor (3), intersect (3), paste (3), pretty (3), rbind (3), with (3), all (2), args (2), do.call (2), emptyenv (2), expand.grid (2), file (2), file.exists (2), file.info (2), grepl (2), inherits (2), is.factor (2), is.numeric (2), lengths (2), new.env (2), paste0 (2), range (2), rep (2), rowSums (2), setdiff (2), tempdir (2), tolower (2), which.max (2), as.POSIXct (1), asNamespace (1), attr (1), basename (1), col (1), colSums (1), diff (1), dirname (1), floor (1), get (1), is.character (1), is.finite (1), is.list (1), list.dirs (1), log2 (1), ls (1), mode (1), R.version.string (1), readLines (1), readRDS (1), round (1), scale (1), seq_along (1), sort (1), sub (1), suppressWarnings (1), union (1), vector (1)
utils
data (153), stack (3), head (2), modifyList (2), de (1), packageVersion (1), read.csv (1), read.delim (1), timestamp (1)
leakr
leakr_import (3), new_temporal_detector (3), analyse_cluster_duplicates (2), analyse_temporal_target_relationship (2), calculate_cluster_similarity (2), calculate_feature_importance (2), calculate_mutual_information (2), calculate_pairwise_similarity (2), calculate_row_similarity (2), clean_column_names (2), cluster_similar_pairs (2), compile_report (2), detect_aggregation_leakage (2), detect_cluster_based_duplicates (2), detect_correlation_leakage (2), detect_exact_duplicates (2), detect_feature_importance_leakage (2), detect_id_duplicates (2), detect_near_duplicates (2), detect_perfect_separation (2), detect_subset_duplicates (2), detect_temporal_target_leakage (2), determine_correlation_severity (2), determine_duplication_severity (2), determine_risk_level (2), export_data_internal (2), find_subset_relationships (2), find_time_columns (2), format_evidence (2), generate_recommendations (2), list_registered_detectors (2), new_detector (2), new_train_test_detector (2), perform_duplicate_clustering (2), prepare_audit_data (2), process_duplication (2), run_detector (2), run_detector.temporal_detector (2), analyse_target_distribution (1), create_detector (1), detect_and_convert_dates_enhanced (1), detect_duplication (1), detect_file_format (1), detect_target_leakage (1), detect_train_test_contamination (1), empty_snapshot_info (1), format_detector_name (1), format_lines (1), generate_diagnostic_plots (1), generate_evidence_section (1), generate_executive_summary_text (1), generate_issues_section (1), generate_recommendations_section (1), get_detector (1), get_detector_info (1), import_csv (1), import_excel (1), import_json (1), import_parquet (1), import_rds (1), import_tsv (1), is_subset_row (1), leakr_audit (1), leakr_create_snapshot (1), leakr_export_data (1), leakr_from_caret (1), leakr_from_mlr3 (1), leakr_from_tidymodels (1), leakr_list_snapshots (1), leakr_load_snapshot (1), leakr_plot (1), leakr_quick_import (1), leakr_summarise (1), plot.detector_result (1), 
plot.udld_report (1), preprocess_imported_data (1), print.leakr_detector (1), print.leakr_report (1), register_detector (1), run_detector.default (1), run_detectors (1), stratified_sample (1), test_aggregation_pattern (1), test_perfect_separation (1)
ggplot2
ggplot (6), aes (5), theme (4), element_text (3), geom_bar (3), labs (3), theme_minimal (2)
stats
df (6), lm (3), sd (3), complete.cases (2), kmeans (2), median (2), chisq.test (1), cor.test (1), quantile (1)
graphics
lines (3), title (3), text (1)
jsonlite
fromJSON (4), toJSON (3)
digest
digest (5)
methods
is (3)
grDevices
palette (2)
readxl
excel_sheets (1), read_excel (1)
workflows
pull_workflow_preprocessor (1), pull_workflow_spec (1)
arrow
read_parquet (1)
data.table
fread (1)
tools
file_ext (1)
NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.
2. Statistical Properties
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
Details of statistical properties (click to open)
The package has:
- code in R (100% in 14 files) and
- 1 authors
- 3 vignettes
- no internal data file
- 10 imported packages
- 50 exported functions (median 21 lines of code)
- 111 non-exported functions in R (median 33 lines of code)
Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:
- loc = "Lines of Code"
- fn = "function"
- exp/not_exp = exported / not exported
All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function.
The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.
| measure | value | percentile | noteworthy |
|---|---|---|---|
| files_R | 11 | 60.1 | |
| files_inst | 4 | 96.3 | |
| files_vignettes | 3 | 89.3 | |
| files_tests | 6 | 77.8 | |
| loc_R | 1521 | 75.5 | |
| loc_inst | 297 | 57.9 | |
| loc_vignettes | 713 | 83.9 | |
| loc_tests | 141 | 42.4 | |
| num_vignettes | 3 | 90.9 | |
| n_fns_r | 161 | 84.4 | |
| n_fns_r_exported | 50 | 87.1 | |
| n_fns_r_not_exported | 111 | 83.4 | |
| n_fns_per_file_r | 6 | 77.5 | |
| num_params_per_fn | 2 | 9.4 | |
| loc_per_fn_r | 29 | 75.5 | |
| loc_per_fn_r_exp | 22 | 50.7 | |
| loc_per_fn_r_not_exp | 33 | 81.3 | |
| rel_whitespace_R | 14 | 69.1 | |
| rel_whitespace_inst | 19 | 58.3 | |
| rel_whitespace_vignettes | 36 | 87.2 | |
| rel_whitespace_tests | 22 | 42.0 | |
| doclines_per_fn_exp | 12 | 5.4 | |
| doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
| fn_call_network_size | 82 | 74.1 | |
2a. Network visualisation
Click to see the interactive network visualisation of calls between objects in package
3. goodpractice and other checks
Details of goodpractice checks (click to open)
3b. goodpractice results
R CMD check with rcmdcheck
R CMD check generated the following notes:
- checking for hidden files and directories ... NOTE Found the following hidden files and directories: .github These were most likely included in error. See section ‘Package structure’ in the ‘Writing R Extensions’ manual.
- checking DESCRIPTION meta-information ... NOTE License stub is invalid DCF.
R CMD check generated the following check_fails:
- description_url
- description_bugreports
- rcmdcheck_hidden_files_and_directories
Test coverage with covr
Package coverage: 10.9
The following files are not completely covered by tests:
| file | coverage |
|---|---|
| R/core.R | 0% |
| R/io.R | 0% |
| R/leakR.R | 0% |
| R/pkg-detector.R | 0% |
| R/plot.R | 0% |
| R/report.R | 0% |
| R/viz.R | 0% |
| R/zzz.R | 0% |
Cyclocomplexity with cyclocomp
The following functions have cyclocomplexity >= 15:
| function | cyclocomplexity |
|---|---|
| preprocess_imported_data | 31 |
| detect_and_convert_dates_enhanced | 30 |
| run_detector.train_test_detector | 23 |
| run_detectors | 22 |
| prepare_audit_data | 21 |
| run_detector.temporal_detector | 20 |
Static code analyses with lintr
lintr found no issues with this package!
4. Other Checks
Details of other checks (click to open)
:heavy_multiplication_x: The following function name is duplicated in other packages:
- %||% from infix, hset, formatters, fuj, arkhe, iNZightTools, arcgisutils, examly, powerbrmsINLA, rlang
Package Versions
| package | version |
|---|---|
| pkgstats | 0.2.0.93 |
| pkgcheck | 0.1.2.241 |
Editor-in-Chief Instructions:
Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.
@cherylisabella That shows a bit of work for you to address before we proceed. It'll likely help you to use https://github.com/ropensci-review-tools/pkgcheck-action to generate the same pkgcheck report in your own repos on each push. Once you're getting the all-clear there, feel free to ask the bot to check package here to confirm. Thanks!
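For reference, a minimal workflow sketch for the pkgcheck-action linked above. It assumes the default branch is named main and that the file is saved as .github/workflows/pkgcheck.yaml; both are assumptions to adjust for the repository. The issues: write permission is included on the understanding that the action posts its results to an issue by default; the action's README is the authority on its current inputs and required permissions.

```yaml
# Hypothetical file location: .github/workflows/pkgcheck.yaml
name: pkgcheck

on:
  # Allow manual runs from the Actions tab
  workflow_dispatch:
  # Run on every push to the default branch (assumed here to be "main")
  push:
    branches:
      - main

jobs:
  pkgcheck:
    runs-on: ubuntu-latest
    permissions:
      # Assumed to be needed so the action can post its report as an issue
      issues: write
    steps:
      - uses: ropensci-review-tools/pkgcheck-action@main
```

Running this on each push produces the same report as the bot, so the failing items (examples, coverage, DESCRIPTION fields, codemeta.json) can be cleared before asking for a re-check here.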
@mpadge thank you again! I'll get started on those now :)
Please see details for closing at https://github.com/ropensci/software-review/issues/733#issuecomment-3665652325