# Build framework for synthetic experiments
Following Charles's notes:
- [x] Provide access to UCI datasets
- [x] Add functions to randomly sample from the set of valid patch objects:
  - [x] permute
  - [x] shift & scale
  - [x] recode
  - [x] insert & delete
  - [x] break
- [x] Function to string together patch samplers to return a composed patch (see the sketch after this list)
- [x] ~~Function to standardise a patch object (or to test for standard form)~~
- [x] Define data structures for experiment config & results
- [ ] Properly document the `synthetic_experiment` class (both before & after execution)
- [ ] Implement test harness for running experiments:
  - [x] handle batch jobs (multiple datasets & corruptions)
  - [ ] exclude non-column-wise unique corruptions from the experiment config
- [ ] Implement evaluation metric(s):
  - [x] false positive rate
  - [x] fidelity metrics:
    - [x] precision & recall
    - [x] parameter accuracy:
      - [x] Hamming distance for permutations
      - [x] column accuracy
      - [x] parameter RMSE
      - [x] parameter accuracy
  - [ ] robustness metric
- [ ] Add the formula by which each metric is defined to the man page of the corresponding metric calculation function
- [ ] Write functions to summarise/visualise results of synthetic experiments:
  - [x] generate table of aggregate performance metrics
- [x] Run experiments for the default `ddiff` implementation
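As a rough illustration of the composition point above, the sketch below draws one elementary patch from each sampler in a list and folds them into a single corruption. The sampler names follow the `sample_patch_xxx` pattern described in the configuration section, and `compose_patch` is an assumed name for a composition helper; this is illustrative only, not the package's actual implementation.

```r
## Illustrative sketch only: draw one elementary patch from each sampler and
## compose them into a single corruption patch for the data frame `df`.
## `compose_patch` and the sample_patch_* names are assumptions here.
sample_composed_patch <- function(df, samplers) {
  patches <- lapply(samplers, function(sampler) sampler(df))
  # Fold into one composed patch; the application order depends on the
  # composition convention of compose_patch.
  Reduce(compose_patch, patches)
}

## e.g. corruption <- sample_composed_patch(df, list(sample_patch_permute, sample_patch_recode))
```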
## Synthetic experiment configuration
Parameters:
- data frame identifier (e.g. source & dataset name)
- candidate datadiff function
- corruption parameters, consisting of a list of `sample_patch_xxx` closures appropriate for the data
- number of runs
- vector of data frame split ratios
- random seed
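To make the shape of this object concrete, here is a minimal sketch of a configuration as a plain R list. The field names are illustrative only (not the actual `synthetic_experiment` structure); `ddiff` is the candidate function mentioned above, and the sampler names follow the `sample_patch_xxx` pattern.

```r
## Hypothetical experiment configuration as a plain list; field names are
## illustrative, not the package's synthetic_experiment structure.
config <- list(
  data_id     = "uci/iris",                    # data frame identifier (source & dataset name)
  candidate   = ddiff,                         # candidate datadiff function
  corruptions = list(sample_patch_permute,     # sample_patch_xxx closures appropriate
                     sample_patch_recode),     # for the chosen data
  n_runs      = 20,                            # number of runs
  splits      = c(0.5, 0.75),                  # data frame split ratios
  seed        = 147                            # random seed
)
```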
Output:
- copy of all parameters
- list of realised corruption patches (or list of lists)
- list of result patches returned by the candidate datadiff function (or list of lists)
- execution times
Evaluation function takes:
- experiment output
- list of evaluation metrics (as functions)
Results:
- copy of all parameters
- results of all evaluation metrics
- execution times
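A sketch of what this evaluation step might look like, assuming the experiment output carries its parameters, corruptions, results and timings under the (hypothetical) field names used here:

```r
## Hypothetical sketch: apply each metric function to every
## (corruption, result) pair and return the results alongside a copy of the
## parameters and the execution times. Field names are assumptions.
evaluate_experiment <- function(output, metrics) {
  metric_results <- lapply(metrics, function(metric) {
    Map(metric, output$corruptions, output$results)
  })
  list(
    config  = output$config,    # copy of all parameters
    metrics = metric_results,   # results of all evaluation metrics
    timings = output$timings    # execution times
  )
}
```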
## Fidelity metrics

### Patch standard form
It may be easier/quicker to test for the particular "standard" characteristics necessary for a given operation, rather than to have a general function for standardising an arbitrary patch. There are a lot of possible combinations of patches, so testing a `standardise_patch` function might be time-consuming and would provide less certainty of correctness.

For instance, in `metric_column_accuracy` we can easily check that both the corruption and the result patches contain at most one component patch of a given type and column index. This will certainly be the case if both corruption & result are in standard form, but it is a less strict condition, which is easier to check and sufficient for the correct calculation of the column accuracy metric. We do this in the function `is_columnwise_unique`.
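A rough sketch of that check, assuming a composed patch can be broken into elementary components and that each component exposes its type and parameters. The helper names `decompose_patch`, `patch_type` and `get_patch_params` are assumptions, and this is not the package's actual `is_columnwise_unique`.

```r
## Illustrative sketch only: a patch is column-wise unique if no (type, column)
## pair occurs in more than one of its elementary components.
is_columnwise_unique_sketch <- function(patch) {
  components <- decompose_patch(patch)          # elementary component patches
  keys <- unlist(lapply(components, function(p) {
    cols <- get_patch_params(p)[["cols"]]
    if (is.null(cols)) character(0) else paste(patch_type(p), cols)
  }))
  !any(duplicated(keys))                        # TRUE iff every (type, column) pair is unique
}
```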
### Column accuracy

Some care is required in order to correctly implement this metric. At first glance it appears to be a simple case of comparing the column index parameter `cols` in the corruption with that in the datadiff result. However, when the corruption and/or the result is a composed patch, the column indices may have been rearranged by a preceding permutation patch, or shifted by a preceding insert or delete patch. So we must take into account the entire composition, up to the patch of interest, in order to identify the relevant column index. We do this in the function `initial_column_position`.
This issue affects the formula for the column accuracy metric, which, as currently expressed, assumes that the corruption and result are not compositions. Given the preceding point, we cannot simply focus on a particular component of interest and ignore the rest of the composition.
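To make that bookkeeping concrete, the sketch below walks backwards through the components that precede the patch of interest, undoing the effect of each permutation, insertion or deletion on a column index. It is illustrative only (not the package's `initial_column_position`); it reuses the assumed helpers from the previous sketch, and the parameter names `perm` and `insertion_point` are also assumptions.

```r
## Illustrative sketch: map the column index seen by component number `index`
## of a composed patch back to its position in the uncorrupted data, by
## undoing the preceding permute/insert/delete components in reverse order.
initial_column_position_sketch <- function(patch, index, col) {
  preceding <- decompose_patch(patch)[seq_len(index - 1)]
  for (p in rev(preceding)) {
    params <- get_patch_params(p)
    switch(patch_type(p),
      permute = {
        # Assume perm[j] is the original index of the column now at position j.
        col <- params[["perm"]][col]
      },
      insert = {
        # Columns after the insertion point were shifted one place to the right.
        if (col > params[["insertion_point"]]) col <- col - 1
      },
      delete = {
        # Columns at or after a deleted column were shifted one place to the left.
        for (d in sort(params[["cols"]])) if (col >= d) col <- col + 1
      }
    )
  }
  col
}
```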
There is another issue: a corruption may contain two (or more) elementary patches of the same type applied to different columns. In that case we could define column accuracy, for a given corruption `p_gold` and result `p_result`, either as a binary value (giving credit only if `p_result` transforms exactly the same columns as `p_gold`) or as a fraction (giving partial credit in case of partial overlap). The `pairwise_column_accuracy` function implements both possibilities, with a logical `partial` flag to select which one to compute. Note that, in the case of partial credit, the formula for the pairwise column accuracy between a given `p_gold` and `p_result` is:
```
                            #{common columns}
---------------------------------------------------------------------------
 max(#{columns transformed by p_gold}, #{columns transformed by p_result})
```
This ensures that partial credit is not awarded undeservedly (e.g. if `p_gold` transforms only a few columns but `p_result` transforms all columns, only minimal partial credit will be awarded).
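A direct transcription of this formula in base R, assuming the transformed columns of each patch are available as integer vectors (a sketch of the arithmetic only, not the package's `pairwise_column_accuracy`):

```r
## Sketch of the formula above. cols_gold / cols_result are the column indices
## transformed by p_gold / p_result, after mapping back to initial positions.
## `partial` switches between fractional and all-or-nothing credit.
pairwise_column_accuracy_sketch <- function(cols_gold, cols_result, partial = FALSE) {
  if (partial) {
    length(intersect(cols_gold, cols_result)) /
      max(length(unique(cols_gold)), length(unique(cols_result)))
  } else {
    as.numeric(setequal(cols_gold, cols_result))
  }
}

## Example: p_gold transforms columns 2 and 5; p_result transforms 2, 5 and 7.
pairwise_column_accuracy_sketch(c(2, 5), c(2, 5, 7), partial = TRUE)  # 2/3
pairwise_column_accuracy_sketch(c(2, 5), c(2, 5, 7))                  # 0
```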