# Build framework for synthetic experiments
Following Charles's notes:
- [x] Provide access to UCI datasets
- [x] Add functions to randomly sample from the set of valid patch objects:
  - [x] permute
  - [x] shift & scale
  - [x] recode
  - [x] insert & delete
  - [x] break
- [x] Function to string together patch samplers to return a composed patch (see the sketch after this list)
- [x] ~~Function to standardise a patch object (or to test for standard form)~~
- [x] Define data structures for experiment config & results
- [ ] Properly document the `synthetic_experiment` class (both before & after execution)
- [ ] Implement test harness for running experiments:
  - [x] handle batch jobs (multiple datasets & corruptions)
  - [ ] exclude non-column-wise unique corruptions from the experiment config
- [ ] Implement evaluation metric(s):
  - [x] false positive rate
  - [x] fidelity metrics:
    - [x] precision & recall
    - [x] parameter accuracy:
      - [x] Hamming distance for permutations
      - [x] column accuracy
      - [x] parameter RMSE
      - [x] parameter accuracy
  - [ ] robustness metric
- [ ] Add the formula by which each metric is defined to the man page of the corresponding metric calculation function
- [ ] Write functions to summarise/visualise results of synthetic experiments:
  - [x] generate table of aggregate performance metrics
- [x] Run experiments for the default `ddiff` implementation
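As a rough illustration of the composition point above, the sketch below draws one elementary patch from each sampler in a list and folds them into a single corruption. The sampler names follow the `sample_patch_xxx` pattern described in the configuration section, and `compose_patch` is an assumed name for a composition helper; this is illustrative only, not the package's actual implementation.

```r
## Illustrative sketch only: draw one elementary patch from each sampler and
## compose them into a single corruption patch for the data frame `df`.
## `compose_patch` and the sample_patch_* names are assumptions here.
sample_composed_patch <- function(df, samplers) {
  patches <- lapply(samplers, function(sampler) sampler(df))
  # Fold into one composed patch; the application order depends on the
  # composition convention of compose_patch.
  Reduce(compose_patch, patches)
}

## e.g. corruption <- sample_composed_patch(df, list(sample_patch_permute, sample_patch_recode))
```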
## Synthetic experiment configuration
Parameters:
- data frame identifier (e.g. source & dataset name)
- candidate datadiff function
- corruption parameters, consisting of a list of `sample_patch_xxx` closures appropriate for the data
- number of runs
- vector of data frame split ratios
- random seed
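To make the shape of this object concrete, here is a minimal sketch of a configuration as a plain R list. The field names are illustrative only (not the actual `synthetic_experiment` structure); `ddiff` is the candidate function mentioned above, and the sampler names follow the `sample_patch_xxx` pattern.

```r
## Hypothetical experiment configuration as a plain list; field names are
## illustrative, not the package's synthetic_experiment structure.
config <- list(
  data_id     = "uci/iris",                    # data frame identifier (source & dataset name)
  candidate   = ddiff,                         # candidate datadiff function
  corruptions = list(sample_patch_permute,     # sample_patch_xxx closures appropriate
                     sample_patch_recode),     # for the chosen data
  n_runs      = 20,                            # number of runs
  splits      = c(0.5, 0.75),                  # data frame split ratios
  seed        = 147                            # random seed
)
```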
Output:
- copy of all parameters
- list of realised corruption patches (or list of lists)
- list of result patches returned by the candidate datadiff function (or list of lists)
- execution times
Evaluation function takes:
- experiment output
- list of evaluation metrics (as functions)
Results:
- copy of all parameters
- results of all evaluation metrics
- execution times
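A sketch of what this evaluation step might look like, assuming the experiment output carries its parameters, corruptions, results and timings under the (hypothetical) field names used here:

```r
## Hypothetical sketch: apply each metric function to every
## (corruption, result) pair and return the results alongside a copy of the
## parameters and the execution times. Field names are assumptions.
evaluate_experiment <- function(output, metrics) {
  metric_results <- lapply(metrics, function(metric) {
    Map(metric, output$corruptions, output$results)
  })
  list(
    config  = output$config,    # copy of all parameters
    metrics = metric_results,   # results of all evaluation metrics
    timings = output$timings    # execution times
  )
}
```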
## Fidelity metrics

### Patch standard form
It may be easier/quicker to test for the particular "standard" characteristics necessary for a given operation, rather than to have a general function for standardising an arbitrary patch. There are a lot of possible combinations of patches, so testing a `standardise_patch` function might be time-consuming and would provide less certainty of correctness.

For instance, in `metric_column_accuracy` we can easily check that both the corruption and the result patches contain at most one component patch of a given type and column index. This will certainly be the case if both corruption & result are in standard form, but it is a less strict condition, which is easier to check and sufficient for the correct calculation of the column accuracy metric. We do this in the function `is_columnwise_unique`.
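A rough sketch of that check, assuming a composed patch can be broken into elementary components and that each component exposes its type and parameters. The helper names `decompose_patch`, `patch_type` and `get_patch_params` are assumptions, and this is not the package's actual `is_columnwise_unique`.

```r
## Illustrative sketch only: a patch is column-wise unique if no (type, column)
## pair occurs in more than one of its elementary components.
is_columnwise_unique_sketch <- function(patch) {
  components <- decompose_patch(patch)          # elementary component patches
  keys <- unlist(lapply(components, function(p) {
    cols <- get_patch_params(p)[["cols"]]
    if (is.null(cols)) character(0) else paste(patch_type(p), cols)
  }))
  !any(duplicated(keys))                        # TRUE iff every (type, column) pair is unique
}
```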
### Column accuracy

Some care is required in order to correctly implement this metric. At first glance it appears to be a simple case of comparing the column index parameter `cols` in the corruption with that in the datadiff result. However, when the corruption and/or the result is a composed patch, the column indices may have been rearranged by a preceding permutation patch, or shifted by a preceding insert or delete patch. So we must take into account the entire composition, up to the patch of interest, in order to identify the relevant column index. We do this in the function `initial_column_position`.
This issue affects the formula for the column accuracy metric, which, as currently expressed, assumes that the corruption and result are not compositions. Given the preceding point, we cannot simply focus on a particular component of interest and ignore the rest of the composition.
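To make that bookkeeping concrete, the sketch below walks backwards through the components that precede the patch of interest, undoing the effect of each permutation, insertion or deletion on a column index. It is illustrative only (not the package's `initial_column_position`); it reuses the assumed helpers from the previous sketch, and the parameter names `perm` and `insertion_point` are also assumptions.

```r
## Illustrative sketch: map the column index seen by component number `index`
## of a composed patch back to its position in the uncorrupted data, by
## undoing the preceding permute/insert/delete components in reverse order.
initial_column_position_sketch <- function(patch, index, col) {
  preceding <- decompose_patch(patch)[seq_len(index - 1)]
  for (p in rev(preceding)) {
    params <- get_patch_params(p)
    switch(patch_type(p),
      permute = {
        # Assume perm[j] is the original index of the column now at position j.
        col <- params[["perm"]][col]
      },
      insert = {
        # Columns after the insertion point were shifted one place to the right.
        if (col > params[["insertion_point"]]) col <- col - 1
      },
      delete = {
        # Columns at or after a deleted column were shifted one place to the left.
        for (d in sort(params[["cols"]])) if (col >= d) col <- col + 1
      }
    )
  }
  col
}
```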
There is another issue: a corruption may contain two (or more) elementary patches of the same type applied to different columns. In that case we could define column accuracy, for a given corruption `p_gold` and result `p_result`, either as a binary value (giving credit only if `p_result` transforms exactly the same columns as `p_gold`) or as a fraction (giving partial credit in case of partial overlap). The `pairwise_column_accuracy` function implements both possibilities, with a logical `partial` flag to select which one to compute. Note that, in the case of partial credit, the formula for the pairwise column accuracy between a given `p_gold` and `p_result` is:
```
                            #{common columns}
---------------------------------------------------------------------------
 max(#{columns transformed by p_gold}, #{columns transformed by p_result})
```
This ensures that partial credit is not awarded undeservedly (e.g. if `p_gold` transforms only a few columns but `p_result` transforms all columns, only minimal partial credit will be awarded).
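A direct transcription of this formula in base R, assuming the transformed columns of each patch are available as integer vectors (a sketch of the arithmetic only, not the package's `pairwise_column_accuracy`):

```r
## Sketch of the formula above. cols_gold / cols_result are the column indices
## transformed by p_gold / p_result, after mapping back to initial positions.
## `partial` switches between fractional and all-or-nothing credit.
pairwise_column_accuracy_sketch <- function(cols_gold, cols_result, partial = FALSE) {
  if (partial) {
    length(intersect(cols_gold, cols_result)) /
      max(length(unique(cols_gold)), length(unique(cols_result)))
  } else {
    as.numeric(setequal(cols_gold, cols_result))
  }
}

## Example: p_gold transforms columns 2 and 5; p_result transforms 2, 5 and 7.
pairwise_column_accuracy_sketch(c(2, 5), c(2, 5, 7), partial = TRUE)  # 2/3
pairwise_column_accuracy_sketch(c(2, 5), c(2, 5, 7))                  # 0
```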