Tim Hobson issues

Results: 13 issues of Tim Hobson

Fix bugs in batch experiment reporting

Bugs:
- [x] Reported type precision & recall are identical across all patch types
- [x] Column accuracy metric incorrectly calculated in cases of failed type recall
- [ ]...
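For orientation, here is a minimal sketch of how precision and recall could be computed separately for each patch type, so that the values can differ between types. It is written in Python purely as an illustration (the issues refer to R code), and the function name, the one-label-per-column input layout, and the use of `None` for "no patch" are all assumptions rather than the project's actual reporting code.

```python
from collections import Counter


def per_type_precision_recall(true_types, predicted_types):
    """Return {patch_type: (precision, recall)} from per-column labels.

    Hypothetical helper: each list holds one patch type name per column,
    or None where no patch applies.
    """
    if len(true_types) != len(predicted_types):
        raise ValueError("both label lists must cover the same columns")

    true_counts = Counter(t for t in true_types if t is not None)
    pred_counts = Counter(p for p in predicted_types if p is not None)
    # A hit is a column whose predicted type equals its true type.
    hits = Counter(
        t for t, p in zip(true_types, predicted_types)
        if t is not None and t == p
    )

    metrics = {}
    for patch_type in sorted(set(true_counts) | set(pred_counts)):
        n_pred = pred_counts[patch_type]
        n_true = true_counts[patch_type]
        # None marks an undefined ratio (type never predicted / never present).
        precision = hits[patch_type] / n_pred if n_pred else None
        recall = hits[patch_type] / n_true if n_true else None
        metrics[patch_type] = (precision, recall)
    return metrics


example = per_type_precision_recall(
    ["shift", "shift", "scale", None],
    ["shift", "scale", "scale", None],
)
```

In this toy example the two types get different values (`shift` has perfect precision but misses one column, `scale` the reverse), which is the behaviour the first bug says was missing from the report.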

Create calibration dataset for patch_break

One approach to calibration is to choose penalty parameters based on the results of experiments on standard (UCI) datasets. If this approach is adopted (there are others), an additional dataset...

Build framework for synthetic experiments

Following [Charles's notes](https://github.com/alan-turing-institute/aida-datadiff/blob/master/notes/datadiff-experiment-plan.md):
- [x] Provide access to UCI datasets
- [x] Add functions to randomly sample from the set of valid patch objects:
  - [x] permute
  - [x] shift...

Handle corruptions involving both column insertions & deletions

Currently the ddiff algorithm works only under the assumption that columns are either inserted or deleted (or neither), but not both.

Add a final "mixed corruption" test to test-extract_canonical_permutation.

Existing unit tests cover inserts, inserts+permutes, deletes & deletes+permutes. A final "mixed" test is required which involves all three corruption types.
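A fixture for such a mixed test might be built along the following lines. This is a Python illustration only: the function under test is not shown in this listing, so the sketch simply constructs a corruption with all three components and recovers them by column name. Both helper names and the name-based matching are assumptions.

```python
def apply_mixed_corruption(columns, delete, insert, order):
    """Delete columns, append new ones, then permute the result.

    Hypothetical helper: `order` gives, for each output position, the
    index to take from the intermediate list (after delete and insert).
    """
    kept = [c for c in columns if c not in delete]
    extended = kept + list(insert)
    if sorted(order) != list(range(len(extended))):
        raise ValueError("order must be a permutation of the intermediate columns")
    return [extended[i] for i in order]


def summarise_corruption(original, corrupted):
    """Recover deleted columns, inserted columns, and the new positions
    of the surviving columns, matching columns by name (an assumption)."""
    deleted = [c for c in original if c not in corrupted]
    inserted = [c for c in corrupted if c not in original]
    shared = [c for c in original if c in corrupted]
    positions = [corrupted.index(c) for c in shared]
    return deleted, inserted, positions


original = ["a", "b", "c", "d"]
corrupted = apply_mixed_corruption(
    original, delete={"b"}, insert=["z"], order=[3, 2, 0, 1]
)
deleted, inserted, positions = summarise_corruption(original, corrupted)
```

Here one column is deleted, one is inserted, and the survivors are reordered, so a single case exercises all three corruption types at once.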

Experiment with different penalties for shift & scale patches to improve observed poor precision & recall.

CS wrote on 17/08/2017: Performance on detecting shift and scale is not very good. This could be for at least three reasons: a) The synthetic problems are too hard. i.e....

Rename patch_perm as patch_permute and change type from "perm" to "permute".

Improve procedure for generating best-guess patches with real-valued parameters

Currently parameters are estimated by passing the diffness measure to the generic R optimisation function (stats::optimise). We could certainly make this more efficient for the specific case of the K-S...
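The current approach, as described, can be sketched as follows for a shift parameter. This is a Python illustration, not the package's code: a plain golden-section search stands in for `stats::optimise` (which combines golden-section search with successive parabolic interpolation), the two-sample Kolmogorov-Smirnov statistic plays the role of the diffness measure, and all function names and the search bounds are assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    # Both empirical CDFs are step functions, so the largest gap is
    # attained at one of the pooled sample points.
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


def golden_section_minimise(f, lower, upper, tol=1e-6):
    """Bounded 1-D search; returns the best evaluated (argmin, value)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    best = min((fc, c), (fd, d))
    while b - a > tol:
        if fc < fd:
            # Keep [a, d]; the old c becomes the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
            best = min(best, (fc, c))
        else:
            # Keep [c, b]; the old d becomes the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
            best = min(best, (fd, d))
    value, argmin = best
    return argmin, value


def estimate_shift(x, y, lower, upper):
    """Find the shift of x that minimises the K-S statistic against y."""
    def diffness(shift):
        return ks_statistic([v + shift for v in x], y)
    return golden_section_minimise(diffness, lower, upper)


x = [float(i) for i in range(10)]
y = [v + 3.0 for v in x]
shift, ks = estimate_shift(x, y, -10.0, 10.0)
```

Because the K-S statistic is piecewise constant in the shift, a generic search like this only locates the minimum approximately, and each probe re-evaluates the statistic from scratch.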

Test for identical handling on factors and character vectors

The performance of ddiff ought not to depend on whether a column contains a character vector or a factor, but a discrepancy was observed when running on the UK broadband...