[WiP] Adding support for constraints

Open tpetricek opened this issue 5 years ago • 2 comments

This is a work in progress. It adds a way to specify constraints for ddiff, motivated by our work on AI assistants, where we need this.

The idea is that you can call ddiff with a list of constraints that restrict some of the options that are generated by columnwise_candidates.

For example:

cs = list(
  constraint_nomatch("LLU", "Nation"),  # Never match LLU column to Nation column
  constraint_notransform("Technology"), # Never transform (scale or recode) Technology column
  constraint_match("Urban.rural", "URBAN2") # Match Urban.rural to URBAN2 and nothing else
)
bb15 <- subset(broadband2015, select=c("URBAN2","Nation",
  "DL24hrmean","UL24hrmean","Latency24hr","Web24hr"))
p <- ddiff(broadband2014, bb15, constraints=cs, verbose=TRUE)  # pass the list defined above
p(broadband2014)

The implementation essentially sets the penalties for forbidden options to 9999. It also makes things a bit faster, because it does not try to generate patches that a rule forbids.
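To illustrate the penalty idea, here is a minimal sketch, not the code in this PR. The helpers `nomatch`, `match_only` and `apply_constraints` are hypothetical stand-ins, and the cost matrix is made up; the sketch also assumes that "match and nothing else" only restricts the first data frame's column.

```r
# Hypothetical stand-ins for the constraint constructors; the real
# implementations in this PR may look quite different.
FORBIDDEN <- 9999  # penalty that effectively rules a pairing out

nomatch <- function(col1, col2) list(type = "nomatch", col1 = col1, col2 = col2)
match_only <- function(col1, col2) list(type = "match", col1 = col1, col2 = col2)

# Rows of `costs` are columns of the first data frame, columns of `costs`
# are columns of the second. Constraints naming unknown columns are skipped.
apply_constraints <- function(costs, constraints = list()) {
  for (cn in constraints) {
    if (!(cn$col1 %in% rownames(costs)) || !(cn$col2 %in% colnames(costs))) next
    if (cn$type == "nomatch") {
      # Forbid this single pairing.
      costs[cn$col1, cn$col2] <- FORBIDDEN
    } else if (cn$type == "match") {
      # Forbid every other pairing for col1, keeping the original cost
      # of the required pairing.
      keep <- costs[cn$col1, cn$col2]
      costs[cn$col1, ] <- FORBIDDEN
      costs[cn$col1, cn$col2] <- keep
    }
  }
  costs
}

# Made-up 2x2 cost matrix for illustration only.
costs <- matrix(0.5, nrow = 2, ncol = 2,
                dimnames = list(c("LLU", "Urban.rural"), c("Nation", "URBAN2")))
constrained <- apply_constraints(costs, list(
  nomatch("LLU", "Nation"),
  match_only("Urban.rural", "URBAN2")
))
```

Any matching step that minimises total cost would then avoid the 9999 entries whenever an alternative exists. Skipping generation of the forbidden candidates altogether, as the PR does, is where the speed-up would come from.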

If there is any interest in merging this, I would be happy to clean up the code. Right now it seems to work, but there are no tests and no documentation.

tpetricek avatar Sep 25 '19 00:09 tpetricek

This is addressing: https://github.com/alan-turing-institute/aida-datadiff/issues/36

tpetricek avatar Sep 25 '19 00:09 tpetricek

Running the unit tests, I found that the existing test for columnwise_candidates fails because the constraints argument is missing.

Is it possible to have a default argument to indicate "no constraints" (e.g. an empty list), to make the change backwards-compatible and keep the interface clean for anyone working without constraints?
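As a rough sketch of what that default could look like (the function `candidates_sketch` and its simplified signature are hypothetical; the real columnwise_candidates takes other arguments that are omitted here):

```r
# Hypothetical, simplified candidate generator. `constraints` defaults to
# an empty list, so existing callers that never pass it keep working.
candidates_sketch <- function(df1, df2, constraints = list()) {
  # Start from every pairing of a column in df1 with a column in df2.
  pairs <- expand.grid(col1 = names(df1), col2 = names(df2),
                       stringsAsFactors = FALSE)
  # Drop each forbidden pairing; with an empty list this loop does nothing.
  for (cn in constraints) {
    forbidden <- pairs$col1 == cn$col1 & pairs$col2 == cn$col2
    pairs <- pairs[!forbidden, , drop = FALSE]
  }
  pairs
}

df1 <- data.frame(a = 1, b = 2)
df2 <- data.frame(x = 1, y = 2)

# Old-style call without constraints: all four pairings are kept.
all_pairs <- candidates_sketch(df1, df2)

# New-style call with one forbidden pairing.
fewer <- candidates_sketch(df1, df2,
                           constraints = list(list(col1 = "a", col2 = "x")))
```

With a default like this, the existing test would not need to change, and callers that do not use constraints would never see the new argument.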

thobson88 avatar Nov 22 '19 15:11 thobson88