Handle class

Open eusebe opened this issue 9 years ago • 6 comments

Hi,

This is not really a bug but a feature request: is daff able to detect class changes?

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> class(x[, 1])
[1] "factor"
> class(x[, 2])
[1] "character"
> class(x[, 3])
[1] "numeric"
> render_diff(diff_data(y, x))
@@ Sepal.Length Sepal.Width ...

Thanks! David

eusebe • Feb 24 '15 12:02

Hi David,

Very good question! The answer is yes and no.

The object returned by diff_data knows about the class changes, and patch_data uses them:

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> str(patch_data(y, diff_data(y, x)))
'data.frame':   3 obs. of  5 variables:
 $ Sepal.Length: Factor w/ 3 levels "4.7","4.9","5.1": 3 2 1
 $ Sepal.Width : chr  "3.5" "3" "3.2"
 $ Petal.Length: num  1.4 1.4 1.3
 $ Petal.Width : num  0.2 0.2 0.2
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1

However, the Coopy Diff Format does not know about type changes, so storing a diff and loading it again won't give the correct patch:

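library(daff)

x <- y <- iris[1:3,]
x$Sepal.Width <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)
# Round-trip the diff through a CSV file; class changes are not stored in it
write_diff(diff_data(y, x), file = "diff.csv")
d_yx <- read_diff("diff.csv")
# Inspect the result: the class changes made in x are not reproduced
str(patch_data(y, d_yx))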

The author of the Coopy Diff Format welcomes the addition of type information, which is a good thing. Note, however, that this probably will not handle factor/character switches, since those are very R-specific.
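Until the format carries type information, one workaround is to record the target classes separately and reapply them after patching. The following is only a sketch in base R: `capture_classes` and `restore_classes` are hypothetical helpers, not part of daff, and the "patched" frame is simulated by reusing `y` rather than by calling `read_diff`.

```r
# Hypothetical helpers (not part of daff): remember each column's primary class
capture_classes <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Reapply the recorded classes to a data frame whose types were lost
restore_classes <- function(df, classes) {
  for (nm in intersect(names(classes), names(df))) {
    df[[nm]] <- switch(classes[[nm]],
      factor    = as.factor(df[[nm]]),
      character = as.character(df[[nm]]),
      numeric   = as.numeric(as.character(df[[nm]])),
      integer   = as.integer(as.character(df[[nm]])),
      df[[nm]]  # unknown class: leave the column untouched
    )
  }
  df
}

y <- iris[1:3, ]
x <- y
x$Sepal.Width  <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

target_classes <- capture_classes(x)

# Stand-in for a patched frame that lost the class changes (assumption)
patched  <- y
restored <- restore_classes(patched, target_classes)
str(restored)
```

This only restores the class itself. Factor level order is rebuilt by `as.factor`, so a non-default level order would still be lost.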

edwindj • Feb 24 '15 12:02

I will try to add class changes to the diff format, but this will not land in the master branch before April (due to time constraints). Current thoughts:

  • extend the coopy highlighter diff format with type info
  • have an R-specific storage format: factors, for example, can be quite nasty (e.g. a different order of levels). diff_data and patch_data take this into account, but it is not stored in the diff format.

edwindj • Feb 24 '15 14:02

One way to do this with the existing diff format would be to generate two diffs: the existing one comparing content, and a second one that compares metadata. So for a table like this:

Sepal.Length  Sepal.Width  ...
10.2          55.3         ...
9.1           1.2          ...

There would also be another (imaginary) table of type information like this:

Name          TypeNuance1  TypeNuance2  ...
Sepal.Length  ...          ...          ...
Sepal.Width   ...          ...          ...
...           ...          ...          ...

Where the columns are a flattened version of whatever parameters it takes to fully describe types in R. Diffs of the second table would then give a clear record of type changes.
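As a rough illustration of such a type table in R, the sketch below flattens each column's class and factor levels into strings. `column_meta` is a hypothetical helper, and the choice of a `class` and a `levels` column is an assumption about which "type nuances" are worth recording.

```r
# Hypothetical helper: one row per column, type details flattened to strings
column_meta <- function(df) {
  data.frame(
    name   = names(df),
    class  = vapply(df, function(col) paste(class(col), collapse = "/"),
                    character(1)),
    levels = vapply(df, function(col) {
      if (is.factor(col)) paste(levels(col), collapse = "|") else NA_character_
    }, character(1)),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

y <- iris[1:3, ]
x <- y
x$Sepal.Width  <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

meta_y <- column_meta(y)
meta_x <- column_meta(x)

# Names of the columns whose class differs between the two tables
changed <- meta_x$name[meta_y$class != meta_x$class]
changed
```

Because `meta_y` and `meta_x` are ordinary data frames, they could presumably be compared with `diff_data` like any other pair of tables.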

paulfitz • Feb 24 '15 15:02

@paulfitz I like your idea of creating a metadiff, which is very flexible.

However I see the following issues:

  • Instead of one diff file, we end up with two diff files. I think it is desirable to store the diff in one textual form.
  • R has so-called factor columns, which use a fixed, ordered set of values (called levels). Each factor column has its own levels. In R it is desirable to track changes in this set of values, e.g. c("female", "male") -> c("male", "female") or c("female", "male") -> c("total", "female", "male"). I'm not sure what the best way is to encode this in a metadiff (one column or many columns).
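To make the two kinds of level change concrete, here is a small base R sketch. `compare_levels` is a hypothetical helper that reports added levels, removed levels, and whether the shared levels changed order; it is only one possible way to summarize the change.

```r
# Hypothetical helper: summarize how the levels of two factors differ
compare_levels <- function(old, new) {
  lo <- levels(old)
  ln <- levels(new)
  list(
    added     = setdiff(ln, lo),
    removed   = setdiff(lo, ln),
    # TRUE when the levels present in both appear in a different order
    reordered = !identical(lo[lo %in% ln], ln[ln %in% lo])
  )
}

a <- factor(c("female", "male"), levels = c("female", "male"))
b <- factor(c("female", "male"), levels = c("male", "female"))
d <- factor(c("female", "male"), levels = c("total", "female", "male"))

res_ab <- compare_levels(a, b)  # same levels, different order
res_ad <- compare_levels(a, d)  # one level added, order of the rest unchanged
```

Each element of the result could become its own column in a metadiff, or all three could be folded into a single string column.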

Maybe we should have both: encode simple type information in the coopy highlighter diff format and extended type information in a metadiff.

Shall we move this discussion to http://dataprotocols.org/tabular-diff-format ?

edwindj • Feb 25 '15 08:02

Perhaps it would be easiest to implement a second function ('diff_meta'?) to handle metadata changes.

A simplistic way to handle (changes in) factor levels is to concatenate them into a single string with some separator, e.g.

set.seed(42)
f <- factor(sample(letters[1:5], 20, replace=TRUE))
f.levels <- paste( dQuote(levels(f)), collapse=", ")
f.levels
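For the concatenated string to be usable when applying a patch, it also has to be split back into levels. The sketch below shows one such round trip in base R. `encode_levels` and `decode_levels` are hypothetical helpers, and the `"|"` separator is an assumption; the encoder refuses any factor whose levels already contain the separator.

```r
# Hypothetical helpers: store factor levels as one string and recover them
encode_levels <- function(f, sep = "|") {
  # Fail early if a level contains the separator, since decoding would be ambiguous
  stopifnot(is.factor(f), !any(grepl(sep, levels(f), fixed = TRUE)))
  paste(levels(f), collapse = sep)
}

decode_levels <- function(s, sep = "|") {
  strsplit(s, sep, fixed = TRUE)[[1]]
}

set.seed(42)
f <- factor(sample(letters[1:5], 20, replace = TRUE))

encoded <- encode_levels(f)

# Rebuild the factor from its values plus the decoded levels
f2 <- factor(as.character(f), levels = decode_levels(encoded))
```

Plain separators are easier to decode than the `dQuote` form, whose quote characters depend on the locale.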

gwarnes-mdsol • Mar 24 '17 13:03

My overall feeling is that it is hard to make meta stuff work cleanly enough for it to actually save people time for real.

That said, some basic support for column metadata diffing did end up being added to daff, for example when comparing sqlite databases (sqldiff). Basically, a data table can have a meta-table whose cells specify properties of the columns of the data table. For sqlite databases I added just a single "type" property. When showing diffs, the meta-table is diffed just like regular tables, then prepended to the data diff with @ decoration to distinguish it from data.

paulfitz • Mar 25 '17 15:03