daff
Handle class
Hi,
This is not really a bug but a feature request: is daff able to detect class changes?
> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> class(x[, 1])
[1] "factor"
> class(x[, 2])
[1] "character"
> class(x[, 3])
[1] "numeric"
> render_diff(diff_data(y, x))
@@ | Sepal.Length | Sepal.Width | ... |
---|
Thanks! David
Hi David,
Very good question! The answer is yes and no.
The object returned by diff_data knows the class changes. They are used by patch_data:
> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> str(patch_data(y, diff_data(y, x)))
'data.frame': 3 obs. of 5 variables:
$ Sepal.Length: Factor w/ 3 levels "4.7","4.9","5.1": 3 2 1
$ Sepal.Width : chr "3.5" "3" "3.2"
$ Petal.Length: num 1.4 1.4 1.3
$ Petal.Width : num 0.2 0.2 0.2
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1
However, the Coopy Diff Format does not know about type changes, so storing a diff and loading it again won't give the correct patch:
x <- y <- iris[1:3,]
x$Sepal.Width <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)
write_diff(diff_data(y, x), file = "diff.csv")
d_yx <- read_diff("diff.csv")
str(patch_data(y, d_yx))
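Until the diff format carries type information, one possible workaround is to record the column classes separately and reapply them after patching. The following is only a sketch in base R: `capture_types` and `restore_types` are hypothetical helpers, not part of daff, and `y` is used as a stand-in for a patched table that came back with the original column classes.

```r
# Hypothetical helpers (not part of daff): remember each column's class
# and, for factors, its levels.
capture_types <- function(df) {
  lapply(df, function(col) {
    list(class  = class(col)[1],
         levels = if (is.factor(col)) levels(col) else NULL)
  })
}

# Reapply the recorded classes to a data frame with the same column names.
restore_types <- function(df, types) {
  for (name in intersect(names(df), names(types))) {
    info <- types[[name]]
    df[[name]] <- switch(info$class,
      factor    = factor(as.character(df[[name]]), levels = info$levels),
      character = as.character(df[[name]]),
      numeric   = as.numeric(as.character(df[[name]])),
      integer   = as.integer(as.character(df[[name]])),
      df[[name]]  # any other class is left untouched
    )
  }
  df
}

y <- iris[1:3, ]
x <- y
x$Sepal.Width  <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

types_x  <- capture_types(x)   # would be stored next to the diff file
patched  <- y                  # assumption: stand-in for a patch that lost the types
restored <- restore_types(patched, types_x)
str(restored)
```

This only covers the classes listed in the `switch`; anything else (dates, ordered factors, and so on) would need its own branch.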
The author of the Coopy Diff Format welcomes the addition of type information, which is a good thing. Note, however, that this probably will not handle factor/character switches, since those are very R-specific.
I will try to add class changes to the diff format, but this will not land in the master branch before April (due to time constraints). Current thoughts:
- extend the coopy highlighter diff format with type info
- have an R-specific storage format: e.g. factors can be quite nasty (a different order of levels, for example). diff_data and patch_data take this into account, but it is not stored in the diff format.
One way to do this with the existing diff format would be to generate two diffs: the existing one comparing content, and a second one that compares metadata. So for a table like this:
Sepia.Length | Sepia.Width | ... |
---|---|---|
10.2 | 55.3 | ... |
9.1 | 1.2 | ... |
There would also be another (imaginary) table of type information like this:
Name | TypeNuance1 | TypeNuance2 | ... |
---|---|---|---|
Sepia.Length | ... | ... | ... |
Sepia.Width | ... | ... | ... |
... | ... | ... | ... |
Where the columns are a flattened version of whatever parameters it takes to fully describe types in R. Diffs of the second table would then give a clear record of type changes.
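A rough sketch of what such a type table could look like in R, using only base functions. `type_table` is a hypothetical helper and the choice of properties (`Class`, `Levels`) is an assumption; a real version would need more columns to describe R types fully.

```r
# Hypothetical helper: one row per column of df, one column per type property.
type_table <- function(df) {
  data.frame(
    Name   = names(df),
    Class  = unname(vapply(df, function(col) class(col)[1], character(1))),
    Levels = unname(vapply(df, function(col) {
      if (is.factor(col)) paste(levels(col), collapse = "|") else NA_character_
    }, character(1))),
    stringsAsFactors = FALSE
  )
}

y <- iris[1:3, ]
x <- y
x$Sepal.Width  <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

meta_y <- type_table(y)
meta_x <- type_table(x)

# Columns whose class or levels differ between the two tables
changed <- meta_y$Class != meta_x$Class |
  !mapply(identical, meta_y$Levels, meta_x$Levels)

type_changes <- data.frame(
  Name   = meta_y$Name[changed],
  Before = meta_y$Class[changed],
  After  = meta_x$Class[changed],
  stringsAsFactors = FALSE
)
print(type_changes)
```

Since `meta_y` and `meta_x` are ordinary data frames, they could presumably also be compared with diff_data in the same way as the data itself.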
@paulfitz I like your idea of creating a metadiff, which is very flexible.
However I see the following issues:
- Instead of one diff file we end up with two diff files. I think it is desirable to store the diff in one textual form.
- R has so-called factor columns, which use a fixed ordered set of values (called levels). Each factor column has its own levels. In R it is desirable to track changes in this set of values, e.g. c("female", "male") -> ("male", "female") or ("female", "male") -> ("total", "female", "male"). I'm not sure what the best way is to code this in a metadiff (one column or many columns...).
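To make the two kinds of level change concrete, here is a minimal base R sketch that classifies them. `level_changes` is a hypothetical helper, not something daff provides, and it says nothing about how the result should be encoded in a metadiff.

```r
# Hypothetical helper: describe how a set of factor levels changed.
level_changes <- function(old, new) {
  common_old <- old[old %in% new]  # shared levels, in the old order
  common_new <- new[new %in% old]  # shared levels, in the new order
  list(
    added     = setdiff(new, old),
    removed   = setdiff(old, new),
    reordered = !identical(common_old, common_new)
  )
}

# Same levels, different order
swap <- level_changes(c("female", "male"), c("male", "female"))

# A new level is added, the existing order is kept
grow <- level_changes(c("female", "male"), c("total", "female", "male"))

str(swap)
str(grow)
```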
Maybe we should have both: encode simple type information in the coopy highlighter diff format and extended type information in a meta diff
Shall we move this discussion to http://dataprotocols.org/tabular-diff-format ?
Perhaps it would be easiest to implement a second function ('diff_meta'?) to handle metadata changes.
A simplistic way to handle (changes in) factor levels is to concatenate them into a single string with some separator, e.g.
set.seed(42)
f <- factor(sample(letters[1:5], 20, replace=TRUE))
f.levels <- paste( dQuote(levels(f)), collapse=", ")
f.levels
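For the string to be useful in a patch, it also has to be split back into levels. A small round-trip sketch, assuming a separator that never occurs inside a level; `encode_levels` and `decode_levels` are hypothetical names, and the quoting step is left out to keep the decoding trivial:

```r
# Hypothetical helpers: store factor levels as one string and recover them.
# Assumption: the separator does not appear inside any level.
encode_levels <- function(f, sep = "|") paste(levels(f), collapse = sep)
decode_levels <- function(s, sep = "|") strsplit(s, sep, fixed = TRUE)[[1]]

f <- factor(c("b", "a", "c"), levels = c("c", "a", "b"))
s <- encode_levels(f)
recovered <- decode_levels(s)

# Rebuild the factor with the stored level order
g <- factor(as.character(f), levels = recovered)
identical(levels(g), levels(f))
```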
My overall feeling is that it is hard to make meta stuff work cleanly enough for it to actually save people time for real.
That said: some basic support of column meta-data diffing did end up added to daff. For example, when comparing sqlite databases:
Basically, a data table can have a meta-table where cells specify properties of the columns of the data table. For sqlite databases I added just a single "type" property. When showing diffs, the meta-table is diffed just like regular tables, then prepended to the data diff with @ decoration to distinguish it from data.