Changed line appears as added + removed

Open gwarnes-mdsol opened this issue 7 years ago • 11 comments

The daff comparison algorithm improperly marks a row with changed data as an added/removed pair.

For instance, comparing the CSV files 'iris.csv' and 'iris2.csv' (via the edwindj/daff R wrapper), I get the following diff:

@@	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
...	...	...	...	...	...
	5.7	2.8	4.1	1.3	versicolor
->	6.3	3.3	6	2.5	virginica->XXX
+++	5.8	2.7	5.1	1.9	XXX
---	5.8	2.7	5.1	1.9	virginica
->	7.1	3	5.9	2.1	virginica->XXX
->	6.3	2.9	5.6	1.8	virginica->XXX
->	6.5	3	5.8	2.2	virginica->XXX
->	7.6	3	6.6	2.1	virginica->XXX
->	4.9	2.5	4.5	1.7	virginica->XXX
->	7.3	2.9	6.3	1.8	virginica->XXX
->	6.7	2.5	5.8	1.8	virginica->XXX
->	7.2	3.6	6.1	2.5	virginica->XXX
->	6.5	3.2	5.1	2	virginica->XXX
->	6.4	2.7	5.3	1.9	virginica->XXX
->	6.8	3	5.5	2.1	virginica->XXX
->	5.7	2.5	5	2	virginica->XXX
->	5.8	2.8	5.1	2.4	virginica->XXX
->	6.4	3.2	5.3	2.3	virginica->XXX
->	6.5	3	5.5	1.8	virginica->XXX
->	7.7	3.8	6.7	2.2	virginica->XXX
->	7.7	2.6	6.9	2.3	virginica->XXX
->	6	2.2	5	1.5	virginica->XXX
->	6.9	3.2	5.7	2.3	virginica->XXX
->	5.6	2.8	4.9	2	virginica->XXX
->	7.7	2.8	6.7	2	virginica->XXX
->	6.3	2.7	4.9	1.8	virginica->XXX
->	6.7	3.3	5.7	2.1	virginica->XXX
->	7.2	3.2	6	1.8	virginica->XXX
->	6.2	2.8	4.8	1.8	virginica->XXX
->	6.1	3	4.9	1.8	virginica->XXX
->	6.4	2.8	5.6	2.1	virginica->XXX
->	7.2	3	5.8	1.6	virginica->XXX
->	7.4	2.8	6.1	1.9	virginica->XXX
->	7.9	3.8	6.4	2	virginica->XXX
->	6.4	2.8	5.6	2.2	virginica->XXX
->	6.3	2.8	5.1	1.5	virginica->XXX
->	6.1	2.6	5.6	1.4	virginica->XXX
->	7.7	3	6.1	2.3	virginica->XXX
->	6.3	3.4	5.6	2.4	virginica->XXX
->	6.4	3.1	5.5	1.8	virginica->XXX
->	6	3	4.8	1.8	virginica->XXX
->	6.9	3.1	5.4	2.1	virginica->XXX
->	6.7	3.1	5.6	2.4	virginica->XXX
->	6.9	3.1	5.1	2.3	virginica->XXX
+++	5.8	2.7	5.1	1.9	XXX
---	5.8	2.7	5.1	1.9	virginica
->	6.8	3.2	5.9	2.3	virginica->XXX
->	6.7	3.3	5.7	2.5	virginica->XXX
->	6.7	3	5.2	2.3	virginica->XXX
->	6.3	2.5	5	1.9	virginica->XXX
->	6.5	3	5.2	2	virginica->XXX
->	6.2	3.4	5.4	2.3	virginica->XXX
->	5.9	3	5.1	1.8	virginica->XXX

As you can see, the pair of lines

+++	5.8	2.7	5.1	1.9	XXX
---	5.8	2.7	5.1	1.9	virginica

are shown as an addition + deletion, when they are actually a change in a single column.
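One possible explanation, offered as an assumption rather than a confirmed diagnosis: the built-in iris data set contains one exact duplicate row, and the two +++/--- pairs in the diff above fall exactly where those two identical rows sit. A small base R check (no daff needed) shows which rows are copies of each other:

```r
# Base R only: flag every row of iris that has an exact copy elsewhere.
# duplicated() marks the later copy; fromLast = TRUE marks the earlier one.
dup <- duplicated(iris) | duplicated(iris, fromLast = TRUE)

# Row numbers of all rows that belong to a duplicate group.
dup_rows <- which(dup)
dup_rows

# Show the rows themselves.
iris[dup_rows, ]
```

If the row matching is content-based, identical rows cannot be told apart, which may be why these rows in particular are reported as added and removed instead of modified.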

For some large files--but not in this file--I see trios or more complex patterns of added/deleted/modified lines where changes in the values in two or more rows are displayed as a mix of modifications to unmatched rows, combined with additions + deletions. Something like:

+++	5.8		2.7		5.1		1.9		XXX
->	6.8->5.8	3.2->2.7	5.9->5.1	2.3->1.9	virginica->XXX
---	5.8		3.2		5.1		1.9		virginica

gwarnes-mdsol avatar Apr 14 '17 22:04 gwarnes-mdsol

Hi @gwarnes-mdsol, could you do me a favor and attach the .csv files, or forward them by email? (my email address is attached to my github profile). Thanks!

paulfitz avatar Apr 17 '17 20:04 paulfitz

Sorry about that. BTW, github doesn't like the extension .csv so I added .txt to make it happy.

iris.csv.txt iris2.csv.txt

gwarnes-mdsol avatar Apr 17 '17 21:04 gwarnes-mdsol

Thanks for the files. From the command line, with daff iris.csv.txt iris2.csv.txt, I'm unfortunately not seeing the same diff; it gives -> updates everywhere. There was an extra column that looked like a row number, but removing it also wasn't sufficient to replicate. How hard would it be to talk me through how to replicate using R?

paulfitz avatar Apr 18 '17 01:04 paulfitz

Hi Paul, it is pretty simple to replicate in R. I'll try to take some time tomorrow to write brief instructions. In the meantime, installing R would be the first step :-) http://r-project.org

gwarnes-mdsol avatar Apr 18 '17 18:04 gwarnes-mdsol

Here's the R code to replicate:

install.packages("devtools")
devtools::install_github("edwindj/daff")
library(daff)
iris2 <- iris
levels(iris2$Species)[3] <- "XXX"
df <- diff_data(iris, iris2)
df
render_diff(df)

(Note that the last command, render_diff(df), generates and displays an HTML page that has additional features that might be worth moving into your codebase.)

And the output on my system:

gwarnes@F5KSH06HF9VN:/tmp$ R

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> devtools::install_github("edwindj/daff")
Downloading GitHub repo edwindj/daff@master
from URL https://api.github.com/repos/edwindj/daff/zipball/master
Installing daff
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/jsonlite_1.4.tgz'
Content type 'application/x-gzip' length 1077372 bytes (1.0 MB)
==================================================
downloaded 1.0 MB

Installing jsonlite
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de4a4865b/jsonlite'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *binary* package ‘jsonlite’ ...
* DONE (jsonlite)
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/V8_1.4.tgz'
Content type 'application/x-gzip' length 2304654 bytes (2.2 MB)
==================================================
downloaded 2.2 MB

Installing V8
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/Rcpp_0.12.10.tgz'
Content type 'application/x-gzip' length 3020988 bytes (2.9 MB)
==================================================
downloaded 2.9 MB

Installing Rcpp
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de5618b221/Rcpp'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *binary* package ‘Rcpp’ ...
* DONE (Rcpp)
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de284a1564/V8'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *binary* package ‘V8’ ...
* DONE (V8)
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/gc/c3c2p5_d4td159rblkqbp4s1xjhdh_/T/RtmpPi3Rqh/devtoolsc4de22b005a4/edwindj-daff-a5a97e1'  \
  --library='/Users/gwarnes/Library/R/3.3/library' --install-tests

* installing *source* package ‘daff’ ...
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (daff)
> library(daff)
> iris2 <- iris
> levels(iris2$Species)
[1] "setosa"     "versicolor" "virginica"
> levels(iris2$Species)[3] <- "XXX"
> df <- diff_data(iris, iris2)
> df
Daff Comparison: ‘iris’ vs. ‘iris2’
  First 6 and last 6 patch lines:
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
... ...          ...         ...          ...         ...
    5.7          2.8         4.1          1.3         versicolor
->  6.3          3.3         6            2.5         virginica->XXX
+++ 5.8          2.7         5.1          1.9         XXX
--- 5.8          2.7         5.1          1.9         virginica
->  7.1          3           5.9          2.1         virginica->XXX
... ...          ...         ...          ...         ...
->  6.7          3.3         5.7          2.5         virginica->XXX
->  6.7          3           5.2          2.3         virginica->XXX
->  6.3          2.5         5            1.9         virginica->XXX
->  6.5          3           5.2          2           virginica->XXX
->  6.2          3.4         5.4          2.3         virginica->XXX
->  5.9          3           5.1          1.8         virginica->XXX

> render_diff(df)
>

gwarnes-mdsol avatar Apr 20 '17 16:04 gwarnes-mdsol

Hi @paulfitz,

I think I'm facing the same issue here. The update does not seem to work with the same use case.

Example:

  • CSV file with 5 columns
  • Duplicated data for columns A and B on some rows
  • Update C on one of these rows -> Ends up with added/deleted instead of updated
  • NB: If I update a C cell where A and B are not duplicated, the update is detected

I've tried to play with the --id flag, but didn't manage to find a way to always make it work.

Any ideas? Thanks

FYI, I'm using daff cli 1.3.25 (JS)
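A possible workaround for the duplicated-key case described above, sketched in R under stated assumptions. The idea is to add a counter column so that rows sharing the same values in A and B still get a unique key, which could then be supplied as an extra key column (for example through the --id flag). The helper name add_occurrence and the column name occurrence are hypothetical, and whether daff then pairs the rows as intended has not been verified in this thread.

```r
# Hypothetical helper: number repeated key combinations 1, 2, 3, ...
# so that (key columns + occurrence) is unique for every row.
add_occurrence <- function(df, key_cols) {
  # Build one string per row from the key columns.
  key <- do.call(paste, c(df[key_cols], sep = "\r"))
  # Within each group of identical keys, count rows in their original order.
  df$occurrence <- ave(seq_along(key), key, FUN = seq_along)
  df
}

# Illustrative data only: columns A and B repeat on the first two rows.
tab <- data.frame(
  A = c("x", "x", "y"),
  B = c(1, 1, 2),
  C = c("p", "q", "r"),
  stringsAsFactors = FALSE
)

keyed <- add_occurrence(tab, c("A", "B"))
keyed
```

The counter depends on row order, so it only helps when both versions of the table list the duplicated rows in the same order.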

selcham avatar May 15 '17 09:05 selcham

I dropped a line in the R code above. I've fixed above, but I'm also posting it here for clarity:

install.packages("devtools")
devtools::install_github("edwindj/daff")
library(daff)
iris2 <- iris
levels(iris2$Species)[3] <- "XXX"
df <- diff_data(iris, iris2)
df
render_diff(df)

gwarnes-mdsol avatar Jun 05 '17 15:06 gwarnes-mdsol

Hi @paulfitz, do you think you have time to look at this issue? Thanks

selcham avatar Jun 09 '17 14:06 selcham

https://twitter.com/miketaylr/status/873175465321783296

paulfitz avatar Jun 09 '17 15:06 paulfitz

Simple Example:

ir table:

"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.8,2.7,5.1,1.9,"virginica"
5.8,2.7,5.1,1.9,"virginica"

ir2 table:

"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.8,2.7,5.1,1.9,"XXX"
5.8,2.7,5.1,1.9,"XXX"

Comparison:

> diff_data(ir, ir2)
Daff Comparison: 'ir' vs. 'ir2' 
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
+++ 5.8          2.7         5.1          1.9         XXX      
+++ 5.8          2.7         5.1          1.9         XXX      
--- 5.8          2.7         5.1          1.9         virginica
--- 5.8          2.7         5.1          1.9         virginica
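For anyone who wants to rebuild these two tables, here is a minimal sketch in base R. It assumes the two rows are the identical pair found in the built-in iris data (rows 102 and 143); how ir and ir2 were actually created is not stated above.

```r
# Take the two identical rows from the built-in iris data set.
ir <- iris[c(102, 143), ]
rownames(ir) <- NULL

# Copy the table and rename the "virginica" factor level to "XXX",
# mirroring the levels(...)[3] step used earlier in this thread.
ir2 <- ir
levels(ir2$Species)[levels(ir2$Species) == "virginica"] <- "XXX"

ir
ir2

# With the daff R package installed, the comparison would then be:
# library(daff)
# diff_data(ir, ir2)
```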

gwarnes-mdsol avatar Jul 06 '17 17:07 gwarnes-mdsol

I'm getting a similar problem with columns. In a table where some columns duplicate the data of other columns, changing a column header shows up as an added and deleted column, even if the renamed column does not itself contain duplicated data. Using the bridge example on the demo page, change the Designer column so that it's identical to the Bridge column in both the original and the modified version. Then, in the modified version, change Length to something like Span. The Length/Span column appears as added/removed.
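To check whether a table is exposed to this column case, a small base R sketch can list columns whose contents are identical to another column. The table below is a hypothetical stand-in with made-up values, not the data from the demo page.

```r
# Hypothetical table: Designer has been made identical to Bridge.
bridges <- data.frame(
  Bridge   = c("A", "B", "C"),
  Designer = c("A", "B", "C"),
  Length   = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# Compare columns by content only, ignoring their names.
cols <- unname(as.list(bridges))
dup_cols <- duplicated(cols) | duplicated(cols, fromLast = TRUE)

# Names of columns that share their contents with another column.
dup_names <- names(bridges)[dup_cols]
dup_names
```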

miachenmtl avatar Oct 14 '17 00:10 miachenmtl