
Bad performance with larger datasets of chr variables?

Open samkulu opened this issue 1 year ago • 4 comments

I am a huge fan of your work and creative thinking!

The provided example works perfectly. Unfortunately, my own datasets and pipes are processed excruciatingly slowly. I am wondering how I could improve the performance in order to get a bearable working tool for pipe diffs.

My dataset data is 3076 x 42 and contains a lot of chr variables like CASENUMBER, TITLE, COUNTRY, DECISION, EMA_TYPE etc. The variable typ stores the character string "EMA_TYPE". The pipe example I used reads as follows:

 stat <- data %>%
   dplyr::group_by(across(any_of(typ)), isAMOUNT) %>%
   dplyr::summarise(COUNT = n()) %>%
   tidyr::spread(isAMOUNT, COUNT) %>%
   stat_rowsum() %>%
   stat_colsum() %>%
   dplyr::rename(AMOUNT = `TRUE`, MISSING = `FALSE`)
  1. The loading of dplyr was properly done with library(dplyr, warn.conflicts = FALSE). Then I used devtools::load_all(".") to load all namespaces (meaning the required pkgs) of my own pkg.
  2. The pipe operator %>% was correctly masked with pipediff::pipediff()
  3. After patiently waiting for the pipes above to run through, I received the following warnings at the end:
Press ENTER to continue...
`summarise()` has grouped output by 'EMA_TYPE'. You can override using the `.groups` argument.
Press ENTER to continue...
Press ENTER to continue...
Press ENTER to continue...
Press ENTER to continue...
Press ENTER to continue...
Warning messages:
1: Exceeded diff limit during diff computation (130265 vs. 50000 allowed); overall diff is likely not optimal 
2: Exceeded diff limit during diff computation (262363 vs. 50000 allowed); in-hunk word diff is likely not optimal 

Any help or clue is greatly appreciated! Once the performance is decent, pipediff will become the most-used package for me, next to dplyr.

samkulu avatar Jun 21 '23 09:06 samkulu

Hi @samkulu and thanks so much for the kind words.

  • The warning messages say that the diff is huge
  • {pipediff} uses diffobj::diffPrint(), which itself compares the printed output of objects
  • The maximum amount printed is capped by getOption("max.print"), which on my system, using RStudio, is 1000
  • However I see in ?options that the normal default value of getOption("max.print") is 99999, and I can verify it in the R GUI (meaning RStudio overrides it)

So my suspicion is that:

  • Either you're not using RStudio
  • Or you define options(max.print = bignumber) in your project or R profile

Is it one of those?
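
A quick way to check which of those applies (a minimal sketch; the 99999 default is the one documented in ?options, and the value you set temporarily is illustrative):

```r
# Inspect the current print cap; ?options documents a default of 99999,
# but RStudio lowers it on startup.
getOption("max.print")

# Temporarily set a small cap so printed output, and hence the diff,
# stays small; options() returns the old values so they can be restored.
old <- options(max.print = 1000)
# ... run the pipediff-instrumented pipe here ...
options(old)
```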

moodymudskipper avatar Jun 22 '23 08:06 moodymudskipper

A drawback of printing less is that we diff only what is printed, not everything, so this might not always be satisfactory.

Luckily we can also tweak some other behavior of diffobj::diffPrint(): all the arguments on ?diffobj::diffPrint that default to a gdo() call take their defaults from options.

In particular you might play with:

  • options(diffobj.max.diffs=)

integer(1L), number of differences (default 50000L) after which we abandon the O(n^2) diff algorithm in favor of a naive O(n) one. Set to -1L to stick to the original algorithm up to the maximum allowed (~INT_MAX/4).

  • options(diffobj.line.limit =)

integer(2L) or integer(1L), if length 1 how many lines of output to show, where -1 means no limit. If length 2, the first value indicates the threshold of screen lines to begin truncating output, and the second the number of lines to truncate to, which should be fewer than the threshold. Note that this parameter is implemented on a best-efforts basis and should not be relied on to produce the exact number of lines requested. In particular do not expect it to work well for values small enough that the banner portion of the diff would have to be trimmed. If you want a specific number of lines use [ or head / tail. One advantage of line.limit over these other options is that you can combine it with context="auto" and auto max.level selection (the latter for diffStr), which allows the diff to dynamically adjust to make best use of the available display lines. [, head, and tail just subset the text of the output.

  • options(diffobj.hunk.limit =)

integer(2L) or integer (1L), how many diff hunks to show. Behaves similarly to line.limit. How many hunks are in a particular diff is a function of how many differences, and also how much context is used since context can cause two hunks to bleed into each other and become one.
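
Putting these together, tightening all three options before running the pipe might look like this (a sketch; the numbers are illustrative, not recommendations):

```r
options(
  diffobj.max.diffs  = 10000L,       # abandon the O(n^2) diff algorithm sooner
  diffobj.line.limit = c(200L, 50L), # start truncating at 200 screen lines, keep ~50
  diffobj.hunk.limit = 10L           # show at most 10 diff hunks
)
```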


moodymudskipper avatar Jun 22 '23 09:06 moodymudskipper

Dear @moodymudskipper

You are absolutely right: getOption("max.print") is limited by default to 1'000 and getOption("diffobj.max.diffs") already has a very large default of 50'000.

Honestly speaking, I am not a huge fan of adjusting default options. My problems actually arose only with too many columns, many of which I want to keep for later evaluations. It was not sufficient to limit the dataset just by rows with head(data, 999). But luckily all problems were gone when I simply omitted the obsolete columns for a single evaluation.

# Still very bad performance with 
data <- head(data, 999) 

# Good performance with 
data <- head(data, 999) %>% select(EMA_TYP, isAMOUNT)

# Same good results with actually needed columns in dataset for single evaluation
data <- data %>% select(EMA_TYP, isAMOUNT)

# Already a significant slowdown with more columns for later evaluations
data <- data %>% select(EMA_TYP, isAMOUNT, CURRENCY, COUNTRY, YEAR)

With this in mind, I have tested the following and sadly the performance was still bad throughout the whole process, even after the problematic step had passed. Having many columns brings me easily to the limits. Not being able to skip a step is a major drawback.

 stat <- data %>%
   # single needed skip for good performance
   select(any_of(typ), isAMOUNT) %>%
   # performance is not improving, but keeps being bad
   dplyr::group_by(across(any_of(typ)), isAMOUNT) %>%
   dplyr::summarise(COUNT = n()) %>%
   tidyr::spread(isAMOUNT, COUNT) %>%
   stat_rowsum() %>%
   stat_colsum() %>%
   dplyr::rename(AMOUNT = `TRUE`, MISSING = `FALSE`)

Maybe there is still room for improvement in the great package pipediff. The user should not have to worry about the data size. So I have some tiny recommendations which could help solve this kind of user problem:

  1. Simply stop debugging if the diff is too big. Or
  2. Actually it turned out that only one step was causing trouble. If I could skip that pipe step, everything would work perfectly. An option such as pipediff::pipediff(skip = 1:2) would be great.
  3. Show a warning when the limits are reached and suggest that the user shrink the dataset, in the columns first and in the rows as well if necessary.
  4. Because pipediff is without doubt a debugging tool, you could break with a preset of options for big datasets. I don't think that anybody really wants to scroll down 50'000 rows in the Viewer panel.

By all means, avoid bad performance. Your package pipediff is far too good for anybody to sit through a long wait with the red stop sign or to have to terminate an R session. As far as I am concerned, the package is the best pipe analyzer for seeing the dataset transformations. I am just too lazy and don't want to worry about having too big a dataset.

samkulu avatar Jun 22 '23 14:06 samkulu

Thanks. The thing is, I think the problem is not the big data but the big diff, so adjusting the options is the way to go, but it could be done through pipediff::pipediff(max_whatever = n), maybe with a more sensible default than the current behavior.
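
Until such an argument exists, one way to approximate it is to tighten the options only for the duration of a call (a sketch; with_small_diffs is a hypothetical helper name and the limits are illustrative):

```r
with_small_diffs <- function(code) {
  # Tighten the diffobj limits, remembering the previous values
  old <- options(
    diffobj.max.diffs  = 5000L,
    diffobj.line.limit = c(100L, 40L)
  )
  on.exit(options(old), add = TRUE)  # restore them even on error
  force(code)                        # evaluate the wrapped expression
}
# usage: with_small_diffs( <the pipediff-instrumented pipe> )
```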

Could you please send me an RDS file of your data so I can reproduce and play around?


moodymudskipper avatar Jun 22 '23 14:06 moodymudskipper