ggpointdensity icon indicating copy to clipboard operation
ggpointdensity copied to clipboard

discussion: use roxygen + use new bandwidth estimation (fixes erors caused by lack of variation) + interprete density in device coordinates + deprecate "knn"

Open jan-glx opened this issue 1 year ago • 2 comments

I have made a few changes in my fork that cover my needs and would be happy to prepare issues / PRs for the parts you think are a good idea/solution.

  • housekeeping
    • use roxygen (16d334df213921e5a55c097bc6fc90e35840e39e, d641221a7061a79a6aa80747ff03e42c4bb6a3a1)
    • use testthat ( 8f0a857dcc969a20064dd325e0d99ffaf1ebe710 , d17e5cccb2a0e226540ed044bf09c567b58989fd )
    • update code copied from ggplot2 (afcf92101d247fdec3d039caa0860c19a96808ce)
    • formatting (afcf92101d247fdec3d039caa0860c19a96808ce)
  • slightly breaking: fix edgecases by using new bandwidth estimation based on the expanded scale limits instead of the data range/standard deviation (3d75d67c0eccc3bf2d452d8c86540049c22e358a, a7bc1fb6ab40d152083668b6546aa10a9689d0e7)
  • slightly breaking: define density as density in device coordinates (i.e. in the plot you see, however it might be scaled) (b18b99f1b4cf111498f96496e5f2194696cc7406, 9b5e4c72604951cf52b2f56e432148df63bf7571)
  • deprecate the original "knn" based density estimation since it has fundamental issues with non-continous data and is anyway slower (50c2e203820f2cf90fdf921cd0460570bb80da49 (this actually just changes the default))

jan-glx avatar Dec 01 '23 14:12 jan-glx

Hi @jan-glx , thanks a lot for this massive PR! I finally found some time to prepare the release of a new CRAN version of this package, and now I'm deciding which new features to add in the next release.

The housekeeping stuff makes sense of course, but I have some questions regarding your other commits. It's been a few years since I wrote the original code so I forgot some details that would help me understand what you did :smile: So let me ask some questions to clarify and refresh my memory.

Regarding 3d75d67c0eccc3bf2d452d8c86540049c22e358a, what's the between using diff(scales$x$get_limits()) and scale_views$x.range to get the data range in practice? I was under the impression that scales$x$get_limits() does not return the data range, but rather the range of the plot limits. So shouldn't it be roughly equivalent to your solution, or did you find some case where this makes a difference?

Regarding b18b99f1b4cf111498f96496e5f2194696cc7406 and 9b5e4c72604951cf52b2f56e432148df63bf7571, I do believe the density should be calculated in device coordinates. So the rationale behind your changes makes perfect sense to me, and honestly I thought that's what my implementation was already doing. So I guess you again found some scenario where this is not the case, or my original implementation was buggy. Which one is it? I installed your fork and played around with it and works nicely, although the Warning: Actual plot aspect ratio does not match aspect.ratio of StatPointdensity might be a little annoying in day-to-day use. So do you think there is a reasonable default way to handle this? I haven't checked your code in great detail but would it be possible to automatically use the actual_aspect_ratio for density calculation?

Regarding 50c2e203820f2cf90fdf921cd0460570bb80da49, @slowkow benchmarked both methods a while ago (https://twitter.com/slowkow/status/1169047484716466176?t=bOaQ_Def4u2qz_9zwystiw&s=19) and found that the original method is faster (for small-ish data sets at least). So I guess it has some advantages. In any case, I want to overhaul the methods parameter in the long run but haven't made up my mind on the details yet, so I guess for now it can stay the way it currently is.

LKremer avatar May 16 '24 14:05 LKremer

Hi @LKremer great to see you finding some time for this nice package, unfortunately I don't have any spare time at the moment and hardly remember how I reached this PR. Let me give a few unverified short comments that might help making use of this work:

regarding 3d75d67c0eccc3bf2d452d8c86540049c22e358a : get_limits does not include expansion or the limits set at coordinate level; often the difference is not huge sometimes it is, most prominently when the data range has width 0, see links in 3d75d67c0eccc3bf2d452d8c86540049c22e358a#commitcomment-142102424)

Regarding https://github.com/LKremer/ggpointdensity/commit/b18b99f1b4cf111498f96496e5f2194696cc7406 and https://github.com/LKremer/ggpointdensity/commit/9b5e4c72604951cf52b2f56e432148df63bf7571, your dx/dy implementation does achieve that to some extent but only precisely if the aspect ratio is 1 (any maybe also only if there are no transformations and equal expansions).

There is no way to fix this completely automatically as the ggplot does not know on what device it will be be plotted at that time. If you resize the plot in Rstudio the aspect ratio changes (and thus the correct densities will be different) but no ggplot function is called during the resize. The easy way to fix this is to always use + theme(aspect.ratio=1) when using +geom_pointdensity, one workaround for the warning might be an option that a user can set to ignore if the aspect ratio is off for less than xx%

Regarding https://github.com/LKremer/ggpointdensity/commit/50c2e203820f2cf90fdf921cd0460570bb80da49 : yes knn is O(n log(n)) MASS:kde is O(n m^2) where m is the number of grid points for each axis, one could limit evaluation to the n points of interest for small n to speed it up for smaller datasets, but in the end only the case with many data-points is the one that matters, the one where one actually needs to wait. So while MASS:kde might not always be optimal it is always good.

jan-glx avatar May 16 '24 15:05 jan-glx