
Cache various statistics to improve performance

JCGoran opened this pull request 1 year ago • 2 comments

An implementation that resolves #201.

Describe your changes

This is a rather large-looking PR (only because I am basing it off of https://github.com/stefmolin/data-morph/pull/198 for ease of merging!), and is a first attempt (which is why I am marking it as a draft) at caching the various statistics properties described in #201.

TL;DR: since at each iteration we only move one point in the dataset (basically map $(x_i, y_i) \mapsto (x'_i, y'_i)$), why not compute the new statistics ($\langle X' \rangle$, $\text{Var}(X')$, and $r(X', Y')$) in terms of the old ones ($\langle X \rangle$, $\text{Var}(X)$, and $r(X, Y)$), plus the old ($(x_i, y_i)$) and new ($(x'_i, y'_i)$) points? This is essentially what this PR does.

Details

Mathematical derivation

The mean is shifted as:

$$ \langle X' \rangle = \langle X \rangle + \frac{\delta_x}{n} $$

where $\delta_x = x'_i - x_i$ (and analogously for $Y$). The variance (or, if you prefer, the standard deviation) is shifted as:

$$ \text{Var}(X') = \text{Var}(X) + 2 \frac{\delta_x}{n} (x_i - \langle X \rangle) + \frac{\delta_x^2}{n} - \frac{\delta_x^2}{n^2} $$

while the correlation coefficient is shifted as:

$$ r(X', Y') = \frac{\langle X Y \rangle + \frac{1}{n}(\delta_x y_i + \delta_y x_i + \delta_x \delta_y) - \langle X' \rangle \langle Y' \rangle}{\sqrt{\text{Var}(X') \text{Var}(Y')}} $$

where the only new quantity to keep track of is $\langle X Y \rangle$ (the rest can be obtained from the two formulas above).
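As a sanity check of the three formulas, here is a minimal numpy sketch (not part of the PR; all names are illustrative) comparing the incremental updates against a from-scratch recomputation:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x, y = rng.normal(size=n), rng.normal(size=n)

# Cached statistics (population convention, i.e. no Bessel's correction).
mean_x, mean_y = x.mean(), y.mean()
var_x, var_y = x.var(), y.var()
mean_xy = (x * y).mean()

# Move a single point: (x_i, y_i) -> (x_i + dx, y_i + dy).
i, dx, dy = 17, 0.3, -0.5

# O(1) updates using the formulas above.
new_mean_x = mean_x + dx / n
new_mean_y = mean_y + dy / n
new_var_x = var_x + 2 * dx / n * (x[i] - mean_x) + dx**2 / n - dx**2 / n**2
new_var_y = var_y + 2 * dy / n * (y[i] - mean_y) + dy**2 / n - dy**2 / n**2
new_mean_xy = mean_xy + (dx * y[i] + dy * x[i] + dx * dy) / n
new_corr = (new_mean_xy - new_mean_x * new_mean_y) / np.sqrt(new_var_x * new_var_y)

# Recompute everything from scratch on the perturbed data to verify.
x[i] += dx
y[i] += dy
assert np.isclose(new_mean_x, x.mean())
assert np.isclose(new_var_x, x.var())
assert np.isclose(new_corr, np.corrcoef(x, y)[0, 1])
```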

Workflow

With the math out of the way, this is how the new way of computing the statistics works:

  • read the (x, y) dataset, and make an instance of Statistics (the name is subject to change, as it is somewhat ambiguous) from it
  • start the iteration
  • Statistics has a method, perturb, which takes 3 arguments (the row index we wish to perturb, plus the 2 perturbation values in the x and y directions) and returns an instance of the new SummaryStatistics (i.e. the one computed from the perturbed data)
  • if _is_close_enough returns True, we call the perturb method again (it's $O(1)$ so not very expensive to call), but this time with update=True, which actually updates the data in the Statistics instance
  • go to the next iteration

Note that there are a number of new methods (shifted_mean, shifted_var, shifted_stdev, shifted_corrcoef), which we may or may not want to make private (or refactor so they accept different arguments), as I didn't put too much thought into the overall API design (again, hence the "draft" status). A rough sketch of the workflow is shown below.
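To make the workflow above concrete, here is a sketch of what such a Statistics class could look like. The method name perturb and its update flag come from the description above; everything else (attribute names, return value) is an illustrative guess, not the PR's actual code:

```python
class Statistics:
    """Illustrative cache of the running statistics (not the PR's actual class)."""

    def __init__(self, x, y):
        self.x, self.y = list(x), list(y)
        self.n = n = len(self.x)
        self.x_mean = sum(self.x) / n
        self.y_mean = sum(self.y) / n
        self.xy_mean = sum(a * b for a, b in zip(self.x, self.y)) / n
        self.x_var = sum((a - self.x_mean) ** 2 for a in self.x) / n
        self.y_var = sum((b - self.y_mean) ** 2 for b in self.y) / n

    def perturb(self, row, dx, dy, update=False):
        """Return the statistics after moving one point, in O(1) time."""
        n, xi, yi = self.n, self.x[row], self.y[row]
        x_mean = self.x_mean + dx / n
        y_mean = self.y_mean + dy / n
        x_var = self.x_var + 2 * dx / n * (xi - self.x_mean) + dx**2 / n - dx**2 / n**2
        y_var = self.y_var + 2 * dy / n * (yi - self.y_mean) + dy**2 / n - dy**2 / n**2
        xy_mean = self.xy_mean + (dx * yi + dy * xi + dx * dy) / n
        corr = (xy_mean - x_mean * y_mean) / (x_var * y_var) ** 0.5
        if update:  # commit the move into the cached state
            self.x[row] += dx
            self.y[row] += dy
            self.x_mean, self.y_mean = x_mean, y_mean
            self.x_var, self.y_var = x_var, y_var
            self.xy_mean = xy_mean
        return x_mean, y_mean, x_var, y_var, corr
```

The two-phase pattern (a dry run to evaluate the candidate move, then a second call with update=True to commit it) is what lets _is_close_enough be checked without touching the cached state.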

Performance

Now, onto the performance results! Note that the benchmarks were done with #201 already in there, so the absolute numbers differ from the ones on main, but the reduction in the number of function calls is evident. I tested everything using python -m cProfile -m data_morph --seed 42 --start-shape panda --target-shape [shape].

# star shape - perf before
79257097 function calls (78610163 primitive calls) in 43.075 seconds
# star shape - perf after
54021781 function calls (52470658 primitive calls) in 28.704 seconds
# rings shape - perf before
94095798 function calls (93420792 primitive calls) in 43.998 seconds
# rings shape - perf after
68846080 function calls (67219947 primitive calls) in 29.592 seconds
# bullseye shape - perf before
82412199 function calls (81796202 primitive calls) in 40.245 seconds
# bullseye shape - perf after
57109757 function calls (55635954 primitive calls) in 26.200 seconds

so we make roughly 26 million fewer (primitive) function calls for all shapes in question, which translates to about 33-35% faster runtimes (computed as ((t_old - t_new) / t_old) * 100).

Discussion items

Some possible points of discussion:

  • I didn't pay too much attention to Bessel's correction in the derivation of the formulas above, though I imagine the numerical difference should be mostly negligible since all of the starting datasets have > 100 points
  • API design (make shifted_* functions private? Could perturb do without an update argument? etc.)
  • the tests for the shifted_* functions are a bit repetitive, and could take a while to run (I was initially very skeptical of the numerical stability, but it appears to be fine, though it could simply be that I haven't tried enough scenarios to hit an instability); a sketch of the kind of stability check I mean is below
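For illustration, a stability check along those lines might look like the following hypothetical pytest sketch (not the PR's actual tests), reusing the illustrative Statistics class from the workflow section:

```python
import numpy as np
import pytest

@pytest.mark.parametrize('seed', range(5))
def test_shifted_stats_are_stable(seed):
    rng = np.random.default_rng(seed)
    n = 150
    x, y = rng.normal(size=n), rng.normal(size=n)
    stats = Statistics(x, y)  # the illustrative class sketched above
    # apply many random single-point moves, committing each one
    for _ in range(10_000):
        row = int(rng.integers(n))
        dx, dy = rng.normal(scale=0.1, size=2)
        stats.perturb(row, dx, dy, update=True)
    # the cached values should not have drifted from a fresh recomputation
    assert np.isclose(stats.x_mean, np.mean(stats.x))
    assert np.isclose(stats.x_var, np.var(stats.x))
    assert np.isclose(stats.xy_mean, np.mean(np.array(stats.x) * np.array(stats.y)))
```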

Feedback is welcome :)

Checklist

  • [x] Test cases have been modified/added to cover any code changes.
  • [x] Docstrings have been modified/created for any code changes.
  • [x] All linting and formatting checks pass (see the contributing guidelines for more information).

JCGoran · Jul 21 '24 14:07

Codecov Report

Attention: Patch coverage is 96.47059% with 9 lines in your changes missing coverage. Please review.

Project coverage is 98.21%. Comparing base (5d0486b) to head (656c96d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/data_morph/data/stats.py | 91.34% | 6 Missing and 3 partials |
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #204      +/-   ##
==========================================
- Coverage   98.47%   98.21%   -0.27%     
==========================================
  Files          43       43              
  Lines        1839     2075     +236     
  Branches      114      129      +15     
==========================================
+ Hits         1811     2038     +227     
- Misses         25       31       +6     
- Partials        3        6       +3     
| Files with missing lines | Coverage Δ |
|---|---|
| src/data_morph/data/dataset.py | 73.75% <100.00%> (+0.67%) |
| src/data_morph/morpher.py | 100.00% <100.00%> (ø) |
| src/data_morph/plotting/static.py | 100.00% <100.00%> (ø) |
| tests/data/test_stats.py | 100.00% <100.00%> (ø) |
| tests/test_morpher.py | 100.00% <100.00%> (ø) |
| src/data_morph/data/stats.py | 91.66% <91.34%> (-8.34%) |

codecov[bot] · Nov 11 '24 17:11

Regarding the median: I was not satisfied with the performance of pure-Python implementations of sorted containers, which have $O(\log n)$ time complexity for the various operations (add, remove, search), so I decided to implement my own in C++ (std::map is basically a sorted version of Python's dict, with the tradeoff that operations are $O(\log n)$ rather than $O(1)$), with bindings to Python via nanobind.
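For comparison, the pure-Python baseline being referred to could look something like the following sketch built on sortedcontainers.SortedList (an assumption; the comment doesn't name a specific library), where add, remove, and indexing are all $O(\log n)$-ish:

```python
# Illustrative pure-Python running median (not the C++/nanobind implementation
# from the comment above): a sorted container supporting O(log n) updates.
from sortedcontainers import SortedList

class RunningMedian:
    def __init__(self, values):
        self.values = SortedList(values)

    def replace(self, old, new):
        """Move one point by removing its old value and inserting the new one."""
        self.values.remove(old)
        self.values.add(new)

    @property
    def median(self):
        n = len(self.values)
        mid = n // 2
        if n % 2:
            return self.values[mid]
        return (self.values[mid - 1] + self.values[mid]) / 2
```

Note that the cheap median lookup relies on indexing into the sorted container (self.values[mid]); a plain std::map does not expose such order statistics out of the box, which presumably motivated a custom implementation rather than a thin wrapper.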

JCGoran · Nov 12 '24 12:11