tantivy icon indicating copy to clipboard operation
tantivy copied to clipboard

Issue 1787 extended stats

Open giovannicuccu opened this issue 2 years ago • 10 comments
trafficstars

Here is a pull request for #1787 Some notes:

  • the "simple" stats are calculated via ExtendedStats (with the same logic used for Min, Max, etc)
  • the tests use an approximate comparison using an extended versione of the macro assert_nearly_equals which accepts an extra parameter specifying the max allowed difference between the two values
  • The algorithm for calculating the variance is Welford's online algorithm with a modification for calculating the mean since the code can use the sum (there is a comment in the code)
  • The sum is computed using the Kahan algorithm (as in elastic search)
  • The value sum_of_squares of extended stats is the same value computed by elastic search even if sum of squares has a different meaning. The "real" sum of squares is in the code so if you decide to expose it the change is straightforward
  • For a specific data set I made a comparison with the numbers returned by elastic search, here is a simple summary:
variance variance_sampling std_deviation std_deviation_sampling lower lower_sampling upper upper_sampling sum of squares
ES 9.138888888888891 10.966666666666669 3.023059524536176 3.3115957885386114 -0.8794523824056855 -1.4565249104105558 11.21278571573902 11.78985824374389 215.0
Tantivy 9.138888888888888 10.966666666666663 3.023059524536176 3.311595788538611 -0.8794523824056837 -1.4565249104105549 11.212785715739017 11.78985824374389 215.0
  • I also did a comparison using values where the mean is very similar to the variance and Tantivy calculations are more precise (I used Excel as a reference for obtaining the value) because Elastic Search does not implement the Welford Algorithm.

giovannicuccu avatar Nov 11 '23 13:11 giovannicuccu

@PSeitz can you review?

fulmicoton avatar Nov 12 '23 01:11 fulmicoton

This is just a ping. I'm working on it, I created a dataset and a simple application that uses several gb of memory for computing some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

giovannicuccu avatar Nov 19 '23 07:11 giovannicuccu

This is just a ping. I'm working on it, I created a dataset and a simple application that uses several gb of memory for computing some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

Just as a heads up, please no macros. Generics would be fine, if the complexity stays low. Otherwise we just can keep it seperate

PSeitz avatar Nov 19 '23 11:11 PSeitz

Some notes:

  • I used a Box<ExtendedStats> because the size of the enum without the box was impacting the memory usage of all the stat searches. In my experiments a term aggregation without h the box wsa consuming about 15 more ram than the actual version.
  • I splitted Stats and ExtendedStats, the latter is using the former
  • I introduced a new intermediate struct that makes the SegmentStatsCollector more flexible making it work with both stats and extendedstats; in future, if needed, you can implement the single stats (min,max,avg) independently with a minimum effort.

giovannicuccu avatar Nov 24 '23 13:11 giovannicuccu

Hello, I'd like to know how to proceed with this pull request, it has been pending for a while.

giovannicuccu avatar Apr 15 '24 10:04 giovannicuccu

Hello, I'd like to know how to proceed with this pull request, it has been pending for a while.

Sorry I forgot about this PR. It's fine to ping occasionally. Can you resolve the merge conflicts? I consider adding a benchmark that also reports memory consumption for the aggregation benchmarks

PSeitz avatar Apr 15 '24 14:04 PSeitz

Merge is done, do you have an example for benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

giovannicuccu avatar Apr 20 '24 10:04 giovannicuccu

Merge is done, do you have an example for benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

Thanks! memory reporting is coming with this PR: https://github.com/quickwit-oss/tantivy/pull/2378

PSeitz avatar Apr 26 '24 10:04 PSeitz

Hi, quick update: I added an extended stats bench in the corresponding file, my fork is merged with main; I'm waiting for #2378 to be merged into main before aligning with with it and then pushing it to this pr.

giovannicuccu avatar May 04 '24 09:05 giovannicuccu

https://github.com/quickwit-oss/tantivy/pull/2378 is now merged

PSeitz avatar May 07 '24 09:05 PSeitz

new version merged with main plus an update bench for extendedstats

giovannicuccu avatar May 11 '24 09:05 giovannicuccu

The memory consumption increased by 10% for those two benchmarks. I don't think you should pay for extended stats if you are not using it

terms_many_with_avg_sub_agg                     Memory: 29.0 MB  (+9.01%)    Avg: 131.7313ms  (-3.79%)     Median: 125.7534ms  (+3.04%)    [118.0482ms .. 180.8836ms]
terms_many_json_mixed_type_with_sub_agg_card    Memory: 44.0 MB  (+9.95%)    Avg: 176.1016ms  (+13.39%)    Median: 161.8976ms  (+8.52%)    [151.0041ms .. 251.0137ms]

PSeitz avatar May 15 '24 04:05 PSeitz

Hi, I dug into the issue and there 2 reason for the additional memory consumption:

  1. I created a unified SegmentStatsCollector (both for stats and extendedstats) that requires more memory. I developed a different solution i.e. a dedicated SegmentExtendedStatsCollector that duplicates part of the code, but does not allocate more memory if you use the regular stats. This changes reduces the memory increase to 5% (i.e. it halves the memory increment from main)
  2. I modified the stats to use Kahan summation aglorithm, because it's more precise. The new algorithm requires an extra f64, that is responsible for the remaining 5% memory increase. I can revert the stats implementation, but you are losing something in terms of precision (Elasticsearch uses Kahan summation).

Should I revert even the modification on stats in order to restore the previous memory usage?

Thanks

giovannicuccu avatar May 18 '24 16:05 giovannicuccu

Hi, thanks a lot for your review, this submission contains the code modifications. This version still shows a memory usage increase (halved since the last submission) because of the introduction of Kahan summation in stats.

giovannicuccu avatar May 19 '24 13:05 giovannicuccu

This version should reflect the changes you requested with the exception of the stat labels; they seem to me the same as ElasticSearch

giovannicuccu avatar Jun 01 '24 10:06 giovannicuccu