tantivy Issue 1787 extended stats

trafficstars

Here is a pull request for #1787 Some notes:

the "simple" stats are calculated via ExtendedStats (with the same logic used for Min, Max, etc)
the tests use an approximate comparison using an extended versione of the macro assert_nearly_equals which accepts an extra parameter specifying the max allowed difference between the two values
The algorithm for calculating the variance is Welford's online algorithm with a modification for calculating the mean since the code can use the sum (there is a comment in the code)
The sum is computed using the Kahan algorithm (as in elastic search)
The value sum_of_squares of extended stats is the same value computed by elastic search even if sum of squares has a different meaning. The "real" sum of squares is in the code so if you decide to expose it the change is straightforward
For a specific data set I made a comparison with the numbers returned by elastic search, here is a simple summary:

	variance	variance_sampling	std_deviation	std_deviation_sampling	lower	lower_sampling	upper	upper_sampling	sum of squares
ES	9.138888888888891	10.966666666666669	3.023059524536176	3.3115957885386114	-0.8794523824056855	-1.4565249104105558	11.21278571573902	11.78985824374389	215.0
Tantivy	9.138888888888888	10.966666666666663	3.023059524536176	3.311595788538611	-0.8794523824056837	-1.4565249104105549	11.212785715739017	11.78985824374389	215.0

I also did a comparison using values where the mean is very similar to the variance and Tantivy calculations are more precise (I used Excel as a reference for obtaining the value) because Elastic Search does not implement the Welford Algorithm.

Nov 11 '23 13:11 giovannicuccu

@PSeitz can you review?

Nov 12 '23 01:11 fulmicoton

This is just a ping. I'm working on it, I created a dataset and a simple application that uses several gb of memory for computing some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

Nov 19 '23 07:11 giovannicuccu

This is just a ping. I'm working on it, I created a dataset and a simple application that uses several gb of memory for computing some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

Just as a heads up, please no macros. Generics would be fine, if the complexity stays low. Otherwise we just can keep it seperate

Nov 19 '23 11:11 PSeitz

Some notes:

I used a Box<ExtendedStats> because the size of the enum without the box was impacting the memory usage of all the stat searches. In my experiments a term aggregation without h the box wsa consuming about 15 more ram than the actual version.
I splitted Stats and ExtendedStats, the latter is using the former

I introduced a new intermediate struct that makes the SegmentStatsCollector more flexible making it work with both stats and extendedstats; in future, if needed, you can implement the single stats (min,max,avg) independently with a minimum effort.

Nov 24 '23 13:11 giovannicuccu

Hello, I'd like to know how to proceed with this pull request, it has been pending for a while.

Apr 15 '24 10:04 giovannicuccu

Hello, I'd like to know how to proceed with this pull request, it has been pending for a while.

Sorry I forgot about this PR. It's fine to ping occasionally. Can you resolve the merge conflicts? I consider adding a benchmark that also reports memory consumption for the aggregation benchmarks

Apr 15 '24 14:04 PSeitz

Merge is done, do you have an example for benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

Apr 20 '24 10:04 giovannicuccu

Merge is done, do you have an example for benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

Thanks! memory reporting is coming with this PR: https://github.com/quickwit-oss/tantivy/pull/2378

Apr 26 '24 10:04 PSeitz

Hi, quick update: I added an extended stats bench in the corresponding file, my fork is merged with main; I'm waiting for #2378 to be merged into main before aligning with with it and then pushing it to this pr.

May 04 '24 09:05 giovannicuccu

https://github.com/quickwit-oss/tantivy/pull/2378 is now merged

May 07 '24 09:05 PSeitz

new version merged with main plus an update bench for extendedstats

May 11 '24 09:05 giovannicuccu

The memory consumption increased by 10% for those two benchmarks. I don't think you should pay for extended stats if you are not using it

terms_many_with_avg_sub_agg                     Memory: 29.0 MB  (+9.01%)    Avg: 131.7313ms  (-3.79%)     Median: 125.7534ms  (+3.04%)    [118.0482ms .. 180.8836ms]
terms_many_json_mixed_type_with_sub_agg_card    Memory: 44.0 MB  (+9.95%)    Avg: 176.1016ms  (+13.39%)    Median: 161.8976ms  (+8.52%)    [151.0041ms .. 251.0137ms]

May 15 '24 04:05 PSeitz

Hi, I dug into the issue and there 2 reason for the additional memory consumption:

I created a unified SegmentStatsCollector (both for stats and extendedstats) that requires more memory. I developed a different solution i.e. a dedicated SegmentExtendedStatsCollector that duplicates part of the code, but does not allocate more memory if you use the regular stats. This changes reduces the memory increase to 5% (i.e. it halves the memory increment from main)
I modified the stats to use Kahan summation aglorithm, because it's more precise. The new algorithm requires an extra f64, that is responsible for the remaining 5% memory increase. I can revert the stats implementation, but you are losing something in terms of precision (Elasticsearch uses Kahan summation).

Should I revert even the modification on stats in order to restore the previous memory usage?

Thanks

May 18 '24 16:05 giovannicuccu

Hi, thanks a lot for your review, this submission contains the code modifications. This version still shows a memory usage increase (halved since the last submission) because of the introduction of Kahan summation in stats.

May 19 '24 13:05 giovannicuccu

This version should reflect the changes you requested with the exception of the stat labels; they seem to me the same as ElasticSearch

Jun 01 '24 10:06 giovannicuccu

tantivy tantivy copied to clipboard

Issue 1787 extended stats

tantivy
tantivy copied to clipboard