tantivy
tantivy copied to clipboard
Issue 1787 extended stats
Here is a pull request for #1787 Some notes:
- the "simple" stats are calculated via ExtendedStats (with the same logic used for Min, Max, etc)
- the tests use an approximate comparison using an extended versione of the macro
assert_nearly_equalswhich accepts an extra parameter specifying the max allowed difference between the two values - The algorithm for calculating the variance is Welford's online algorithm with a modification for calculating the mean since the code can use the sum (there is a comment in the code)
- The sum is computed using the Kahan algorithm (as in elastic search)
- The value sum_of_squares of extended stats is the same value computed by elastic search even if sum of squares has a different meaning. The "real" sum of squares is in the code so if you decide to expose it the change is straightforward
- For a specific data set I made a comparison with the numbers returned by elastic search, here is a simple summary:
| variance | variance_sampling | std_deviation | std_deviation_sampling | lower | lower_sampling | upper | upper_sampling | sum of squares | |
|---|---|---|---|---|---|---|---|---|---|
| ES | 9.138888888888891 | 10.966666666666669 | 3.023059524536176 | 3.3115957885386114 | -0.8794523824056855 | -1.4565249104105558 | 11.21278571573902 | 11.78985824374389 | 215.0 |
| Tantivy | 9.138888888888888 | 10.966666666666663 | 3.023059524536176 | 3.311595788538611 | -0.8794523824056837 | -1.4565249104105549 | 11.212785715739017 | 11.78985824374389 | 215.0 |
- I also did a comparison using values where the mean is very similar to the variance and Tantivy calculations are more precise (I used Excel as a reference for obtaining the value) because Elastic Search does not implement the Welford Algorithm.
@PSeitz can you review?
This is just a ping. I'm working on it, I created a dataset and a simple application that uses several gb of memory for computing some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.
This is just a ping. I'm working on it, I created a dataset and a simple application that uses several gb of memory for computing some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.
Just as a heads up, please no macros. Generics would be fine, if the complexity stays low. Otherwise we just can keep it seperate
Some notes:
- I used a
Box<ExtendedStats>because the size of the enum without the box was impacting the memory usage of all the stat searches. In my experiments a term aggregation without h the box wsa consuming about 15 more ram than the actual version. - I splitted Stats and ExtendedStats, the latter is using the former
- I introduced a new intermediate struct that makes the
SegmentStatsCollectormore flexible making it work with both stats and extendedstats; in future, if needed, you can implement the single stats (min,max,avg) independently with a minimum effort.
Hello, I'd like to know how to proceed with this pull request, it has been pending for a while.
Hello, I'd like to know how to proceed with this pull request, it has been pending for a while.
Sorry I forgot about this PR. It's fine to ping occasionally. Can you resolve the merge conflicts? I consider adding a benchmark that also reports memory consumption for the aggregation benchmarks
Merge is done, do you have an example for benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.
Merge is done, do you have an example for benchmark that also reports memory consumption? I ran
cargo benchbut saw no indication of memory consumption.
Thanks! memory reporting is coming with this PR: https://github.com/quickwit-oss/tantivy/pull/2378
Hi, quick update: I added an extended stats bench in the corresponding file, my fork is merged with main; I'm waiting for #2378 to be merged into main before aligning with with it and then pushing it to this pr.
https://github.com/quickwit-oss/tantivy/pull/2378 is now merged
new version merged with main plus an update bench for extendedstats
The memory consumption increased by 10% for those two benchmarks. I don't think you should pay for extended stats if you are not using it
terms_many_with_avg_sub_agg Memory: 29.0 MB (+9.01%) Avg: 131.7313ms (-3.79%) Median: 125.7534ms (+3.04%) [118.0482ms .. 180.8836ms]
terms_many_json_mixed_type_with_sub_agg_card Memory: 44.0 MB (+9.95%) Avg: 176.1016ms (+13.39%) Median: 161.8976ms (+8.52%) [151.0041ms .. 251.0137ms]
Hi, I dug into the issue and there 2 reason for the additional memory consumption:
- I created a unified SegmentStatsCollector (both for stats and extendedstats) that requires more memory. I developed a different solution i.e. a dedicated SegmentExtendedStatsCollector that duplicates part of the code, but does not allocate more memory if you use the regular stats. This changes reduces the memory increase to 5% (i.e. it halves the memory increment from main)
- I modified the stats to use Kahan summation aglorithm, because it's more precise. The new algorithm requires an extra f64, that is responsible for the remaining 5% memory increase. I can revert the stats implementation, but you are losing something in terms of precision (Elasticsearch uses Kahan summation).
Should I revert even the modification on stats in order to restore the previous memory usage?
Thanks
Hi, thanks a lot for your review, this submission contains the code modifications. This version still shows a memory usage increase (halved since the last submission) because of the introduction of Kahan summation in stats.
This version should reflect the changes you requested with the exception of the stat labels; they seem to me the same as ElasticSearch