DataGenerator
Adding statistic feature
During the hackathon I worked on a statistics feature. It can record the min / max / ... value for a given property.
What do you think, do we need it in DG?
I'm thinking of something like:
<assign name="var_1" expr="#{Yuka}" statistic="max" />
<assign name="var_2" expr="#{Buka}" statistic="min,average" />
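To make the proposed syntax a bit more concrete, here is a rough Python sketch (not DG code) of how a comma-separated `statistic` value such as "min,average" might be split and applied to the values an expression produced. The names `REDUCERS`, `parse_statistics` and `summarize` are hypothetical, and the sketch keeps all values in a list purely for simplicity.

```python
# Hypothetical mapping from names allowed in a `statistic` attribute
# to functions that reduce a list of numbers to a single value.
REDUCERS = {
    "min": min,
    "max": max,
    "average": lambda values: sum(values) / len(values),
}


def parse_statistics(attr):
    """Split a value like "min,average" into a list of known statistic names."""
    names = [name.strip() for name in attr.split(",") if name.strip()]
    unknown = [name for name in names if name not in REDUCERS]
    if unknown:
        raise ValueError("unknown statistic(s): " + ", ".join(unknown))
    return names


def summarize(values, attr):
    """Compute every statistic requested in `attr` over `values`."""
    return {name: REDUCERS[name](values) for name in parse_statistics(attr)}


# Values assumed to have been generated for var_2 in the example above.
result = summarize([3, 9, 6], "min,average")
print(result)
```

Rejecting unknown names early would surface a typo in the attribute as a configuration error instead of silently skipping a statistic.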
With this feature, DG could not only generate data, but also solve some problems!
What do you think?
If you like this, what features / syntax / modularity would you propose?
P.S. During the hackathon I only implemented the max calculation for one specific property, so I can't just push it as is. It still needs custom tags + min / ... features + unit tests + docs + examples.
Good idea. I would recommend implementing it as an incremental statistics plugin.
All you need to keep track of is sum of X, sum of X^2 and the count of points.
From those you can calculate, at any time, mean = (sum of X) / n and variance = (sum of X^2) / n - mean^2, where n is the count of points. You can also merge the statistics from various mappers, in case the job was executed on multiple nodes, by adding the three values together. Note that this variance is biased, i.e. computed by dividing by n instead of n - 1.
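A minimal Python sketch of that idea, for illustration only: `RunningStats` and its method names are hypothetical and not part of DG, and the two loops stand in for two mappers that each see part of the data.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Keeps only the count, the sum of X and the sum of X^2."""
    count: int = 0
    total: float = 0.0      # sum of X
    total_sq: float = 0.0   # sum of X^2

    def add(self, x: float) -> None:
        """Fold one more data point into the running totals."""
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine partial results from two mappers by plain addition."""
        return RunningStats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    @property
    def mean(self) -> float:
        # Undefined (raises ZeroDivisionError) until at least one point is added.
        return self.total / self.count

    @property
    def variance(self) -> float:
        # Biased (population) variance: divides by n, not n - 1.
        return self.total_sq / self.count - self.mean ** 2


# Two "mappers" each process part of the data, then the results are merged.
a, b = RunningStats(), RunningStats()
for x in (2.0, 4.0, 4.0, 4.0):
    a.add(x)
for x in (5.0, 5.0, 7.0, 9.0):
    b.add(x)
merged = a.merge(b)
print(merged.count, merged.mean, merged.variance)
```

One caveat: the sum-of-squares formula can lose precision when the mean is large compared to the spread of the data. Welford's algorithm is a common alternative if that turns out to matter.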
BTW, Dropwizard Metrics (previously Yammer Metrics) implements lots of stream-like statistics: https://dropwizard.github.io/metrics/3.1.0/ We might not want to use it as a dependency, but it implements a streaming median, p99, p999, etc. that we can most likely reuse for this purpose. Look at its histograms ( https://dropwizard.github.io/metrics/3.1.0/getting-started/#histograms ).
A sample histogram result:
"MyHistogram": {
  "type": "histogram",
  "count": 2041364,
  "min": 0,
  "max": 65202,
  "mean": 132.21681287609658,
  "std_dev": 444.84831733762644,
  "median": 44,
  "p75": 118,
  "p95": 457.39999999999964,
  "p98": 922.8599999999986,
  "p99": 1145.4000000000015,
  "p999": 1715.594
}
Need to discuss further to define the scope.