timescaledb-toolkit icon indicating copy to clipboard operation
timescaledb-toolkit copied to clipboard

Error on stats_agg with moving aggregate mode can be large, performance is good.

Open davidkohn88 opened this issue 2 years ago • 2 comments

Describe the bug Queries like SELECT stddev(stats_agg(price) OVER (PARTITION BY symbol ORDER BY time RANGE '15 min' PRECEDING) FROM stocks_real_time WHERE symbol = 'AAPL' while significantly more performant than the equivalent Postgres operation (SELECT stddev(price) OVER (PARTITION BY symbol ORDER BY time RANGE '15 min' PRECEDING FROM stocks_real_time WHERE symbol = 'AAPL') can produce more error due to double precision arithmetic.

You can find the "real" value by using: SELECT stddev(price::numeric) OVER (PARTITION BY symbol ORDER BY time RANGE '15 min' PRECEDING FROM stocks_real_time WHERE symbol = 'AAPL' which is reasonably performant, while also producing a smaller amount of error.

We'd like to reduce the error and have a performant moving aggregate mode, finding a better tradeoff by using higher precision floating point values or perhaps calling the numeric equivalents could be useful.

davidkohn88 avatar Oct 25 '22 21:10 davidkohn88

Notes from our discussion yesterday:

  • Using a TwoFloat (pair of f64s) internally when in rolling aggregate mode increases accuracy by around a few orders of magnitude, but is ~30% slower to create (benchmarked using pgbench). Here's a query we were using to test the accuracy of TwoFloat compared to numeric and the current implementation
  • Using a Postgres numeric gives us the most accurate results possible (I use it as a source of truth for the correct values so I assume it is always exactly correct), but is a bit slower than TwoFloat.

When creating a stats aggregate in rolling aggregate mode we always convert it to a normal f64-based aggregate in the final function, since the extra precision is much more useful for preserving accuracy in the transition function than in the functions that ultimately get called on the aggregate.

My current work to make the internal aggregate data generic over the data type, and add stats_agg_tf which uses TwoFloats internally is on the sv/stats_agg-twofloat branch.

syvb avatar Oct 26 '22 12:10 syvb

#599 added support for using TwoFloats internally when stats_agg is in moving aggregate mode. I'll keep this issue open to track adding support for having stats_agg be able to use numeric to keep track of values as well.

syvb avatar Nov 08 '22 18:11 syvb