timescaledb-toolkit
timescaledb-toolkit copied to clipboard
Error on stats_agg with moving aggregate mode can be large, performance is good.
Describe the bug
Queries like SELECT stddev(stats_agg(price) OVER (PARTITION BY symbol ORDER BY time RANGE '15 min' PRECEDING) FROM stocks_real_time WHERE symbol = 'AAPL'
while significantly more performant than the equivalent Postgres operation (SELECT stddev(price) OVER (PARTITION BY symbol ORDER BY time RANGE '15 min' PRECEDING FROM stocks_real_time WHERE symbol = 'AAPL'
) can produce more error due to double precision arithmetic.
You can find the "real" value by using: SELECT stddev(price::numeric) OVER (PARTITION BY symbol ORDER BY time RANGE '15 min' PRECEDING FROM stocks_real_time WHERE symbol = 'AAPL'
which is reasonably performant, while also producing a smaller amount of error.
We'd like to reduce the error and have a performant moving aggregate mode, finding a better tradeoff by using higher precision floating point values or perhaps calling the numeric equivalents could be useful.
Notes from our discussion yesterday:
- Using a
TwoFloat
(pair off64
s) internally when in rolling aggregate mode increases accuracy by around a few orders of magnitude, but is ~30% slower to create (benchmarked usingpgbench
). Here's a query we were using to test the accuracy ofTwoFloat
compared tonumeric
and the current implementation - Using a Postgres
numeric
gives us the most accurate results possible (I use it as a source of truth for the correct values so I assume it is always exactly correct), but is a bit slower thanTwoFloat
.
When creating a stats aggregate in rolling aggregate mode we always convert it to a normal f64
-based aggregate in the final function, since the extra precision is much more useful for preserving accuracy in the transition function than in the functions that ultimately get called on the aggregate.
My current work to make the internal aggregate data generic over the data type, and add stats_agg_tf
which uses TwoFloat
s internally is on the sv/stats_agg-twofloat branch.
#599 added support for using TwoFloat
s internally when stats_agg
is in moving aggregate mode. I'll keep this issue open to track adding support for having stats_agg
be able to use numeric
to keep track of values as well.