cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

sql/stats: remove statistics forecasting time horizon

Open michae2 opened this issue 2 years ago • 2 comments

When generating statistics forecasts, we must pick a time at which to forecast. To make forecasts deterministic, this time must be based on only the existing collected statistics for the table (so it can't be the current time, for example). We calculate this time as:

most recent collection time + average time between auto stats

Effectively making the forecast time the expected next auto stats collection time for the table.

Out of an (over)abundance of caution, I added an upper bound of 1 week to this time calculation so that we wouldn't forecast too far into the future. But now that we're testing with real stats we see that large tables tend to have less frequent auto stats collections. It's difficult to say what a reasonable upper bound would be for all tables; small tables that change frequently might need new stats after an hour, while large tables that change infrequently might need new stats after a few months.

Rather than try to guess an upper bound that would be reasonable for all tables, I think we should remove the time horizon altogether. I think we can trust the average time between auto stats calculation to limit the forecast to something reasonable for the table, regardless of the rate of change and the size of the table.

Fixes: #86395

Release note: None

michae2 avatar Sep 22 '22 20:09 michae2

This change is Reviewable

cockroach-teamcity avatar Sep 22 '22 20:09 cockroach-teamcity

The big logictest is an obfuscated and edited version of the query from support case 1401. It's mostly the stats that make it so large. This is after removing most columns, so I'm not sure it can be shrunk further.

michae2 avatar Sep 22 '22 20:09 michae2

Ok, reduced the file by 2/5ths. In terms of lines it is now the 13th largest file at 18621 lines. In terms of bytes it is now the 36th largest file at 683k bytes. The testcase itself takes 1.1s to run. I think this is reasonable to check in (there are other larger testcases such as pkg/sql/importer/testdata/avro/stock-10000.bjson which is 2m).

PTAL.

michae2 avatar Sep 23 '22 20:09 michae2

Would it be possible to get an uncertainty estimate for the forecast as a function of how far in the future it is? That might be able to provide a more useful horizon. (I haven't looked at this code much, so there might already be something like this)

DrewKimball avatar Sep 23 '22 21:09 DrewKimball

Would it be possible to get an uncertainty estimate for the forecast as a function of how far in the future it is? That might be able to provide a more useful horizon. (I haven't looked at this code much, so there might already be something like this)

Do you have a specific uncertainty function you're thinking of? I think it would also have to be a function of the rate of change of the table, right?

Assuming we used it to find the time at which forecast uncertainty went above some constant threshold, I think this time would be similar to the expected-next-auto-stats-collection time. That is, roughly speaking, it would be sooner for tables that change rapidly and further out for tables that change slowly.

michae2 avatar Sep 23 '22 23:09 michae2

Good points! Thank you for the thoughtful discussion! And TFYR!

bors r=DrewKimball

michae2 avatar Sep 25 '22 19:09 michae2

Build succeeded:

craig[bot] avatar Sep 25 '22 21:09 craig[bot]

Ok, reduced the file by 2/5ths. In terms of lines it is now the 13th largest file at 18621 lines. In terms of bytes it is now the 36th largest file at 683k bytes. The testcase itself takes 1.1s to run. I think this is reasonable to check in (there are other larger testcases such as pkg/sql/importer/testdata/avro/stock-10000.bjson which is 2m).

Nice work!

cucaroach avatar Sep 26 '22 12:09 cucaroach