sql/stats: remove statistics forecasting time horizon
When generating statistics forecasts, we must pick a time at which to forecast. To make forecasts deterministic, this time must be based only on the existing collected statistics for the table (so it can't be the current time, for example). We calculate this time as:
most recent collection time + average time between auto stats
This effectively makes the forecast time the expected next auto stats collection time for the table.
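For concreteness, here's a minimal Go sketch of that calculation (the names are illustrative, not the actual identifiers in pkg/sql/stats; it assumes the collection times are sorted ascending with at least two entries):

```go
package stats

import "time"

// forecastTime sketches the calculation described above: the forecast time
// is the most recent collection time plus the average interval between the
// existing collections, i.e. the expected next auto stats collection time.
func forecastTime(collectionTimes []time.Time) time.Time {
	latest := collectionTimes[len(collectionTimes)-1]
	// Average interval between consecutive collections.
	avgInterval := latest.Sub(collectionTimes[0]) / time.Duration(len(collectionTimes)-1)
	// Forecast at the expected next auto stats collection time.
	return latest.Add(avgInterval)
}
```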
Out of an (over)abundance of caution, I added an upper bound of 1 week to this calculation so that we wouldn't forecast too far into the future. But now that we're testing with real stats, we see that large tables tend to have less frequent auto stats collections. It's difficult to say what a reasonable upper bound would be for all tables; small tables that change frequently might need new stats after an hour, while large tables that change infrequently might need new stats after a few months.
Rather than try to guess an upper bound that would be reasonable for all tables, I think we should remove the time horizon altogether. I think we can trust the average time between auto stats collections to limit the forecast to something reasonable for the table, regardless of its rate of change and size.
Fixes: #86395
Release note: None
The big logictest is an obfuscated and edited version of the query from support case 1401. It's mostly the stats that make it so large. This is after removing most columns, so I'm not sure it can be shrunk further.
Ok, reduced the file by 2/5ths. In terms of lines it is now the 13th largest file at 18621 lines. In terms of bytes it is now the 36th largest file at 683k bytes. The testcase itself takes 1.1s to run. I think this is reasonable to check in (there are other larger testcases, such as pkg/sql/importer/testdata/avro/stock-10000.bjson, which is 2m).
PTAL.
Would it be possible to get an uncertainty estimate for the forecast as a function of how far in the future it is? That might be able to provide a more useful horizon. (I haven't looked at this code much, so there might already be something like this)
Do you have a specific uncertainty function you're thinking of? I think it would also have to be a function of the rate of change of the table, right?
Assuming we used it to find the time at which forecast uncertainty went above some constant threshold, I think this time would be similar to the expected-next-auto-stats-collection time. That is, roughly speaking, it would be sooner for tables that change rapidly and further out for tables that change slowly.
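Roughly, as a purely hypothetical sketch (assuming uncertainty grows linearly with the table's rate of change; none of these names exist in the code):

```go
package stats

import (
	"math"
	"time"
)

// uncertaintyHorizon is a hypothetical sketch of the idea above: if forecast
// uncertainty grew roughly linearly with how far out we forecast, scaled by
// the table's rate of change, the horizon would be the time at which
// uncertainty crosses a fixed threshold. Fast-changing tables would get a
// shorter horizon, slow-changing tables a longer one.
func uncertaintyHorizon(rowsChangedPerHour, uncertaintyThreshold float64) time.Duration {
	if rowsChangedPerHour <= 0 {
		return time.Duration(math.MaxInt64) // effectively no horizon
	}
	hours := uncertaintyThreshold / rowsChangedPerHour
	return time.Duration(hours * float64(time.Hour))
}
```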
Good points! Thank you for the thoughtful discussion! And TFYR!
bors r=DrewKimball
Nice work!