
ENH: Add warning when data set size is large and lags is not directly set

Open cTatu opened this issue 2 years ago • 5 comments

Hi,

I was wondering if there is a way to make the unit root tests faster. I have multiple datasets, each consisting of hundreds of millions of samples. The computation takes hours and uses only 1 of the 160 cores I have available. I will need to run this multiple times in an iterative process, so waiting hours for a single test is infeasible. Could there be an easy way to enable parallelization, for example with ProcessPoolExecutor, or even GPU acceleration through CuPy?

Thank you

cTatu avatar Aug 21 '22 22:08 cTatu

A couple of things.

  1. DFGLS is a waste in large datasets. The reason DFGLS was created was low power in small samples. With 1,000+ observations there is no reason not to just use ADF.
  2. In both ADF and DFGLS it is essential to limit max_lags to something small when datasets are large. The default uses a value that grows with the sample size, so in very large datasets the effective max_lags is extremely large, which makes estimation slow. See the sketch after this list.
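
A minimal sketch of both options using arch.unitroot.ADF's documented `lags` and `max_lags` keyword arguments; the random-walk series here is just a stand-in for real data:

```python
import numpy as np
from arch.unitroot import ADF

rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(10_000_000))  # toy random walk

# Option 1: keep the automatic lag search but cap its upper bound.
adf_capped = ADF(y, max_lags=20)

# Option 2: skip the search entirely by fixing the lag length.
adf_fixed = ADF(y, lags=10)
print(adf_fixed.stat, adf_fixed.pvalue)
```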

bashtage avatar Aug 21 '22 23:08 bashtage

I just did a quick check and the big problem with large datasets is the autolag search. It fits the model for 0, 1, 2, ..., max_lags lags, which is slow. You can get large speedups by setting a specific number of lags, even if it is fairly high. A quick profile with 10,000,000 observations:

[screenshot: profiler output]
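
For context on why the search space explodes: statsmodels' `adfuller` derives its default bound from Schwert's (1989) rule of thumb, and arch's default grows with the sample size in the same way, so the snippet below is illustrative rather than arch's exact code:

```python
# Schwert's (1989) rule of thumb for the maximum lag length; shown only
# to illustrate how the default search space grows with n.
from math import ceil

def schwert_max_lags(nobs: int) -> int:
    return ceil(12.0 * (nobs / 100.0) ** 0.25)

print(schwert_max_lags(1_000))        # 22
print(schwert_max_lags(10_000_000))   # 214
```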

If you set lags it is much faster.

It is also multi-threaded when fitting the actual model since it relies on BLAS (you should use a good BLAS, either MKL or OpenBLAS - this will be automatic in most cases, unless you compile your own NumPy).
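
A quick way to check which BLAS backend you are actually linked against (threadpoolctl is a third-party helper, not part of arch):

```python
# Inspect the BLAS backend NumPy was built against and its thread count.
import numpy as np
from threadpoolctl import threadpool_info

np.show_config()  # prints BLAS/LAPACK build information
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])
```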

bashtage avatar Aug 22 '22 09:08 bashtage

Here is a comparison with 10,000,000 obs

```
In [5]: %timeit DFGLS(y, lags=10).stat
7.77 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit DFGLS(y).stat
1min 54s ± 3.94 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

The more observations you have, the larger the difference gets.

bashtage avatar Aug 22 '22 10:08 bashtage

Setting the lags solved the problem. The time was reduced by more than 20x! Thank you very much ✌️

cTatu avatar Aug 24 '22 10:08 cTatu

You are welcome. I am going to reopen this since it would be good to have some sort of warning in this case. Pretty sure this has come up before either here or on SO.

Could also default to a faster method in these cases, e.g., a simple-to-general search with a stopping rule.
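
As a rough sketch of what that could look like (not the library's implementation; it leans on ADF's `regression` property, which exposes the fitted statsmodels OLS results):

```python
# A sketch of a simple-to-general lag search with an early-stopping rule:
# grow the lag length from 0 and stop once the AIC has not improved for
# `patience` consecutive candidates.
import numpy as np
from arch.unitroot import ADF

def adf_simple_to_general(y, max_lags=50, patience=5):
    best_aic, best_lags, stale = np.inf, 0, 0
    for lags in range(max_lags + 1):
        aic = ADF(y, lags=lags).regression.aic
        if aic < best_aic:
            best_aic, best_lags, stale = aic, lags, 0
        else:
            stale += 1
            if stale >= patience:
                break  # stopping rule: AIC stopped improving
    return ADF(y, lags=best_lags)
```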

bashtage avatar Aug 24 '22 10:08 bashtage