
question about the biggest N

Open · stevenlis opened this issue on Jan 20, 2020 · 7 comments

Thanks for the package! I'm trying to use it in my study with a sample size around 26,000 (10 covariates). However, in the following paper:

Messy Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS:

bigKRLS can handle datasets up to approximately N = 14,000 on a personal machine before reaching the 8 GB cutoff

Thus, I'm concerned about whether I should continue. Does this mean the program will stop running if I fit a dataset with N > 14,000? I have a laptop with 16 GB RAM. Will it be OK?

stevenlis avatar Jan 20 '20 23:01 stevenlis

Thanks for your interest! When it runs out of RAM, bigKRLS should switch the computation to disk (i.e., using swap). Those calculations are considerably slower, though, and with 16 GB of RAM it is unlikely you would find the speed trade-off tolerable. There aren't too many hyperparameters, but they matter somewhat; I recommend fitting at N = 3,000 to benchmark your machine and then scaling up, keeping the quadratic memory footprint in mind.
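
For concreteness, a minimal sketch of that benchmarking step in R; y and X here are placeholders for your own outcome vector and covariate matrix, not objects supplied by the package:

library(bigKRLS)

# Time a fit on a random N = 3,000 subsample; runtime and memory
# footprint both grow roughly quadratically in N, so this gives
# a baseline to extrapolate from.
set.seed(2020)
idx <- sample(nrow(X), 3000)
timing <- system.time(fit <- bigKRLS(y = y[idx], X = X[idx, ]))
print(timing)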


rdrr1990 avatar Jan 21 '20 04:01 rdrr1990

I tried with N = 5,000, and save.bigKRLS() generated a folder of files totaling 1.5 GB. Assuming my total N = 25,000, I would need 1.5*(25000/5000)^2 = 37.5 GB of RAM to run the model, right?

stevenlis avatar Jan 21 '20 04:01 stevenlis

Close but not quite. Saving the full model output involves several N x N matrices (such as the variance-covariance matrix). I am away from my laptop, but I anticipate the output is on the order of 5N^2 entries (roughly five N x N matrices).
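
As a back-of-the-envelope projection under that quadratic-scaling assumption (the 1.5 GB figure is the N = 5,000 benchmark reported above):

# Saved output scales roughly with N^2, so the N = 5,000
# benchmark (1.5 GB) projects to the full sample as:
1.5 * (25000 / 5000)^2   # = 37.5 GB of output on disk
# Peak RAM during estimation can run higher still: a single
# 25,000 x 25,000 double-precision matrix is 25000^2 * 8 bytes,
# i.e. about 5 GB, and several such matrices are in play.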


rdrr1990 avatar Jan 21 '20 05:01 rdrr1990

When I ran the whole model with N around 26,000, I got the following error. Am I missing something? (Sorry for the low-fi image.) [screenshot: IMG_4865]

stevenlis avatar Jan 21 '20 15:01 stevenlis

@rdrr1990 any hint?

stevenlis avatar Jan 23 '20 02:01 stevenlis

Hi Steven,

I'm not quite certain about the source of the error, but it doesn't look like an issue with the bigKRLS package. I spent a little while digging through LAPACK's documentation, and it looks like the error is raised by LAPACK's eigenvalue solver: http://www.netlib.org/lapack/explore-html/d2/d8a/group__double_s_yeigen_gaeed8a131adf56eaa2a9e5b1e0cce5718.html. The error code specifically seems to come from this line:

IF( n.GT.0 .AND. vu.LE.vl ) info = -8

The condition means that the order of the matrix is greater than zero and the upper bound for the eigenvalue search (vu) is less than or equal to the lower bound (vl). Those bounds aren't something we set or even have the option to set (via Armadillo or Rcpp), so this is either a bug in Armadillo or a numerical/memory problem caused by an oversized input matrix. My guess is the latter, but it's difficult to be sure without a reproducible example. As an experiment, you might try running the bigKRLS estimation routine on a random subset of your data at a size that will clearly run (say, n = 10,000 or so), as sketched below. If the estimation routine runs, the issue is likely related to the size of the input matrix; if you see the same error, there's probably some other issue going on.
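
If it helps, here is a minimal sketch of that experiment (assuming, as above, that your outcome and covariates are in y and X):

library(bigKRLS)

# Diagnostic fit on a random n = 10,000 subset: if this completes
# cleanly while the full N ~ 26,000 run errors out, the input size
# is the likely culprit.
set.seed(1)
idx <- sample(nrow(X), 10000)
out <- tryCatch(bigKRLS(y = y[idx], X = X[idx, ]),
                error = function(e) conditionMessage(e))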

- Robert


rbshaffer avatar Jan 26 '20 04:01 rbshaffer

Hi @rbshaffer, thanks for the reply. I had assumed it was due to my sample size.

I've tried a sample of my dataset with more than N = 13,000, which had no issue at all. I will try it again and see if there is any way I can share the dataset.

stevenlis avatar Jan 26 '20 13:01 stevenlis