
Dealing with a large data set

londumas opened this issue 6 years ago • 1 comment

I am trying to run on a large data set, ~200,000 eBOSS spectra, and stumbled upon a memory issue. What would be the best strategy to deal with that? Is there a float32 option, or should I split the spectra I am looking at in half along lambdaRF, compute each half separately, and stitch the results back together as best I can afterwards?

INFO: Starting EMPCA
       iter        R2             rchi2
Traceback (most recent call last):
  File "<HOME>/redvsblue/bin//redvsblue_compute_PCA.py", line 205, in <module>
    model = empca.empca(pcaflux, weights=pcaivar, niter=args.niter, nvec=args.nvec)
  File "<HOME>/Programs/sbailey/empca/empca.py", line 307, in empca
    model.solve_eigenvectors(smooth=smooth)
  File "<HOME>/Programs/sbailey/empca/empca.py", line 142, in solve_eigenvectors
    data -= np.outer(self.coeff[:,k], self.eigvec[k])    
  File "<HOME>/.local/lib/python3.6/site-packages/numpy/core/numeric.py", line 1203, in outer
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
MemoryError
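
For scale: the failing line materializes the full (nobs, nwave) outer product as a float64 temporary before the in-place subtraction. A back-of-envelope estimate, assuming ~1,000 rest-frame wavelength bins (hypothetical; only the ~200,000 spectra figure is stated above):

    nobs, nwave = 200_000, 1_000
    print(nobs * nwave * 8 / 1e9)  # ~1.6 GB per float64 temporary,
                                   # allocated for every eigenvector update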

londumas commented on Mar 11, 2019

The code would need to be updated in several places to have the calculation stay in float32 if the inputs are float32, e.g. line 204:

            mx = np.zeros(self.data.shape)

to

            mx = np.zeros_like(self.data)
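
For illustration, a minimal sketch of the dtype behavior (the array shape here is made up; the point is only np.zeros vs. np.zeros_like):

    import numpy as np

    data = np.ones((200_000, 1_000), dtype=np.float32)  # ~0.8 GB

    mx = np.zeros(data.shape)  # dtype defaults to float64: ~1.6 GB
    mx = np.zeros_like(data)   # inherits float32 and shape: ~0.8 GB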

A PR like that would be welcome, though in general I'm suspicious about the stability of single-precision floating-point calculations. Alternatives to consider:

  • Run on NERSC Cori with 128 GB/node (I know @londumas has access to that machine)
  • Run on a subset of the input data and cross-check the fits on the remainder. I'm not sure that going from 100k to 200k input quasars will really give you that much more information, and holding out 100k of them can be a useful cross-check on overfitting anyway.
  • Run on a subset of the data to develop an initial model, then iteratively add data that are poorly fit by that model, i.e. bring in data that cover phase space missed by the original subset while not wasting memory on data that are already well described (see the sketch after this list). Beware of any interpretation of the relative eigenvectors in that case, since your training set isn't representative of your full inputs, though that may be fine for your case.
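
A minimal sketch of that last option, reusing the pcaflux/pcaivar arrays from the traceback above; the subset size, chi2 threshold, and the unweighted projection are illustrative assumptions, not empca API:

    import numpy as np
    import empca

    nsub = 50_000  # illustrative subset size
    model = empca.empca(pcaflux[:nsub], weights=pcaivar[:nsub],
                        niter=10, nvec=5)

    # Score the remaining spectra against the subset model: project onto
    # the eigenvectors (a rough unweighted projection; empca itself solves
    # weighted coefficients) and form a weighted chi2 per spectrum.
    flux, ivar = pcaflux[nsub:], pcaivar[nsub:]
    coeff = flux.dot(model.eigvec.T)
    recon = coeff.dot(model.eigvec)
    chi2 = np.sum(ivar * (flux - recon)**2, axis=1) / flux.shape[1]

    # Refit with the subset plus only the poorly described spectra.
    bad = chi2 > 1.5  # illustrative threshold
    model = empca.empca(np.vstack([pcaflux[:nsub], flux[bad]]),
                        weights=np.vstack([pcaivar[:nsub], ivar[bad]]),
                        niter=10, nvec=5)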

I think any of those would be better than stitching together fits from different redshift ranges or different wavelength ranges.

sbailey commented on Mar 14, 2019