
Sigma clipping is painfully slow

Open mwcraig opened this issue 7 years ago • 11 comments

I don't know what the underlying performance issue is, but wow it is slow (clipping a stack of 25 4k x 4k images takes minutes).

mwcraig avatar Jun 04 '18 04:06 mwcraig

About 4.5 minutes, but it feels longer...

mwcraig avatar Jun 04 '18 04:06 mwcraig

I also noticed that it's very slow. Thanks for bringing this up (I kept forgetting to).

Could you provide a random dataset for profiling? It could be interesting depending on the data types and the strides of the dataset (the actual data itself is probably not important, except if it contains lots of unusual values like NaNs and so forth).

Also, because we're re-using astropy's sigma clipping, it would be handy to know which astropy version you were using.

One additional thing that could be a problem is RAM. Depending on the RAM and size of the arrays there may be lots of swapping going on.

Sorry for asking for so much information, but without the specifics it's hard to profile it (correctly).

MSeifert04 avatar Jun 04 '18 06:06 MSeifert04

This really brings up the issue that we need to add more benchmarks to astropy-benchmarks; currently there is nothing for the stats package. While a 4.5-minute test is not feasible as a benchmark, if it's not a local RAM issue I suspect it can be scaled down to something that still shows the slowness but runs quickly enough to be useful?
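Something as small as this asv-style sketch might already be enough (class and method names are just illustrative, not part of the existing suite):

import numpy as np
from astropy.stats import sigma_clip

class TimeSigmaClip:
    def setup(self):
        # A much smaller stack than the 25 x 4k x 4k case, same kind of problem
        np.random.seed(12345)
        self.data = np.random.normal(size=(10, 512, 512))

    def time_sigma_clip(self):
        # Clip along the stacking axis, as combine() does
        sigma_clip(self.data, sigma=5, axis=0)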

bsipocz avatar Jun 04 '18 12:06 bsipocz

Sure; I can also provide the data set I was working with if you want. It is just a bunch of dark frames. It is conceivable the bottleneck is in the combining, not the clipping, I suppose.

mwcraig avatar Jun 04 '18 14:06 mwcraig

Probably related, @larrybradley is working on speeding up the astropy one: astropy/astropy#7478

SaOgaz avatar Jun 04 '18 15:06 SaOgaz

Very related, thanks for mentioning that, @SaOgaz. And, of course, thanks to @larrybradley for working on it!

mwcraig avatar Jun 04 '18 15:06 mwcraig

I haven't had time to do much profiling, but here is an example of a set of calls that takes several minutes to run (for the combine part). Adding in some outlying pixels drives up the run time (it's about 30 seconds faster without them).

There are a number of things here that can lead to a speedup, I think. Even without sigma clipping, it takes close to a minute to combine the images.

Away from keyboard for a week, but will take another look when I'm back...

import numpy as np
from astropy.io import fits
from ccdproc import combine, ImageFileCollection
from astropy.stats import median_absolute_deviation

size = [4096, 4109]
n_images = 25

base_name = 'test-combine-{num:03d}.fits'

for num in range(n_images):
    data = np.random.normal(size=size)
    # Now add some outlying pixels so there is something to clip
    n_bad = 50000
    bad_x = np.random.randint(0, high=size[0] - 1, size=n_bad)
    bad_y = np.random.randint(0, high=size[1] - 1, size=n_bad)
    data[bad_x, bad_y] = np.random.choice([-1, 1], size=n_bad) * (10 + np.random.rand(n_bad))
    hdu = fits.PrimaryHDU(data=data)
    hdu.header['for_prof'] = 'yes'
    hdu.header['bunit'] = 'adu'
    hdu.writeto(base_name.format(num=num), overwrite=True)

ic = ImageFileCollection('.', '*')
fils = ic.files_filtered(for_prof='yes')
combine(fils, output_file='foo12.fit', sigma_clip=True,
        sigma_clip_low_thresh=5, sigma_clip_high_thresh=5,
        sigma_clip_func=np.ma.median,
        sigma_clip_dev_func=median_absolute_deviation)

Edit 2018-06-23: move misplaced parentheses in hdu.writeto

mwcraig avatar Jun 10 '18 18:06 mwcraig

@mwcraig For your sigma-clipping stdfunc (sigma_clip_dev_func) you should probably use mad_std (http://docs.astropy.org/en/stable/api/astropy.stats.mad_std.html) instead of median_absolute_deviation. mad_std is a robust standard deviation estimator. For reference, see e.g. https://en.wikipedia.org/wiki/Median_absolute_deviation#Relation_to_standard_deviation. Just an FYI -- it has no effect on performance. :smile:
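To make the relation concrete, a quick standalone check (nothing ccdproc-specific):

import numpy as np
from astropy.stats import mad_std, median_absolute_deviation

data = np.random.normal(0, 2.0, size=100000)
print(median_absolute_deviation(data))           # ~1.35 (= 2.0 * 0.6745)
print(mad_std(data))                             # ~2.0, a robust estimate of sigma
print(1.4826 * median_absolute_deviation(data))  # essentially equal to mad_std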

larrybradley avatar Jun 15 '18 17:06 larrybradley

One other note: If you want to clean up the combine API a bit, you could use the SigmaClip object instead of having a bunch of keywords to control the sigma clipping. That's exactly the use case for why I added the SigmaClip class.

i.e. instead of this:

combine(fils, output_file='foo12.fit', sigma_clip=True, sigma_clip_low_thresh=5, sigma_clip_high_thresh=5, sigma_clip_func=np.ma.median, sigma_clip_dev_func=median_absolute_deviation)

you could do this:

from astropy.stats import SigmaClip, mad_std

sigclip = SigmaClip(sigma_lower=5, sigma_upper=5, cenfunc=np.ma.median, stdfunc=mad_std)
combine(fils, output_file='foo12.fit', sigma_clip=sigclip)

where sigma_clip=None would mean do not perform sigma clipping.

If interested, I could put in a PR.

larrybradley avatar Jun 15 '18 17:06 larrybradley

One thing that I noticed during my tests is that we actually do most operations on the stacked arrays in Combiner along the first axis. For C-contiguous arrays that's actually the worst axis to operate on, because elements along the first axis have the biggest "step" (stride) between them in memory. However, stacking along the first axis is faster than stacking along the last axis (which would be the fastest axis to "reduce" along).

I'm not sure how much that matters, because we both stack and reduce; however, counting the number of operations, the last axis would be better in the cases where we do clipping:

  • We stack once
  • We reduce once for the baseline, once for the deviation, then we reduce once for the "combined" value, once for the "combined" mask and once for the "combined" uncertainty.

However, processors are typically very clever about what to pre-fetch, so it might not be too bad that we reduce along the "worst" (memory-layout-wise) axis.
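A standalone sketch that shows the layout effect (not Combiner code, just a timing illustration):

import numpy as np
from timeit import timeit

stack = np.random.normal(size=(25, 1024, 1024))  # C contiguous

# Reducing along axis 0 strides across a whole image between elements;
# reducing along the last axis reads contiguous memory.
t_first = timeit(lambda: stack.mean(axis=0), number=10)
t_last = timeit(lambda: stack.mean(axis=-1), number=10)
print(t_first, t_last)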

Then I played a bit with possible optimizations. It's really tricky, though, because we have to "assume" things: for example, it's easy to speed up the clipping by a factor of 3-20 (depending on the shape of the stack) with a numba function (or a C extension), but that assumes the stack is 3D and that the clip functions are either (np.ma.mean and np.ma.std) or (np.ma.median and astropy.stats.median_absolute_deviation). And do we really want to include numba (a heavy dependency if one doesn't already have it) or code that needs compiling (particularly problematic for Windows users who don't use Anaconda) just for some particular combinations, even though they are probably the most common ones?
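For reference, the kind of specialized kernel I have in mind looks roughly like this sketch -- it hard-codes a 3D stack and a single mean/std pass (real sigma clipping iterates), and is illustrative only, not something ccdproc ships:

import numpy as np
from numba import njit

@njit
def sigma_clip_mask(stack, low, high):
    # Sketch only: mask pixels more than low/high deviations from the
    # per-pixel center along the stacking axis of a 3D stack.
    n, ny, nx = stack.shape
    mask = np.zeros(stack.shape, dtype=np.bool_)
    for y in range(ny):
        for x in range(nx):
            col = stack[:, y, x]
            center = col.mean()
            dev = col.std()
            for i in range(n):
                if col[i] < center - low * dev or col[i] > center + high * dev:
                    mask[i, y, x] = True
    return mask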

MSeifert04 avatar Jul 07 '18 16:07 MSeifert04

I've had a chance to do a bit of profiling; the notebook I used to generate the profiles is in this gist.

The notebook combines 25 images, each roughly 2k x 2k, averaging them with sigma clipping applied before combining.

The upshot is that numpy's masked median is really, really slow. Two thirds of the time in the combination is spent in sigma clipping; each of the three medians it does takes roughly 7 seconds. Of that, a surprising amount of time (10 seconds) is spent in numpy MaskedArray's __getitem__.

A screenshot from SnakeViz is below. Following @larrybradley's approach of using the nan* functions speeds things up some, and bottleneck does even better.

[Screenshot: SnakeViz profile, profile_fun_mamedian_dev_mask_med_abs_dev]
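A minimal standalone version of that comparison (assuming bottleneck is installed) would be something like:

import numpy as np
import bottleneck as bn
from timeit import timeit

data = np.random.normal(size=(25, 1024, 1024))
mask = data > 4                        # pretend a few pixels got clipped
masked = np.ma.array(data, mask=mask)
nandata = np.where(mask, np.nan, data)

print(timeit(lambda: np.ma.median(masked, axis=0), number=1))   # slow
print(timeit(lambda: np.nanmedian(nandata, axis=0), number=1))  # faster
print(timeit(lambda: bn.nanmedian(nandata, axis=0), number=1))  # fastest here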

mwcraig avatar Aug 09 '18 03:08 mwcraig