astropy
Better explain masked arrays in sigma_clip
For many users, sigma_clip may be where they encounter masked arrays for the first time. It would be nice to improve the docstring for sigma_clip:
http://docs.astropy.org/en/stable/api/astropy.stats.sigma_clip.html#astropy.stats.sigma_clip
to include a few lines on what a masked array is (including a printed view), just before we mention that Numpy ufuncs often understand them. Alternatively, we could link to http://docs.scipy.org/doc/numpy/reference/maskedarray.html but that is a bit extensive.
:+1: We have some internal users who have been repeatedly confused about the proper use of masked arrays.
Also, I think linking to the Numpy docs makes sense no matter what, but a more straightforward "hands on" introduction might be nice too.
Here's a short introduction to masked arrays which could go in the docstring, the astropy.stats introduction, or both.
The function returns a masked array, a type of Numpy array used for handling missing or invalid entries. Masked arrays retain the original data but also store another Boolean array of the same shape where True indicates that the value is masked. Most Numpy ufuncs will understand masked arrays and treat them appropriately. For example, consider the following dataset with a clear outlier:
>>> import numpy as np
>>> from astropy.stats import sigma_clip
>>> x = np.array([1, 0, 0, 1, 99, 0, 0, 1, 0])
The mean is skewed by the outlier:
>>> x.mean()
11.333333333333334
Sigma-clipping (3 sigma by default) returns a masked array, and so functions like mean will ignore the outlier:
>>> clipped = sigma_clip(x)
>>> clipped
masked_array(x = [1 0 0 1 -- 0 0 1 0],
mask = [False False False False True False False False False],
fill_value = 999999)
>>> clipped.mean()
0.375
If you need to access the original data directly, you can use the .data property. Combined with the .mask property, you can get the original outliers, or the values that were not clipped:
>>> outliers = clipped.data[clipped.mask]
>>> outliers
array([99])
>>> valid = clipped.data[~clipped.mask]
>>> valid
array([1, 0, 0, 1, 0, 0, 1, 0])
For more information on masked arrays, see the Numpy documentation.
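As a complement to the introduction above, here is a minimal sketch that builds a masked array by hand with plain Numpy, without sigma_clip. The variable names (values, mask, marr) and the fill value of -1 are only illustrative choices, not anything taken from the astropy docs.

```python
import numpy as np

values = np.array([1, 0, 0, 1, 99, 0, 0, 1, 0])

# Boolean mask of the same shape; True means "ignore this entry".
mask = np.zeros(values.shape, dtype=bool)
mask[4] = True  # hide the outlier by hand

marr = np.ma.masked_array(values, mask=mask)

print(marr.mean())        # mean over the unmasked entries only
print(marr.count())       # number of unmasked entries
print(marr.compressed())  # plain 1-D ndarray of the unmasked entries
print(marr.filled(-1))    # plain ndarray with masked entries replaced by -1
```

compressed() and filled() are handy when a downstream function does not understand masked arrays and needs an ordinary ndarray.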
Would it be worthwhile including a mention of how NaNs interact with this function? I think a key advantage over the scipy version is that it handles NaN values and masks them automatically.
@astrofrog If this issue is still open, can I work on it?
@shwetamore1295 - yes!
@taldcroft @embray @astrofrog Please review my commit. I have explained the masked array in sigma_clip.py
I think the intro to masked arrays that @swt30 wrote should go in the user guide somewhere, in some form. The docs specifically for sigma_clip should just link to that, as should other functions in Astropy that employ masked arrays. As there are several such functions, I don't know that the astropy.stats docs are necessarily the best place for it (however, it could still use examples from astropy.stats).
@AMAN3003 Can you open a pull request for your commit? Then we should be able to comment on it. However, as @embray says, it might be better to add it to the general documentation or to a separate page within the documentation dealing with how masked arrays are used.
Within sigma_clip, is there a straightforward way to mask n-dimensional arrays? That is, without losing the dimensions of the array, how could one apply sigma_clip to mask outliers in a 2D or 3D array and get an output array with the same shape as the input but with the outliers masked?
Hello! Happy to work on this but I'm new to this project (and open source😅). Could you point me to where you want the explanation exactly?
@astrofrog @crawfordsm @taldcroft @embray I have added the PR for this issue; please have a look.