netcdf4-python
Surprising type of nc.variables[foo][:] (ndarray or MaskedArray depending on its data)
Hi all,
For me it is surprising that the return value of nc.variables[foo][:] is either an ndarray or a MaskedArray depending on the data content: a MaskedArray is returned whenever there is any "missing" entry in the data, an ndarray otherwise. Even more surprisingly, this data-dependent rule also applies to slices:
import netCDF4
import numpy as np

print(netCDF4.__version__)

with netCDF4.Dataset('test.nc', 'w') as nc:
    dim = 'test'
    nc.createDimension(dim, size=10)
    # note: with a positive integer fill_value, values >= fill_value are
    # masked as well (netCDF valid-range convention), as the output shows
    var = nc.createVariable('test-var', int, dim, fill_value=5)
    var[:] = np.arange(10)
    print('var[:]', var[:], type(var[:]))
    print('var[:5]', var[:5], type(var[:5]))
    print('var[:6]', var[:6], type(var[:6]))
output:
1.3.1
var[:] [0 1 2 3 4 -- -- -- -- --] <class 'numpy.ma.core.MaskedArray'>
var[:5] [0 1 2 3 4] <class 'numpy.ndarray'>
var[:6] [0 1 2 3 4 --] <class 'numpy.ma.core.MaskedArray'>
At first glance it might seem convenient to get a MaskedArray only if there is any missing data, but it confuses me very much that one slice has a different type (and behavior) than another. And the same holds for the whole [:]: imagine one day-file with missing entries, but another day without any.
If one likes to work with MaskedArrays, this means that one always has to check the type and add an all-False mask in case there are no missing entries.
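For example, a small normalization helper (a hypothetical name, just to illustrate the boilerplate this forces on the caller) could look like this:

import numpy as np

def as_masked(data):
    # illustrative helper (not part of netCDF4): ensure a MaskedArray,
    # adding an all-False mask when the input is a plain ndarray
    if isinstance(data, np.ma.MaskedArray):
        return data
    return np.ma.MaskedArray(data, mask=np.zeros(data.shape, dtype=bool))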
Is there any chance to make the return of var[:] more consistent and less surprising?
A) One could always return MaskedArrays, as long as auto-masking is turned on.
But this would introduce MaskedArrays into code that was never aware of having any missing values in its input data, and this would lead to very undesired results.
Before: the read data did not contain any missing values, so it was returned as an ndarray called old.
Now: the read data would be returned as a MaskedArray called new with an all-False mask.
Imagine this old code printing [2]:
old = np.arange(3)
old_log = np.log(old - 1)             # log(-1) -> nan, log(0) -> -inf
print(old[np.isfinite(old_log)])      # neither is finite, so this prints [2]
But with a MaskedArray, the result would be [0, 1, 2]:
new = np.ma.MaskedArray(np.arange(3))
new_log = np.log(new - 1)             # log(-1) and log(0) come back masked, not nan/-inf
print(new[np.isfinite(new_log)])      # the masked entries slip through the filter: [0 1 2]
B) Or one could always return a MaskedArray as long as auto-masking is turned on, and turn auto-masking off by default.
C) Or, as an extension of B): auto-masking is turned off for variables which have no specific FillValue (or valid_range etc.) set in the netcdf file. I assume that a file author expecting no missing values normally does not set a FillValue; therefore, programs reading these files do not expect any missing data. But authors aware of missing data in their files probably set an explicit FillValue.
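A reader-side sketch of what C would mean in practice, assuming the per-variable set_auto_mask method is used to opt out when no explicit fill metadata is present:

import netCDF4

with netCDF4.Dataset('test.nc') as nc:  # file from the example above
    var = nc.variables['test-var']
    # option C, approximated on the reader side: only keep auto-masking
    # when the file author set fill metadata explicitly
    if not any(hasattr(var, a) for a in ('_FillValue', 'missing_value', 'valid_range')):
        var.set_auto_mask(False)
    data = var[:]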
I think I would prefer B or C. Personally I would go for B, because I don't like working with MaskedArrays, but I can imagine B breaking code that depends on reading MaskedArrays (so having at least one missing value).
I'm interested in how others think about this issue.
In retrospect, I perhaps should not have made this the default behavior. I don't think it's wise to change it now though, since too much code relies on it. However, you can change it so that numpy arrays are always returned by using the set_auto_mask Dataset method. For example:
nc = Dataset('test.nc')
nc.set_auto_mask(False)
will cause all data to be returned as numpy arrays. Additionally,
nc.set_auto_scale(False)
will suppress the auto-rescaling of data packed as short integers.
nc.set_auto_maskandscale(False)
will do both.
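Putting this together with the file from the first example, a minimal round trip might look like:

from netCDF4 import Dataset

nc = Dataset('test.nc')
nc.set_auto_mask(False)             # no masking: fill values show up as raw numbers
data = nc.variables['test-var'][:]
print(type(data))                   # <class 'numpy.ndarray'>
nc.close()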
Why is it so bad to get a masked array back when the slice contains missing data, and a numpy array when it does not? numpy.ma is pretty much a drop-in replacement for numpy.ndarray - is there a particular situation where getting a masked array instead of a numpy array (or vice versa) causes a problem?
Why is it so bad to get a masked array back when the slice contains missing data, and a numpy array when it does not?
I'm not the sharpest tool in the toolbox, but for me this means that everywhere I read data from a netcdf variable, I have to do mask checking to see if I need to deal with masked data or not in my code logic. So I have to add code like

if isinstance(my_array, np.ma.MaskedArray):
    ...  # handle the masked case separately
As I pointed out in my example, MaskedArrays are unfortunately not a drop-in replacement for arrays, at least in these edge cases that rely on functions returning nan for invalid input. I could imagine that np.log(-1) -> np.nan were undefined behavior, but I checked the docs and it is actually noted:
For real-valued input data types, log always returns real output. For each value that cannot be expressed as a real number or infinity, it yields nan and sets the invalid floating point error flag.
Of course I can use @akrherz's if-case or always set_auto_mask(False), but I would prefer changing the default behavior of this pretty useful library to make it less surprising. (Actually, I was surprised the first time I realized that netCDF4 had started to apply the scale and offset after some update some years ago.)
I don't think it's wise to change it now though, since too much code relies on it.
What about my option C? Turning auto_mask off if there is no explicit _FillValue set in the data.
Honestly, I have no idea if any, or how many, netcdf files mark missing entries with the default FillValue. From my experience I don't know of any, but this could of course be a misleading sample. So instead of falling back to the default FillValue: if there is no explicit FillValue, use no FillValue, which results in no detected missing entries, which results in returning an array as in the old logic.
Second point: is there really code that relies on getting an array for a complete slice and a MaskedArray for an incomplete slice?
I totally agree that it is always risky to change default behavior, because old code might rely exactly on this behavior. But it is good to have this discussion to elucidate all pros and cons of probable side effects, and whether they are real side effects.
There is always a _FillValue - if it's not set explicitly, then the C library uses the type-dependent default value. If you create a variable and don't write any data to it, then read it back it will always be masked.
The only potential option I see for changing the default behavior is to always return a masked array.
There is always a _FillValue - if it's not set explicitly, then the C library uses the type-dependent default value. If you create a variable and don't write any data to it, then read it back it will always be masked.
Yes and no. In the current implementation, non-filled entries are populated with the default fill value and are interpreted as such, but the _FillValue attribute is not defined:
import netCDF4
import numpy as np

print(netCDF4.__version__)

with netCDF4.Dataset('test_without_fillvalue.nc', 'w') as nc:
    nc.createDimension('dim', size=10)
    var = nc.createVariable('test', 'f8', 'dim')
    print("without fill value `hasattr(var, '_FillValue')`:", hasattr(var, '_FillValue'))
    print("without fill value `nc.variables['test'][:]`:", nc.variables['test'][:])

with netCDF4.Dataset('test_without_fillvalue.nc', 'r') as nc:
    print("without fill value `hasattr(nc.variables['test'], '_FillValue')`:", hasattr(nc.variables['test'], '_FillValue'))
    print("without fill value `nc.variables['test'][:]`:", nc.variables['test'][:])

with netCDF4.Dataset('test_with_fillvalue.nc', 'w') as nc:
    nc.createDimension('dim', size=10)
    var = nc.createVariable('test', 'f8', 'dim', fill_value=1.)
    print("with fill value `hasattr(var, '_FillValue')`:", hasattr(var, '_FillValue'))
    print("with fill value `nc.variables['test'][:]`:", nc.variables['test'][:])

with netCDF4.Dataset('test_with_fillvalue.nc', 'r') as nc:
    print("with fill value `hasattr(nc.variables['test'], '_FillValue')`:", hasattr(nc.variables['test'], '_FillValue'))
    print("with fill value `nc.variables['test'][:]`:", nc.variables['test'][:])

with netCDF4.Dataset('test_with_default_fillvalue.nc', 'w') as nc:
    nc.createDimension('dim', size=10)
    var = nc.createVariable('test', 'f8', 'dim', fill_value=netCDF4.default_fillvals['f8'])
    print("with default fill value `hasattr(var, '_FillValue')`:", hasattr(var, '_FillValue'))
    print("with default fill value `nc.variables['test'][:]`:", nc.variables['test'][:])
Output:
1.3.1
without fill value `hasattr(var, '_FillValue')`: False
without fill value `nc.variables['test'][:]`: [-- -- -- -- -- -- -- -- -- --]
without fill value `hasattr(nc.variables['test'], '_FillValue')`: False
without fill value `nc.variables['test'][:]`: [-- -- -- -- -- -- -- -- -- --]
with fill value `hasattr(var, '_FillValue')`: True
with fill value `nc.variables['test'][:]`: [-- -- -- -- -- -- -- -- -- --]
with fill value `hasattr(nc.variables['test'], '_FillValue')`: True
with fill value `nc.variables['test'][:]`: [-- -- -- -- -- -- -- -- -- --]
with default fill value `hasattr(var, '_FillValue')`: True
with default fill value `nc.variables['test'][:]`: [-- -- -- -- -- -- -- -- -- --]
ncdump:
netcdf test_without_fillvalue {
dimensions:
    dim = 10 ;
variables:
    double test(dim) ;

// global attributes:
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
data:

    test = _, _, _, _, _, _, _, _, _, _ ;
}

netcdf test_with_fillvalue {
dimensions:
    dim = 10 ;
variables:
    double test(dim) ;
        test:_FillValue = 1. ;

// global attributes:
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
data:

    test = _, _, _, _, _, _, _, _, _, _ ;
}

netcdf test_with_default_fillvalue {
dimensions:
    dim = 10 ;
variables:
    double test(dim) ;
        test:_FillValue = 9.96920996838687e+36 ;

// global attributes:
    :_NCProperties = "version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.18" ;
data:

    test = _, _, _, _, _, _, _, _, _, _ ;
}
The test_without_fillvalue file does not contain an explicit fill value; this is tested with hasattr(). The test_with_fillvalue file contains an explicit fill value, as one would expect. And test_with_default_fillvalue uses the default fill value explicitly, but in contrast to the first case, this value is explicitly saved to the file.
So, at least as far as I can test it, there is a difference between an implicit use of the default fill value and an explicit one.
The point is that whether or not it is set explicitly, there is an expectation that values equal to the _FillValue are missing.
But do people expect that values equal to the default fill value are actually missing? Do people expect any missing values at all if _FillValue is not set explicitly?
That's something I don't know.
From my experience: if there are missing values, some _FillValue is set. Or people fill missing values with NaN but don't set an explicit FillValue. (In the latter case, the auto mask doesn't mask out these NaNs anyway, since they are different from the default fill value.)
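A quick sketch of that last remark (file name hypothetical): a NaN written without an explicit _FillValue is not masked on read, because it differs from the default fill value.

import netCDF4
import numpy as np

with netCDF4.Dataset('nan_test.nc', 'w') as nc:
    nc.createDimension('d', 3)
    v = nc.createVariable('v', 'f8', 'd')   # no explicit _FillValue
    v[:] = [1.0, np.nan, 3.0]
    print(nc.variables['v'][:])             # the NaN is not masked out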
Yes and yes. _FillValue is meant for pre-filling as-yet-unwritten data; missing_value is meant to indicate invalid or missing data. See https://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html.
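A hedged illustration of that distinction (file name, variable names, and values are made up for the example):

import netCDF4
import numpy as np

with netCDF4.Dataset('conventions.nc', 'w') as nc:
    nc.createDimension('dim', 5)
    # _FillValue: what as-yet-unwritten data is pre-filled with
    var = nc.createVariable('t', 'f4', 'dim', fill_value=np.float32(-9999.0))
    # missing_value: marks written-but-invalid data
    var.missing_value = np.float32(-8888.0)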
If the need is to know exactly what kind of array the module will return, then I propose we always return a masked array by default. If you always want a numpy array, you can use set_auto_mask(False). The only change to the current behavior would then be that a masked array would be returned even if the slice does not contain any masked values.
I am strongly in support of type-stable interfaces. It should not be data-dependent whether indexing returns a NumPy array or a masked array.
If you always want a numpy array, you can use set_auto_mask(False).
I like synchronizing this behavior with set_auto_mask().
However, it's also worth considering whether to change the default behavior to not do masking and always return NumPy arrays instead.
Don't think I'm ready to change the default yet - still seems to me that masked arrays are what most people want. I will go ahead and merge pull request #787.
Does the #787 patch convert everything into a MaskedArray, even a coordinate vector? Under most circumstances, I expect a coordinate vector to have no missing values. Or does it only return a slice as a masked array if there is any missing value in the whole variable?
Most of my code actually relies on not having any missing values in a variable, so in most cases I expect to get an array. As I tried to show with my example in the second post, there is a difference in data handling when it comes to finiteness/NaN and boolean indexing. So it is very likely that getting a masked array instead would break my code in these cases.
I also see the point that changing the default behavior to not do masking breaks code that works with MaskedArrays and relies on reading in such data.
So both extremes will break existing code, and I agree that changing the default is not an easy thing.
Therefore I started thinking about some middle way, but this also seems to clash with current code.
The only chance might be to start a netCDF4.Dataset2 with new default behavior built on all the lessons learned from the current code, but that's probably a long way off...
All variable data is now returned as a masked array by default, even coordinate variables.
You could override the default behaviour with your own subclass, for example:

import netCDF4

class myDataset(netCDF4.Dataset):
    def __init__(self, *args, **kwargs):
        super(myDataset, self).__init__(*args, **kwargs)
        self.set_auto_mask(False)
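Usage would then be unchanged apart from the class name (file and variable names taken from the earlier example):

nc = myDataset('test.nc')
data = nc.variables['test-var'][:]  # plain numpy.ndarray
nc.close()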
I would consider changing the default to return numpy arrays, it's just that I'm not yet convinced that would be the best choice. It still seems to me that most users would prefer masked arrays, although I could be convinced otherwise.
Hi Jeff, as most of my code does not expect to read in any data as MaskedArray at all, I am not very happy with your recent merge; it will break my code. (But because I am now aware of this change, it would probably take me only a little time to insert lots of set_auto_mask(False); people who are not aware of this change would probably spend a lot of time debugging their code after updating netCDF4 in the future.)
I have no idea what the preference of the majority of users is, but I would say that one should avoid changes that are very likely to break existing code in a subtle way.
@MeraX can you provide an example of how this breaks your code? I think this change is good as it follows the 'explicit rather than implicit' coding paradigm. How does your code work with missing data? Does it always check to see if the array contains the filled values? I don't wish to argue, just trying to understand the issue :)
@akrherz I posted my answer (https://github.com/Unidata/netcdf4-python/pull/787#issuecomment-376213497) into the pull request's thread, because I think it is more related to that change than to my initial issue posted here.
I recently updated my netCDF4 module to 1.5.1.2 and was surprised by this default behavior of always returning masked arrays, since it broke my code. Googling led me here, and I can add to the use cases where always having a masked array is not desirable/breaks code. My task has several steps, and (say) in step 1 I'm reading some coordinate variables from a netcdf file, which (for various reasons) are needed in step 2, but can't be read in step 2 from the same netcdf file. Since they're uncomplicated scalar coordinate variables, I've been using numpy's 'save' method on arrays to save them to files in step 1, which are then read in step 2. Yes, 'save' is not portable, but all of this is happening on the same types of compute platforms, with the exact same numpy library, so portability is a non-issue. This workflow is now broken because numpy's save/load does not handle masked arrays.
Sorry to hear this change caused you grief. I would recommend using the Dataset.set_auto_mask method to always get non-masked numpy arrays.
What does set_auto_mask do to variables in netcdf files that actually have masked elements? If a variable in the file has some masked elements (i.e., filled with _FillValue), then I'd like that variable to be read in as a masked array. Does set_auto_mask(False) still do that?
set_auto_mask(False) will always return non-masked numpy arrays. To restore the previous default behavior (only return masked arrays if there are missing elements), use set_always_mask(False).
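In other words (a sketch, reusing the test.nc file from the top of the thread):

from netCDF4 import Dataset

nc = Dataset('test.nc')
nc.set_always_mask(False)           # masked array only when values are actually missing
data = nc.variables['test-var'][:]  # ndarray if the slice contains no fill values
nc.close()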
I had a somewhat similar issue: I am writing data to a nc file using netCDF4-1.5.6 and it is always a masked array. I tried to use
set_auto_mask(False) or set_always_mask(False)
as an attribute for the variable and for the dataset, but it does not work. Any help?
set_auto_mask(False) only affects reading. What do you mean by "you are writing data and it is always a masked array"?
I am reading data from .dat files, do some calculations, and store the results in a nc file. Here is a sample of what my code looks like:

import netCDF4 as nc
import numpy as np

ncfile = nc.Dataset("test.nc", "w", format="NETCDF3_CLASSIC")
vardim = ncfile.createDimension("vardim", 5)
var = ncfile.createVariable("var", "f4", ("vardim",))

a = np.arange(5)
var[:] = a
When I read the output test.nc file using netCDF4 again in another script, or using ncdump, the data is masked.
That is because you need to set Dataset.set_auto_mask(False) when you re-open the file.
Thanks. Now it is not masked in the python code, but it is still masked for ncdump. Is there a way to write the data properly to the NC file? It is not masked, but all of the elements have the _FillValue, as shown:
> array([9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36, 9.96921e+36],
> dtype=float32)
Then I used set_fill_off(), but then the array contains only zeros!
btw I am using netCDF4-1.5.6
Okay, I found that I did not close the file after writing to it. I needed to add
ncfile.close()
Now it works.
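For completeness, the earlier write snippet with the missing close added:

import netCDF4 as nc
import numpy as np

ncfile = nc.Dataset("test.nc", "w", format="NETCDF3_CLASSIC")
vardim = ncfile.createDimension("vardim", 5)
var = ncfile.createVariable("var", "f4", ("vardim",))
var[:] = np.arange(5)
ncfile.close()  # flushes the data to disk; without it ncdump only sees fill values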