rasterio icon indicating copy to clipboard operation
rasterio copied to clipboard

BUG: Return true minimum dtype when np.inf/nan present & handle empty arrays

Open groutr opened this issue 2 years ago • 10 comments

When get_minimum_dtype was called with np.inf or np.nan, it would return float64. This PR fixes it to only do the range check on the finite values.

Additionally, I return the smallest dtype when the containers are empty, though, it could be argued to let the exceptions bubble up. Another approach would be to return None (ie GDT_Unknown). What would be preferred approach? My opinion is that either the smallest dtype (bool) or None should be returned.

ping @sgillies

Split off of #2722

groutr avatar Jan 23 '23 18:01 groutr

Thanks for these changes. I think in the scenario when you don't have any data to go off of, it is safer to pick the largest dtype. I think that is why numpy uses float64 for an empty array.

>>> import numpy
>>> arr = numpy.array([])
>>> arr
array([], dtype=float64)

For that reason, I suggest we follow the logic numpy is using and instead default to float64 for empty arrays.

snowman2 avatar Jan 25 '23 03:01 snowman2

After thinking about it further, I think what you have makes sense for the "minimum" dtype. So, ignore my previous comment.

snowman2 avatar Jan 25 '23 04:01 snowman2

@snowman2 After looking over this PR some more, I realized there is some other subtle issues with get_minimum_dtype.

  1. Bool should be return if range [0, 1] (was returning uint8)
  2. Never returns complex dtype.

I restructured get_minimum_dtype to handle both of those issues and be more comprehensible.

After thinking it over more, None might be a more consistent return (or maybe even raise a ValueError?) on an empty container. There are two ways of thought, and I'm not sure which is the preferred way: 1) No values mean an unknown dtype (GDT_Unknown) and therefore return None/raise an error or 2) No values mean any dtype is possible and therefore return the smallest/minimum dtype. If get_minimum_dtype returned None or raised an error, it means every call would need to be checked and calling code would look like

# returning None
dtype = get_minimum_dtype([])
if dtype:
    np.can_cast(dtype, 'float32')
else:
    # no values, or some unrecognized dtype (not int/float/complex)

# raise an error
try:
    dtype = get_minimum_dtype([])
    np.can_cast(dtype, 'float32')
except ValueError:
    print("No values???")
else:
    print("Unknown dtype (not int/float/complex)

I could see the constant checks after calling get_minimum_dtype could be annoying.

groutr avatar Jan 25 '23 19:01 groutr

Bool should be return if range [0, 1] (was returning uint8)

See #2657

Boolean is not a GDAL type in the mapping: https://github.com/rasterio/rasterio/blob/6bb3e52e19826845701c335fb58c142d2d6baa53/rasterio/dtypes.py#L32-L45

In rasterize, it doesn't have the boolean dtype: https://github.com/rasterio/rasterio/blob/6bb3e52e19826845701c335fb58c142d2d6baa53/rasterio/features.py#L255-L257

Boolean appears to only be used for masks. From what I can tell, uint8 appears to be correct.

snowman2 avatar Jan 26 '23 03:01 snowman2

@groutr, I think you have raised some valid points. I think the inf/nan changes are good to go. However, the empty array logic appears to need more thought/discussion. Mind moving the empty array changes/discussion to a separate PR?

snowman2 avatar Jan 26 '23 03:01 snowman2

Never returns complex dtype.

Mind moving these changes into a separate PR as well? I thin this topic is "complex" enough to be on its own :smile:

snowman2 avatar Jan 26 '23 03:01 snowman2

@groutr, I think you have raised some valid points. I think the inf/nan changes are good to go. However, the empty array logic appears to need more thought/discussion. Mind moving the empty array changes/discussion to a separate PR?

Let's get #2734 merged first. Having several different open PRs against the same function is really confusing for me. With that PR merged, it would remove the non-array path and this PR could be simplified.

Regarding the bool return type, I was going off the docstring for get_minimum_dtype where it says the return is a "rasterio dtype string". What that is exactly isn't clear, but I took it to mean the these:

bool_ = 'bool'
ubyte = uint8 = 'uint8'
sbyte = int8 = 'int8'
uint16 = 'uint16'
int16 = 'int16'
uint32 = 'uint32'
int32 = 'int32'
uint64 = 'uint64'
int64 = 'int64'
float32 = 'float32'
float64 = 'float64'
complex_ = 'complex'
complex64 = 'complex64'
complex128 = 'complex128'
complex_int16 = "complex_int16"

groutr avatar Jan 26 '23 16:01 groutr

What that is exactly isn't clear, but I took it to mean the these

Your logic makes sense. However, since bool doesn't map to a GDAL dtype, this will cause troubles in the rasterize function it is is used in as it is looking for the rasterio dtype with a GDAL dtype equivalent.

To help make the function clearer, I am thinking that it may be a good idea to update the docstring.

snowman2 avatar Feb 06 '23 15:02 snowman2

@sgillies There are several questions in this PR that need guidance/clarification.

  1. What is the proper return for empty values? I feel None is the appropriate return value. Bool is another possible return or we could raise an error.
  2. Should get_minimum_dtype consider complex values? What about bool values?
  3. What exactly is a rasterio dtype string? Are they numpy dtypes or GDAL types? Should we consider the mapping idea proposed here: https://github.com/rasterio/rasterio/issues/2763#issuecomment-1423238440?

groutr avatar May 02 '23 21:05 groutr

@groutr @snowman2 how about this?

  1. get_minimum_dtype() should be calculated using np.nanmin() and np.nanmax().
  2. operation on an empty array should give a ValueError like numpy does: ("ValueError: zero-size array to reduction operation minimum which has no identity"). But maybe a little more clear 😄

Removing implicit use of this function from rasterize() makes its ergonomics more important, yes? We're kind of asking users to call it in their own code instead. What would you think about changing the function def to take multiple values so that a developer can plug in their dataset's min and max values without creating a list? Behavior like Python's min() or max(). For example get_minimum_dtype(-9999, 15).

sgillies avatar Jan 05 '24 16:01 sgillies