
RFC: add data type inspection utilities to the array API specification

Open kgryte opened this issue 2 years ago • 27 comments

This RFC proposes adding data type inspection utilities to the array API specification.

Overview

Currently, the array API specification requires that conforming implementations provide a specified set of data type objects (see https://data-apis.org/array-api/2021.12/API_specification/data_types.html) and casting functions (see https://data-apis.org/array-api/2021.12/API_specification/data_type_functions.html).

However, the specification does not include APIs for array data type inspection (e.g., an API for determining whether an array has a complex number data type or a floating-point data type).

Prior Art

NumPy and its derivatives have dtype objects with extensive properties, including a kind property, which returns a character code indicating the general "kind" of data. For example, for relevant dtypes in the specification, NumPy uses the following character codes:

  • b: boolean
  • i: signed integer
  • u: unsigned integer
  • f: floating-point (real-valued)
  • c: complex floating-point
In [1]: np.zeros((3,4)).dtype.kind
Out[1]: 'f'

The kind property is useful when one wants to branch based on input array data types (e.g., when selecting a summation algorithm).

if x.dtype.kind == 'f':
    # do one thing
else:
    # do another thing

In PyTorch, dtype objects have is_complex and is_floating_point properties for checking a data type "kind".

Additionally, PyTorch offers functional APIs is_complex and is_floating_point, which provide equivalent behavior.
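
For illustration, a minimal sketch of both flavors (assuming a recent PyTorch; note that complex dtypes do not count as floating point here):

import torch

x = torch.zeros((3, 4), dtype=torch.complex64)

# dtype properties
x.dtype.is_complex           # True
x.dtype.is_floating_point    # False: complex dtypes are not "floating point" in PyTorch

# equivalent functional APIs, which take a tensor
torch.is_complex(x)          # True
torch.is_floating_point(x)   # False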

Proposal

Given the proposal for adding complex number support to the specification (see https://github.com/data-apis/array-api/issues/373 and https://github.com/data-apis/array-api/pull/418), the need grows for the specification to require that conforming implementations provide standardized means of data type inspection.

For example, conforming implementations will need to branch in abs(x) depending on whether x is real-valued or complex-valued. Similarly, in downstream user code, we can expect that users will inevitably encounter situations where they need to branch based on input array data types (e.g., when choosing summation algorithms).
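
A minimal NumPy-flavored sketch of that kind of branch (the function name my_abs is hypothetical; under this proposal, the dtype.kind check would be replaced by the standardized inspection API):

import numpy as np

def my_abs(x):
    # implementations of abs() must branch on whether the input is complex-valued
    if x.dtype.kind == 'c':
        return np.hypot(x.real, x.imag)  # |a + bj| = sqrt(a**2 + b**2)
    return np.abs(x)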

As this specification has favored functional APIs, this RFC follows suit and proposes adding the following APIs to the specification:

has_complex_float_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a complex number data type (e.g., complex64 or complex128).

has_real_float_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a real-valued floating-point number data type (e.g., float32 or float64).

has_float_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a complex or real-valued floating-point number data type (e.g., float32, float64, complex64, or complex128).

has_unsigned_int_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has an unsigned integer data type (e.g., uint8, uint16, uint32, or uint64).

has_signed_int_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a signed integer data type (e.g., int8, int16, int32, or int64).

has_int_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has an integer (signed or unsigned) data type.

has_real_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a real-valued (integer or floating-point) data type.

has_bool_dtype(x: Union[array, dtype]) -> bool

Returns a bool indicating whether the input array (or dtype) has a boolean data type.


The above APIs cover the list of data types currently described in the specification, are sufficiently specific to cover most use cases, and can be composed to address most anticipated data type set combinations.
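
For illustration, a sketch of how two of the proposed APIs could be used for input validation. The NumPy-based stand-in implementations are hypothetical; a conforming library would provide these functions natively:

import numpy as np

# hypothetical stand-ins for the proposed APIs, built on NumPy kind codes
def has_real_dtype(x) -> bool:
    # real-valued: signed/unsigned integer or real floating-point
    return np.asarray(x).dtype.kind in "iuf"

def has_int_dtype(x) -> bool:
    return np.asarray(x).dtype.kind in "iu"

def checked_sum(x):
    if not has_real_dtype(x):
        raise TypeError("checked_sum requires a real-valued data type")
    if has_int_dtype(x):
        return int(np.sum(x))                   # exact integer path
    return float(np.sum(x, dtype=np.float64))   # accumulate in double precision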

kgryte avatar May 05 '22 07:05 kgryte

Also related ( https://github.com/data-apis/array-api/issues/152 )

Edit: If we look into type naming (briefly discussed), this discussion around type naming in Zarr may be of interest ( https://github.com/zarr-developers/zarr-specs/issues/131 )

jakirkham avatar May 05 '22 17:05 jakirkham

Like can_cast() or result_type(), could these utils take both dtypes and arrays? I'd personally want these utils for dtype objects themselves, but definitely my own use cases are not quite aligned with most array consumers.

honno avatar May 18 '22 09:05 honno

Update: I've updated the OP as follows based on feedback here and in the last array API consortium meeting.

  1. Functions now accept both arrays and dtypes.
  2. Function names include a _dtype suffix (as suggested during the consortium meeting).
  3. Function names begin with a has_ prefix. This helps avoid conflicts with existing APIs (e.g., PyTorch) and matches how one might describe an array (e.g., has shape X, has data type Y, etc).
  4. Included both real-valued and generic float APIs to match specification data type categories.
  5. Included a generic real dtype API to match specification data type categories.

kgryte avatar May 23 '22 09:05 kgryte

Some more prior art:

  • TensorFlow has methods like is_bool on DType objects: https://www.tensorflow.org/api_docs/python/tf/dtypes/DType
  • JAX doesn't have anything other than a subset of numpy APIs (result_type, can_cast, promote_dtypes): https://jax.readthedocs.io/en/latest/_modules/jax/_src/dtypes.html
  • NumPy issue on there being too many ways of comparing dtypes: https://github.com/numpy/numpy/issues/17325

As this specification has favored functional APIs

Given that dtype objects are immutable and have no state, this should also work for JAX et al. Not saying that that's my preference (I'm not yet sure), but this RFC proposes a lot of functions ...

3 underscores in a name like has_real_float_dtype is also not ideal.

rgommers avatar May 23 '22 18:05 rgommers

Given that dtype objects are immutable and have no state, this should also work for JAX et al. Not saying that that's my preference (I'm not yet sure), but this RFC proposes a lot of functions ...

Whether methods or functions, surface area would be the same. The list can obviously be culled; however, I do think there is some advantage to matching the categories as used in the spec, especially for providing consistent APIs for input argument validation.

3 underscores in a name like has_real_float_dtype is also not ideal.

The number of underscores is not super important, IMO. Instead, we're probably concerned about the number of characters. Originally, I left out the _dtype suffix, which would reduce the function name length; however, consortium members voiced a desire for such a suffix in the array API meeting.

I don't have a strong opinion here, although the current naming convention is arguably more literate.

kgryte avatar May 23 '22 19:05 kgryte

Silly question, why not do:

if array.dtype in <set_of_dtypes>:
    ...

and require implementations to provide some predefined sets, such as "set of all supported integer dtypes" or "set of all supported floating point dtypes"?
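
For example, a minimal sketch of such predefined sets, using NumPy dtypes as a stand-in (the set names are illustrative, not standardized):

import numpy as np

integer_dtypes = frozenset(np.dtype(name) for name in
                           ("int8", "int16", "int32", "int64",
                            "uint8", "uint16", "uint32", "uint64"))
floating_dtypes = frozenset(np.dtype(name) for name in ("float32", "float64"))

x = np.arange(5)
if x.dtype in integer_dtypes:
    print("integer input")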

vnmabus avatar Jul 14 '22 07:07 vnmabus

and require implementations to provide some predefined sets, such as "set of all supported integer dtypes" or "set of all supported floating point dtypes"?

That does seem more appealing indeed; it's what can already be done today and it reads fairly well. I think I prefer that over both the has_* functions in this proposal and the numpy issubdtype design.

rgommers avatar Jul 18 '22 17:07 rgommers

I don't like issubdtype. For NumPy, I could imagine isinstance(arr.dtype, InexactDType) (or similar). So that way the API here would be isinstance(arr.dtype, some_object). The problem is that I am not sure if an isinstance API would work for everyone.

For arr.dtype in set_of_dtypes there are ~two~ three things to keep in mind:

  1. The set_of_dtypes will be different for each library, because bfloat16, float16, and others do not exist for all implementers. Implementers can extend the API after all.
  2. For NumPy, users may extend the API reasonably soon, for example by adding bfloat16 or a multi-precision float object.
  3. Sets might be tricky right now for NumPy in either case (although that could likely be made to work). There are an arbitrary number of possible instances for dtypes; although they should compare equal within a limited set, that set is confusingly large (byte-order matters).

I do think none of these is particularly problematic. But I would say that this would not be a set, but rather an opaque object that supports the in operator.
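
A quick illustration of the byte-order wrinkle in point 3, using NumPy:

>>> import numpy as np
>>> np.dtype('<f8') == np.dtype('>f8')  # same float64, different byte order
False
>>> np.dtype('<f8') in {np.dtype('>f8')}
False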

seberg avatar Jul 18 '22 17:07 seberg

A minor pro of dtype sets is that they could be a way for a library to communicate which dtypes it supports (thinking of PyTorch, which only supports uint8 among the unsigned integers). Useful here and there, like telling Hypothesis not to try generating uint{16/32/64}.

honno avatar Jul 21 '22 17:07 honno

@leofang pointed out that this is blocking for adding real and conj (and I imagine imag too), so it'd be great to finalize this. The majority of folks who have weighed in seem to prefer a set/collection type of approach. So here's a suggested API for that, in line with @seberg's last sentence above.

  1. There must be objects integer_dtypes, floating_point_dtypes, and complex_dtypes,
  2. The syntax dtype in xxx_dtypes must yield a boolean value with the expected result (to be detailed out more in the spec),
  3. xxx_dtypes must contain all the expected dtypes that are part of the standard, and may contain additional dtypes of the same kind
  4. The objects may be of any kind, e.g. a set or a custom class instance.

Other thoughts:

  • No object for boolean is needed, because bool already supports __eq__, so array.dtype == bool is enough.
  • Also no separate signed/unsigned integer objects, because that's a bit much for the API / less needed. This is mostly a convenient way to spell array.dtype in (dtype1, dtype2, ...) anyway.
  • The one name where there's not a single obvious choice is floating_point_dtypes. It could also be float_dtypes, floating_dtypes, or real_dtypes for example.
  • Not specifying the type of these objects is on purpose, to make it easy to, for example, have an API that adds user-defined dtypes in.
    • That means that for static typing we need another Protocol. Not completely ideal, but imho better than restricting implementation choices for libraries (see point 3 in @seberg's comment above about why set is tricky for NumPy).
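
A minimal sketch of what such a Protocol might look like (the name DTypeCollection is hypothetical):

from typing import Protocol

class DTypeCollection(Protocol):
    # any object supporting `dtype in collection` checks would conform
    def __contains__(self, dtype: object, /) -> bool: ...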

One alternative with a similar API surface is to add 3 functions with the same functionality instead. Those functions could be 3 of the ones in the issue description here (e.g., has_integer_dtype, has_floating_point_dtype, has_complex_dtype). Considerations:

  • Pro: it's better for static typing,
  • Con: it introduces an asymmetry between supported and unsupported sets - we need the dtype in xxx anyway when the predefined objects aren't the right ones.

I think the con is more important than the pro here. But I'd say either choice is pretty reasonable here.

rgommers avatar Sep 06 '22 17:09 rgommers

Just to make sure @leofang, both flavors are fine for accelerators, right? When the spec says something should return a bool, that's not a problem - only Python control flow like if _expr_yielding_a_bool is. So a function is not preferred from that perspective. Or maybe there's a significant amount of extra implementation complexity for the dtype in xxx_dtypes version?

rgommers avatar Sep 06 '22 17:09 rgommers

My preference would be to match more closely the spec on this. Namely, have the following objects:

  • numeric_dtypes: int8...64, uint8...64, float32/64, complex64/128
  • real_dtypes: int8...64, uint8...64, float32/64
  • float_dtypes: float32/64, complex64/128
  • real_float_dtypes: float32/64
  • complex_float_dtypes: complex64/128
  • integer_dtypes: int8...64, uint8...64

This would mean 6 objects, which would, as it stands now, cover almost the entirety of the spec. As these are relatively trivial to implement and expose, I don't see this as imposing an undue burden on array libraries.

However, with only integer, float, and complex collections, repeating them to reconstruct the composite groups used in the spec, in both userland and library implementations, would be mildly annoying and would possibly just lead array libraries to implement the composite groups anyway.

E.g., suppose we want to validate an array for a function which supports all numeric dtypes. With just integer, float, and complex collections, I'd need to do

def foo(x: array):
    dt = x.dtype
    if dt in integer_dtypes or dt in float_dtypes or dt in complex_dtypes:
        ...

Given the opacity of what's intended in the conditional, one might be tempted to write a small helper function transforming the check to something more literate. And given the ubiquity of composite dtype categories in the spec, I'd argue we should just include the composite groups in the spec directly so that array library clients don't need to reimplement these groups from library to library.
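
For instance, the small helper one might be tempted to write, sketched here against NumPy's kind codes (the name has_numeric_dtype is hypothetical):

import numpy as np

def has_numeric_dtype(x) -> bool:
    # numeric = signed/unsigned integer, real or complex floating-point
    return np.asarray(x).dtype.kind in "iufc"

def foo(x):
    if has_numeric_dtype(x):  # reads better than three chained `in` checks
        ...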

kgryte avatar Sep 08 '22 06:09 kgryte

E.g., suppose we want to validate an array for a function which supports all numeric dtypes. With just integer, float, and complex collections, I'd need to do

This is a good point. Although in general this isn't done for library code, even if the library provided string/object/etc. dtypes. It is difficult to pick the right sets here.

My preference would be to match more closely the spec on this. Namely, have the following objects:

I don't think that will work; the names don't map to current practice and are not intuitive enough. float_dtypes in particular is bad. See torch.is_floating_point and, for numpy:

>>> x = np.ones(2, dtype=np.float64)
>>> x2 = np.ones(2, dtype=np.complex128)
>>> np.issubdtype(x.dtype, np.floating)
True
>>> np.issubdtype(x2.dtype, np.floating)
False

rgommers avatar Sep 08 '22 06:09 rgommers

Understood. We're not starting from a blank slate. Although, presumably, at least for Torch, the need for is_complex and is_floating_point would no longer exist, opening up a path to eventual deprecation.

For NumPy, well, 🤷‍♂️.

The notion of what is considered a "floating-point" dtype arose previously in the consortium. Then, it was decided that under the umbrella of floating-point are both real and complex. Hence, the OP.

Unfortunately, however, I don't have, atm, a more intuitive name for "real + complex floating-point dtypes", but I don't think this negates the general desirability of composite groups.

kgryte avatar Sep 08 '22 07:09 kgryte

NumPy calls real + complex floating point types inexact (not that the name is used often). The only problem with using floating point for both is that people may just not think about complex at all, and it may be a false friend in practice.

I have to think some more about the API choices we have here. I currently think that whatever definition we have for floating point dtypes should not be fixed to the standard? (Rather it should be valid to extend it e.g. with bfloat16?)

The last point might mean that we should have an API to allow users to spell isdtype(arr.dtype, (float32, float64)), or isofdtype(arr, (float32, float64)).

Annoyingly, NumPy doesn't even have a good way to spell it. Basically, you would first use issubdtype or dtype.kind == "f" to check for floating (not future proof/extensible). And then use can_cast or <= on the dtypes, because == on dtypes is too strict to be useful. (dtype "equality" is in practice more like dtype1 <= dtype2 and dtype2 <= dtype1 rather than dtype1 == dtype2)

seberg avatar Sep 08 '22 07:09 seberg

I have to think some more about the API choices we have here. I currently think that whatever definition we have for floating point dtypes should not be fixed to the standard? (Rather it should be valid to extend it e.g. with bfloat16?)

It should always be the case that the standard says "must contain (x, y, z)" - and either explicitly saying or implying that it's fine to also contain other things.

The last point might mean that we should have an API to allow users to spell isdtype(arr.dtype, (float32, float64)), or isofdtype(arr, (float32, float64)).

Yeah, I was just thinking in that direction as well. If it's hard to figure out what sets we may need, plus we need it to be easy to extend, plus we have naming issues with "float", then perhaps something like this which is concise and explicit:

def has_dtype(x: Union[array, dtype], kind: Union[str, dtype, tuple[Union[str, dtype], ...]]) -> bool:
    """
    Examples
    --------
    >>> has_dtype(x, 'integer')
    >>> has_dtype(x, 'real')
    >>> has_dtype(x, 'complex')
    >>> has_dtype(x, ('real', 'complex'))  # avoid both 'floating' and 'inexact', those are not good names
    >>> has_dtype(x, 'numeric')  # shorthand for ('integer', 'real', 'complex')
    >>> has_dtype(x, 'signed integer')
    >>> has_dtype(x, 'unsigned integer')
    >>> has_dtype(x, (float64, complex128))
    >>> has_dtype(x, int32)  # supports dtype subclasses for libraries that support those, unlike `== int32`
    """

A couple of thoughts for why this may be nice:

  • It's quite different from existing APIs, so no introduction issues or users confusing it with very similarly-named APIs,
  • It's pretty concise and explicit,
  • It avoids confusion around what "floating" or "floating-point" means,
  • It's only a single API addition - I think this is helpful; 3 would still be okay but appears to not be enough, and 6-8 or more is too much imho,
  • It's extensible, unlike any fixed collections of dtypes encoded in API name.

rgommers avatar Sep 08 '22 08:09 rgommers

I will note that I am slightly unsure about the Union[array, dtype]. Not a concrete concern though; it is probably that NumPy is quite relaxed about what it accepts as a dtype (and also array-like...), which makes these unions feel brittle/unclear to me.

seberg avatar Sep 08 '22 08:09 seberg

Yeah, I was just thinking in that direction as well. If it's hard to figure out what sets we may need, plus we need it to be easy to extend, plus we have naming issues with "float", then perhaps something like this which is concise and explicit

  • In case we go in that direction, I would not limit the type of kind to tuples, but I would accept any kind of collection.
  • If there are in the future API functions that can only be applied to a particular dtype, and we want Mypy to warn about it, with the set approach we could define a Protocol to tag the different kinds of types, and defining the different sets with the appropriate Protocol type would make Mypy narrow types after the check, I think. With this approach, has_dtype must return a TypeGuard to do the narrowing, which depends on the kind parameter. It could be done with overloads, except for combinations of strings like ("real", "complex") AFAIK.

I think all the examples except the last (which IMHO seems like a different thing) can be done with sets:

x in xp.integer_dtypes
x in xp.real_dtypes
x in xp.complex_dtypes
x in xp.real_dtypes | xp.complex_dtypes
x in xp.numeric_dtypes
x in xp.integer_dtypes - xp.unsigned_dtypes # If we only want to add unsigned as a special set
x in xp.integer_dtypes & xp.unsigned_dtypes # I don't think there are unsigned dtypes that are not integers but just to be sure
x in {xp.float64, xp.complex128}
x == xp.int32 or (isinstance(xp.int32, type) and isinstance(x, xp.int32))

vnmabus avatar Sep 08 '22 08:09 vnmabus

That does look nicer syntactically, thanks @vnmabus. Rather than x in xp.integer_dtypes it should be

x.dtype in xp.integer_dtypes

then I think - we can do the union of array and dtype in a function, but not if it's a set. Which is perfectly fine. The main thing that won't work I believe is dtype subclasses. dtype in a_set should compare with ==, not with isinstance. And explicit isinstance cannot work:

>>> int32 = 'int32'
>>> np.int32 = 'int32'  # example to simulate a library with string identifiers for dtypes
>>> isinstance(int32, np.int32)
Traceback (most recent call last):
  Input In [11] in <cell line: 1>
    isinstance(int32, np.int32)
TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

rgommers avatar Sep 08 '22 09:09 rgommers

I will note that I am slightly unsure about the Union[array, dtype]

I mentioned this in a call a few weeks back: the pandas is_foo_dtype functions accept both and that is a design choice we have largely come to regret. The performance overhead of checking for a .dtype attribute adds up quickly.

jbrockmendel avatar Sep 08 '22 15:09 jbrockmendel

Good point @jbrockmendel. I never noticed it in numpy, but did a quick check and yes these checks are expensive (still fast though):

>>> import numpy as np
>>> real_dtypes = {np.float16, np.float32, np.float64, np.longdouble}
>>> %timeit np.float64 in real_dtypes
48.7 ns ± 0.211 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit np.issubdtype(np.float64, np.floating)
257 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

The main thing that won't work I believe is dtype subclasses.

To circle back to this, this is true only for user-defined dtypes. If those exist (which may be unique to numpy), it's perhaps okay to then require registering them somehow so they get added to xxx_dtypes.

rgommers avatar Sep 08 '22 15:09 rgommers

Just to be sure, custom dtypes (working for each backend) won't ever be added to the standard, right? It would be great to be able to implement the logic for a custom dtype once and have it working everywhere, but probably that would be difficult to standardize.

I saw that for units, for example, you recommended wrapping the API backends instead, in https://discuss.scientific-python.org/t/advice-and-guidance-about-array-api-for-a-units-package/.

vnmabus avatar Sep 08 '22 16:09 vnmabus

I think it's safe to say that custom dtype support won't be added. Most libraries don't have it, and for NumPy it's still a work-in-progress to define a good API with all the functionality downstream library authors may need.

That said, it would be nice that standard-compliant code like x.dtype in real_dtypes in libraries like SciPy and scikit-learn will work for those NumPy users that do end up creating their own dtype. I think it will, as long as NumPy has an API that allows those users to extend real_dtypes with their new dtype.

rgommers avatar Sep 08 '22 16:09 rgommers

Let me make a few points for why I am leaning against the set approach, although it is still not quite clear cut:

  • The non-set approach is similar to isinstance. I am not sure the set approach has a clear inspiration e.g. in typing? (The notation of using set operations has, but checking with in?)
  • The set approach just feels a bit too smart to me. :)
  • If floating_dtypes is just a set/tuple (NumPy cannot do that, I think), it is not clear that arr in floating_dtypes would raise an error rather than always returning False (it must be arr.dtype in floating_dtypes).
  • In NumPy, I can see things like is_of_dtype/has_dtype(1., np.floating) making sense. Where 1. is actually just a Python float. Allowing to generalize "dtype checking" to objects that may not have a .dtype attribute. Yes, this would be to have better support of scalars, which ideally are not supposed to exist here. Would this be useful e.g. for pandas, @jbrockmendel (since I always wonder if pandas has more need of scalars than an array API)?
  • In NumPy it would be nice to use np.floating for this, but that is also the scalar type, which may lead to somewhat strange overloading. If we have two functions (has_dtype and is_dtype), that becomes unproblematic. (An error could point to the other where appropriate.)

In the end, I am not certain yet that the set approach works well for NumPy proper. Of course that is not actually a blocker for this API since there can be differences.

seberg avatar Sep 13 '22 08:09 seberg

Would this be useful e.g. for pandas, @jbrockmendel (since I always wonder if pandas has more need of scalars than an array API)?

IIUC, I don't think it's likely pandas would change our current usage

jbrockmendel avatar Sep 13 '22 15:09 jbrockmendel

Thanks, @seberg, for the nice thoughts. Just wanna add a quick note.

  • If floating_dtypes is just a set/tuple (NumPy cannot do that, I think), it is not clear that arr in floating_dtypes would raise an error rather than always returning False (it must be arr.dtype in floating_dtypes).

This is a very nice point. It seems floating_dtypes cannot be a plain set/tuple, but must be at least a subclass with a custom __contains__ that first checks the type of the object before delegating to the parent class's in check.
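
A minimal sketch of such a subclass, assuming (as in NumPy) that arrays expose a dtype attribute while dtype objects themselves do not:

import numpy as np

class DTypeSet(frozenset):
    def __contains__(self, item):
        if hasattr(item, "dtype"):  # an array was passed instead of a dtype
            raise TypeError("expected a dtype; use `x.dtype in ...`, not `x in ...`")
        return super().__contains__(item)

floating_dtypes = DTypeSet({np.dtype("float32"), np.dtype("float64")})
np.dtype("float64") in floating_dtypes   # True
# np.zeros(3) in floating_dtypes         # raises TypeError instead of silently returning False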

  • In NumPy, I can see things like is_of_dtype/has_dtype(1., np.floating) making sense. Where 1. is actually just a Python float. Allowing to generalize "dtype checking" to objects that may not have a .dtype attribute.

Also a very good point. Since we include the Python types in the type lattice, I think it is legitimate to do such a check even if we don't plan to support scalars.

leofang avatar Sep 13 '22 16:09 leofang

So it looks like we're (a) leaning towards the single-function version, and (b) only having it accept either a dtype or an array (avoiding the union of both).

For (b), most of the time the thing to check is an array. However, dtype checking is also needed, and getting a dtype from an array is trivial while an array from a dtype is not. If the input was an array, has_dtype is a logical name. If it's a dtype, I think is_dtype is better. That is also a name that AFAIK isn't used anywhere.

So we'd be looking at some flavor of:

def is_dtype(x: dtype, kind: Union[str, dtype, tuple[Union[str, dtype], ...]]) -> bool:
    """
    >>> is_dtype(x, 'integer')
    >>> is_dtype(x, 'real')
    >>> is_dtype(x, 'complex')
    >>> is_dtype(x, ('real', 'complex'))  # avoid both 'floating' and 'inexact', those are not good names
    >>> is_dtype(x, 'numeric')  # shorthand for ('integer', 'real', 'complex')
    >>> is_dtype(x, 'signed integer')
    >>> is_dtype(x, 'unsigned integer')
    >>> is_dtype(x, float32)
    >>> is_dtype(x, (float64, complex128))
    """

or

def is_dtype(x: dtype, kind: str) -> bool:
    """
    >>> is_dtype(x, 'integer')
    >>> is_dtype(x, 'real')
    >>> is_dtype(x, 'complex')
    >>> is_dtype(x, 'numeric') 
    >>> is_dtype(x, 'signed integer')
    >>> is_dtype(x, 'unsigned integer')
    """

or something in between (e.g., kind: str | dtype).

Looking at the np.issubdtype usage in SciPy, there's a roughly equal mix between checking against a set of dtypes (e.g., np.issubdtype(dtype, np.complexfloating)) and checking against a single dtype (e.g., np.issubdtype(dtype, np.int32)). Both seem kinda useful. A combination (tuple of sets/dtypes) is probably not necessary.

So perhaps this is the way to go:

def is_dtype(x: dtype, kind: str | dtype) -> bool:

rgommers avatar Sep 20 '22 19:09 rgommers

We had another look at this yesterday. We want to go for a flavor of the function-based implementation here; there was no clear preference among the options above. So let's try a vote - use emojis on this comment:

  • 👍🏼 if you prefer is_dtype(x: dtype, kind: str)
  • 🎉 if you prefer is_dtype(x: dtype, kind: str | dtype)
  • 🚀 if you prefer is_dtype(x: dtype, kind: str | dtype | tuple[Union[str, dtype], ...])

rgommers avatar Oct 07 '22 10:10 rgommers

I know I'm very late to this discussion, but as the array API is now implemented in NumPy, I've been exploring what it would do to my code.

A couple of problems with is_dtype, compared to ordinary Python types (as proposed by seberg), are that:

  • is_dtype uses strings, which can be error prone. If you have a typo, it may not be caught until you run your program and run into the offending code. Yes, you can annotate is_dtype with Literal, but if the type codes are passed from other functions (as str), then the validation won't happen. It feels more ergonomic to me to have objects rather than magic strings.

    One of the things I love about the Array API is the constrained interface that feels way less bug prone. It's not a huge burden to have to import a special object instead of using a string, and it prevents mistakes and allows type-checkers to find errors. It's the same reason people generally prefer enumeration objects over strings.

  • The various kinds cannot be checked by type checkers. Right now, it's possible to annotate an array as numpy.typing.NDArray[np.floating[Any]]. I do this for various numpy array types, and this catches many bugs thanks to numpy's excellent implementation of type annotations. If you don't provide base classes, then how are you supposed to have these annotations?

    If I were to vote, I would have voted for:

    is_dtype(x: dtype, kind: dtype | tuple[dtype, ...])
    

    which may as well have been written as simply issubclass(x.type, kind).

I personally prefer seberg's proposal to use ordinary Python issubclass with a tree of Python types. Any thoughts on this? With is_dtype, how can I accomplish the above type annotations?
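
For reference, the issubclass spelling as it works in NumPy today:

>>> import numpy as np
>>> x = np.zeros(3)
>>> issubclass(x.dtype.type, np.floating)
True
>>> issubclass(x.dtype.type, (np.integer, np.complexfloating))
False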

NeilGirdhar avatar Dec 23 '22 13:12 NeilGirdhar

It's the same reason people generally prefer enumeration objects over strings.

I think this isn't really true? At least, I can't think of many APIs where enums are common, while I can think of lots of libraries that use string args for keywords.

Enums have a major design flaw - namespace cluttering. Imho that is far more important, also for ergonomics, than static type checking.

The array API standard doesn't have many strings, but if NumPy had enums instead of strings or True/False/None keywords everywhere, that would be hundreds of extra objects.

With is_dtype, how can I accomplish the above type annotations?

I think we still have a more fundamental issue to solve: how to annotate arrays themselves. This should be done using a Protocol I believe, see gh-229.

The same will apply to other objects. Given that we have to be concerned about usage by consuming libraries and end users in an array-library-agnostic way, where it's effectively impossible for objects to have a class relationship, this is nontrivial to design. We haven't spent a whole lot of time on that aspect yet - and we should do that.

The array[dtype] question is one level more complex. And it's not just dtype; there's also device, dimensionality, etc. Even in NumPy this is still very much a work in progress. It's probably best to split that off into a new issue - I don't think dtypes having a class hierarchy or not is the primary issue here.

rgommers avatar Dec 23 '22 14:12 rgommers