
RFC: SciPy array types & libraries support

rgommers opened this issue 2 years ago • 58 comments

This is a long read and a proposal with quite a large scope and a decent amount of backwards-compatibility impact. I hope it'll make SciPy's behavior much more consistent and predictable, as well as yield significant performance gains. I'll post this on the mailing list too. I'd suggest discussing the big picture first; I'll also ask for process suggestions on the mailing list in case we'd like to split up the discussion, open a doc/proposal PR for detailed review, or anything like that.

The basic design principle I propose aiming for in all of SciPy for array libraries is: container type in == container type out.

Python sequences (lists, tuples, etc.) will continue to be converted to NumPy arrays, as will other unknown types that are coercible with np.asarray.

The scope of this proposal is: how to treat array inputs. This includes different kinds of NumPy arrays and ndarrays with particular dtypes, as well as array/tensor objects coming from other libraries.

Out of scope for this proposal are (a) dataframe libraries, and (b) implementation of a dispatch mechanism for when non-numpy array types hit C/C++/Cython/Fortran code inside SciPy. Both of these topics are touched upon in the Appendix section.

I'll dive straight into the design here; for context on why we'd want/need this design or what has been done and discussed before, see the Context and Problems & opportunities sections further down.

array/tensor types support

Array types and how to treat them:

| Input type | Return type | When | Behavior | Notes |
|------------|-------------|------|----------|-------|
| numpy.ndarray | numpy.ndarray | always | | |
| torch.Tensor (CPU) | torch.Tensor | always | | |
| torch.Tensor (GPU) | torch.Tensor | when possible | raise TypeError otherwise | for pure Python implementations for now; see text below for more details |
| cupy.ndarray | cupy.ndarray | when possible | raise TypeError otherwise | |
| has __array_namespace__ | same as input type | when possible | raise TypeError otherwise | |
| array with object dtype | n/a | | raise TypeError | |
| numpy.matrix | n/a | for NumPy >= 2.0 (or always?) | raise TypeError | |
| numpy.ma.MaskedArray | numpy.ma.MaskedArray | only for stats.mstats | raise TypeError otherwise | future nan_policy/mstats plans are unaffected by this |
| numpy.ndarray subclasses | same subclass type | always | | so always use np.asanyarray checks |
| torch.Tensor subclasses | same subclass type | | same as for torch.Tensor on CPU/GPU | |
| PyData/Sparse arrays | same sparse array type | once it has array API support | raise TypeError otherwise | TypeError is already the current state; no auto-densification |
| dask.Array | dask.Array | once it has array API support | raise TypeError otherwise (?) | bc-breaking, but mixing with the current conversion to numpy arrays may not be healthy |
| jax.numpy.Array | jax.numpy.Array | once added to array-api-compat | | same as PyTorch for CPU and GPU |

When a non-NumPy array type sees compiled code in SciPy (which tends to use the NumPy C API), we have a couple of options:

  1. dispatch back to the other library (PyTorch, CuPy, etc.).
  2. convert to a NumPy array when possible (i.e., on CPU via the buffer protocol, DLPack, or __array__), use the compiled code in question, then convert back.
  3. don't support this at all

I'll note that (1) is the long-term goal; how to implement this is out of scope for this proposal - for more on that, see the Appendix section. For now we choose to do (2) when possible, and (3) otherwise. Switching from that approach to (1) in the future will be backwards-compatible.
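
To make option (2) concrete, here's a rough sketch of the convert-use-convert-back pattern (the names _call_compiled and kernel are just for illustration, not a proposed API):

import numpy as np

def _call_compiled(x, xp, kernel):
    # Option (2): convert a CPU array from another library to NumPy
    # (np.asarray will use the buffer protocol or __array__ on CPU),
    # run the compiled kernel, then convert the result back.
    try:
        x_np = np.asarray(x)
    except (TypeError, ValueError):
        # Option (3): GPU or otherwise non-convertible input
        raise TypeError(
            f"compiled code in SciPy does not support {type(x)} input")
    return xp.asarray(kernel(x_np))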

A note on numpy.matrix: the only place this is needed is scipy.sparse; it can be vendored there, and for NumPy >= 2.0 instances of the vendored matrix code can be returned. That allows deprecating it in NumPy. We need to support scipy.sparse.*_matrix long-term for backwards compatibility (they're too widely used to deprecate); however, for new code we have sparse.*_array instances and PyData/Sparse.

Regarding array API support: when it's present in an array library, SciPy will require it to be complete (v2022.12), including complex number support and the linalg and fft submodules - supporting partial implementations without linalg support in particular seems unnecessary.

For as-yet-unsupported GPU execution when hitting compiled code, we will raise exceptions. The alternative considered was to transfer to CPU, execute, and transfer back (e.g., for PyTorch). A pro of doing that would be that everything works, and there may still be performance gains. A con is that it silently does device transfers, which is usually not a good idea. On balance, something like this is only a good idea if there's a well-defined plan to make GPU execution work for most functionality on a reasonable time scale (~12-18 months max), which means both addressing the dispatch (uarray & co) problem that I am trying hard to avoid diving into here, and having time commitments for doing the work. Since we don't have that, raising exceptions is the way to go.

Development and introduction strategy

The following strategy is meant to allow implementing support in a gradual way, and have control over when to default to a change in behavior - which necessarily is going to have some backwards compatibility impact.

  • develop behind an experimental flag (an environment variable and a global setting starting with an underscore)
  • copy the scikit-learn approach, using array-api-compat as an optional dependency for now
  • switch the new behavior of "input type in == input type out" by default once a submodule has complete support
    • at that point, vendor array-api-compat
    • may be a major release if it can all be done in one release cycle; otherwise do it submodule by submodule
  • explicit support can and should be tested in CI for: PyTorch CPU tensors, numpy.array_api for compliance testing in APIs that start supporting array API compliant input, and perhaps pandas Series/DataFrame. GPU CI may materialize at some point, but let's not count on or plan for that.
  • implement new type coercion machinery centrally in scipy._lib and depend on that from other submodules. That means we'll have a single replacement for np.asarray/np.asanyarray in one central location (sketch below).
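
As a rough sketch of what that central helper could look like (the function name and exact checks are placeholders, not a final API; array_namespace and is_array_api_obj are helpers from array-api-compat):

import numpy as np
from array_api_compat import array_namespace, is_array_api_obj

def _asarray_validated(*arrays):
    # Central replacement for np.asarray/np.asanyarray: reject the
    # problematic types from the table above, coerce sequences and
    # unknown types to ndarray, and pass supported array types
    # through unchanged together with their shared namespace.
    checked = []
    for a in arrays:
        if isinstance(a, np.ma.MaskedArray):
            raise TypeError("masked arrays are not supported")
        if isinstance(a, np.matrix):
            raise TypeError("numpy.matrix is not supported")
        if not is_array_api_obj(a):
            a = np.asanyarray(a)  # lists, tuples, scalars, ...
        if isinstance(a, np.ndarray) and a.dtype == object:
            raise TypeError("object arrays are not supported")
        checked.append(a)
    # array_namespace raises TypeError when arrays come from
    # different libraries
    return checked, array_namespace(*checked)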

Context

  • For NumPy 2.0 we should have full array API standard support in the main numpy namespace
  • The numpy.array_api module is useful for testing that code is actually using only standard APIs, because it's a minimal implementation. It's not going to be used as a layer for production usage though, it's for portability testing only.
  • The array-api-compat library provides the compatibility we need right now between NumPy, CuPy and PyTorch. Other libraries can be added there too. This bridges any gaps between the standard and real-world implementations. The package is light-weight and can be depended on - but is also designed to be vendored to avoid a new runtime dependency.
  • scikit-learn has had experimental array API support for a little while, and has now settled on array-api-compat as the way to go. This looks quite a bit nicer than the previous approach with numpy.array_api, and the work to support PyTorch that way has been quite well-received.
  • As for the myriad of open NEPs on interoperability topics:
    • NEP 30 (__duckarray__), NEP 31 (uarray) and NEP 37 (__array_module__) are all effectively dead - I'll propose rejecting these.
    • NEP 47 (array API support via numpy.array_api) has been implemented, however I'm going to propose marking it superseded via a new NEP for the main numpy namespace
    • __array_function__ and __array_ufunc__ are fully implemented and will continue to exist and be supported in NumPy. We won't support those mechanisms in SciPy though, since we are coercing unknown input types to ndarray and error out if that fails. The exception here is ufuncs in scipy.special, which happen to work already because we're reusing the numpy ufunc machinery there. We can probably leave that as is, since it's not problematic.
  • NumPy array subclasses are a long-standing problem, with masked arrays and np.matrix instances not being Liskov-substitutable (i.e., a subclass should be a drop-in replacement with no changes in behavior, only extensions), which makes it difficult to accept subclasses in general. We can explicitly start rejecting those two with clear exceptions; that will make regular subclasses a lot more useful.
  • Missing data support is a broad and tricky subject:
    • numpy.ma has tons of issues and isn't well-maintained. There's a full rewrite floating around that is ~90% complete and with some luck will make it into NumPy 2.0; however, it hasn't seen movement for several years, and finishing that work is not currently planned. numpy.ma in its current form should be considered legacy.
    • scipy.stats.mstats is the only module that specifically supports masked arrays.
    • scipy.stats has a nan_policy keyword in many functions that is well-maintained at this point, and has a documented specification. That is probably not applicable to other submodules though.
    • Support for missing data in other places like interpolate and spatial.KDTree (see gh-18230) may make sense, but ideally that'd use solid numpy support (e.g., via a null-aware dtype) and that does not exist.
    • Dataframe libraries have much better missing-data support, but using non-numpy-backed dataframes at all is a difficult story.
    • For the purposes of this proposal, I'd prefer leaving it at "reject numpy.ma.MaskedArray instances".

Problems & opportunities

The motivation for all the effort on interoperability is twofold: the current state of SciPy's behavior causes issues, and there's an opportunity/desire to gain a lot of performance by using other libraries (e.g., PyTorch, CuPy, JAX).

Problems include:

  • pandas input with non-numpy dtypes being converted to object arrays and then support being very hit-and-miss; see for example gh-18254
  • general pandas dataframe/series support being patchy; see for example gh-9616 and gh-12164
  • masked arrays giving incorrect results because the mask is silently dropped; see, e.g., gh-1682 and gh-12898
  • sparse array support being patchy; agreement that auto-densification needs to be avoided, but that means current behavior is inconsistent (xref gh-4239)
  • object arrays, including use of Decimal and other random things that people stuff into object arrays on purpose or (more likely) by accident, being handled very inconsistently.

Opportunities include:

  • Large performance gains. See for example mdhaber/scipy#63 (10x-50x for stats bootstrapping functionality with CuPy) and scikit-learn#25956 which shows how PyTorch improves over NumPy (2x-7x on CPU, 60x-100x on GPU)
  • interest in support for Dask, which we haven't really been able to do much with; see for example gh-10204, gh-8810 and gh-10087. Xarray is in the same boat, since it builds on top of Dask (see, e.g., gh-14824)

Appendix

Note: these topics are included for reference/context because they are related, but they are out of scope for this proposal. Please avoid diving into these (I suggest pinging me directly first in case you do see a reason to discuss these topics).

dataframe support

For dataframe library support, the situation is a little trickier. We have to think about pandas Series and DataFrame instances with numpy and non-numpy dtypes, the presence of nullable integers, and other dataframe types which may be completely backed by Apache Arrow, another non-numpy library (e.g., cuDF & CuPy), or have implemented things completely within their own library and may or may not have any NumPy compatibility.

This would be one option, which is somewhat similar to what scikit-learn does (except that scikit-learn converts nullable integers to float64 with NaNs):

| Input type | Return type | When | Behavior | Notes |
|------------|-------------|------|----------|-------|
| pandas.Series | pandas.Series | if dtype is a numpy dtype | raise TypeError otherwise | non-NumPy dtypes currently get coerced to object arrays; avoid that |
| pandas.DataFrame | pandas.DataFrame | if dtype is a numpy dtype | raise TypeError otherwise | |
| has __dataframe__ | same as input type | | | needs a small extension to the dataframe interchange protocol to reconstruct the input type |

There's a lot to learn from the effort scikit-learn went through to support pandas dataframes better. See, e.g., the scikit-learn 1.2 release highlights showing the set_output feature to request pandas dataframe as the return type.

Note: I'd like to not work this out in lots of detail here, because it will require time and that should not block progress on array library support. I just want to put it on the radar, because we do need to deal with it at some point; current treatment of dataframes is quite patchy.

Dispatching mechanism

For compiled code, other array types (whether CPU, GPU or distributed) are likely not going to work at all; the SciPy code is written for the NumPy C API. It's not impossible that some Cython code will work with other array types if those support the buffer protocol and the Cython code uses memoryviews - but that's uncommon (won't work at all on GPU, and PyTorch doesn't support the buffer protocol on CPU either).

There has been a lot of discussion on how this should work. The leading candidate is Uarray, which we already use in scipy.fft (as do matching fft APIs in CuPy and PyFFTW) and has other PRs pending in both SciPy and CuPy. However, there is also resistance to that because it involves a lot of complexity - perhaps too much. So significant work is needed to simplify that. Or switch to another mechanism. This is important work that has to be done, but I'd prefer not to mix that with this proposal.

Whatever the mechanism, it should work transparently such that scipy.xxx.yyy(x) where x is a non-numpy array should dispatch to the library which implements the array type of x.

We have a uarray label in the issue tracker. See gh-14353 for the tracker with completed and pending implementations. For more context and real-world examples, see:

  • https://labs.quansight.org/blog/pydata-extensibility-vision (and in particular the PyData dispatching system section which covers alternative dispatching options)
  • https://labs.quansight.org/blog/making-gpus-accessible-to-pydata-ecosystem-via-array-api
  • https://quansight-labs.github.io/array-api-demo/GW_Demo_Array_API.html
  • https://data-apis.org/array-api/latest/use_cases.html
  • https://github.com/numpy/archive/blob/main/other_meetings/2020-04-21-array-protocols-discussion-notes.md
  • Note that there's work to be done to write SPEC-0002 on this dispatching topic, see also Discourse posts on that: here, here and here.
  • Scikit-learn is working on a different kind of system for this, with explicit plugins to extend scikit-learn with support for other libraries/devices where it's using compiled code - see https://github.com/scikit-learn/scikit-learn/issues/22438.
  • PyTorch has a more robust backend system (https://pytorch.org/docs/stable/dynamo/custom-backends.html).
  • I'm sure there are more examples of how to enable this kind of thing, inside and outside of Python. It's just a lot of engineering effort.

rgommers avatar Apr 12 '23 16:04 rgommers

I read it over, sounds exciting. I have to admit that evaluating API designs on this scale is not one of my strengths. I'll briefly mention that the US National Labs will probably quietly test out their own array API compatible technologies behind the scenes as well, if this moves forward, to see if performance gains are observed on bleeding-edge supercomputer hardware. One question I have related to this--for nodes with multiple devices (like 4 x A100 is pretty common these days), do we leave the use of multiple devices up to the GPU library under the hood? Do we explicitly not encourage using kwargs for controlling concurrency that is specific to, say, GPU libs (i.e., too much risk of optional kwarg API bloat)?

You did briefly mention testing a few different CPU-based scenarios by default. I'm curious if we'd use a pytest decorator/magic to auto-expand eligible tests, and/or if it might be helpful to have, e.g., a testing env variable so that we can request particular "backends" for testing (for example, it could be useful to first iteratively develop/test CPU-only, then see what kind of mess arises when you switch to testing with GPU locally/on a cluster node, etc.). That's perhaps more of a detail.

One thing that may be more general re: testing is how we might handle bug reports that are effectively isolated to a subset of pass-throughs--only library X with hardware Y. I suppose the array API is supposed to prevent that from happening, but there is also reality... I guess maybe occasional shims and regression tests that only run when the conditions are right, assuming that we can't just consider these upstream issues most of the time. I will say that I don't see any way for this overall improvement to make our testing situation any simpler, unfortunately/obviously.

Maybe after the first few PRs go in, it will be easier for those of us who aren't familiar with every corner of those NEPs to start contributing since we might be able to understand the changes needed as basic engineering tasks.

I wonder if spatial will unfortunately be one of the least-benefiting modules--some of our most-used functionality relies heavily on carefully-written compiled backends, like Qhull, the C++ stuff driving KDTree, and the C++ backing much of the distance calculation code. I believe there are a few "Python/NumPy land" corners at least.

tylerjereddy avatar Apr 12 '23 21:04 tylerjereddy

What is the policy for handling mixed inputs when a function accepts more than one argument? This is probably covered in one of the array API documents or discussions, but I think we need it stated explicitly here, too. I expect the most common use-case is mixing $SOME_OTHER_ARRAY_TYPE and standard numpy arrays (e.g. multiplying a sparse matrix or sparse array by a numpy array, convolving some other array type with a kernel that is a numpy array, etc.).

WarrenWeckesser avatar Apr 12 '23 21:04 WarrenWeckesser

What is the policy for handling mixed inputs when a function accepts more than one argument? This is probably covered in one of the array API documents or discussions, but I think we need it stated explicitly here, too. I expect the most common use-case is mixing $SOME_OTHER_ARRAY_TYPE and standard numpy arrays (e.g. multiplying a sparse matrix or sparse array by a numpy array, convolving some other array type with a kernel that is a numpy array, etc.).

The array API spec says:

Non-goals for the API standard include: ...

  • Making it possible to mix multiple array libraries in function calls.

    Most array libraries do not know about other libraries, and the functions they implement may try to convert “foreign” input, or raise an exception. This behaviour is hard to specify; ensuring only a single array type is used is best left to the end user.

https://data-apis.org/array-api/latest/purpose_and_scope.html#out-of-scope

See also the interchange section about DLPack https://data-apis.org/array-api/latest/design_topics/data_interchange.html

asmeurer avatar Apr 13 '23 03:04 asmeurer

Thanks Ralf, for the detailed RFC; after reading it a couple of times, I have a few implementation-specific comments/questions/clarifications (some arising from the past work on the Gravitational waves demo).

It seems the array-api-compat library will be key.

This bridges any gaps between the standard and real-world implementations.

Based on my understanding, I'll try to expand on the above for everyone with a real-world example taken from scipy.signal and PyTorch (note: PyTorch is just an example; it may apply to any other array/tensor lib for some other case). Specifically, the lines below are used in scipy.signal.welch:

https://github.com/scipy/scipy/blob/c8206b80eb6b8d7ed1104b4e1b6587b6b194f0fc/scipy/signal/_spectral_py.py#L1748-L1752

Here np.result_type is used; consider the scenario where scipy.signal.welch is to be made array API compatible with, say, torch tensors (which we tried in this prototype demo). It would require the behaviour of torch.result_type to follow the array API spec. But the reality is that torch.result_type differs from how result_type is defined in the array API. The difference is that, as of today, PyTorch only allows two tensors, while the array API specifies an arbitrary number of arguments, each being an array and/or a dtype.

Now I'd expect that at some point PyTorch will update their implementation to support the array API for the functions that are still left, and there is probably effort around it already (see this result_type PR https://github.com/pytorch/pytorch/pull/61168).

But for now, array-api-compat is a brilliant intermediary to solve such problems. See how array-api-compat fixes this specific compatibility issue below by wrapping around the current torch.result_type signature and behaviour:

https://github.com/data-apis/array-api-compat/blob/9ef7f72fa91ddc4561148f906d386d08e65def7d/array_api_compat/torch/_aliases.py#L103
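
To illustrate the wrapping idea with a simplified sketch (this is not the actual array-api-compat code, and it handles only tensor arguments; the real wrapper also deals with dtype and Python scalar arguments):

import functools
import torch

def result_type(*tensors):
    # The array API's result_type takes any number of arrays and/or
    # dtypes; torch.result_type is binary. A compat shim can instead
    # fold torch.promote_types (which takes two dtypes) over the
    # input dtypes.
    if len(tensors) < 2:
        raise TypeError("at least two arguments are required")
    return functools.reduce(torch.promote_types, (t.dtype for t in tensors))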


Questions:

  1. Are we considering cases in SciPy which at first might not seem portable with the array API due to the usage of NumPy methods that are outside its scope, but where with slight refactoring (using an alternate NumPy method and making sure there's no performance regression in default behaviour), we might bring that particular SciPy method/class closer to the array API spec, or just closer to other array/tensor APIs? A close, though not exact, example is this one (https://github.com/scipy/scipy/issues/14491). So basically, would it be okay to slightly refactor the current SciPy implementation of a function from one NumPy method to another for the sake of achieving interoperability through this work? Of course, BC is still maintained.

  2. array-api-compat adds an extension on top of the array API spec. I'd like to know what happens to cases (if any; we will only find more such cases during this work) that are still outside array-api-compat. For example, NumPy and another array/tensor library might have the same signature and behaviour for a particular function. If such an np.<some_method> (which is still out of the scope of the array API) is used within a particular SciPy method/class, will that SciPy method still be marked as incompatible with this work and be completely ignored for now? Even though it would be technically possible to keep interoperability in such cases. One example is np.iscomplex or cupy.iscomplex (see use in https://github.com/scipy/scipy/blob/main/scipy/signal/_signaltools.py#L4546). I guess there will be other use cases of such functions, which might block the compatibility of a whole SciPy module/method just because that one NumPy method is still outside of the array API, even though it has an equivalent in other array/tensor libraries.

  3. Is there already an identified list of targeted submodules or methods in SciPy that can be made to work with array-api-compat and that we will focus on initially? What if a submodule is largely not compatible with Array API (and needs extra infra like dispatching with uarray etc.), but maybe a few methods/classes inside the submodule are compatible? Do we make only those methods/classes compatible in the largely incompatible submodule or completely ignore the whole submodule in such a case?

Apologies if my questions are a bit too specific with implementation details and are special examples, and if this RFC is about discussing the high-level design, feel free to ignore it.

Overall, excited that all of this work is coming together, and the current state looks pretty solid to me to start baking in this functionality in the "array consumer libraries" (scipy, sklearn, etc.). Thanks for setting up the foundation tools like array-api-compat!

AnirudhDagar avatar Apr 13 '23 09:04 AnirudhDagar

Thanks for the feedback so far! Let me try to answer the high-level questions first.

for nodes with multiple devices (like 4 x A100 is pretty common these days), do we leave the use of multiple devices up to the GPU library under the hood? Do we explicitly not encourage using kwargs for controlling concurrency that is specific to, say, GPU libs

Yes, we should not carry anything specific to the execution model within SciPy code. That's a key design principle of the array API standard. You could use a multithreaded CPU library, a GPU library, or something even fancier like what you have here. The code should be unchanged. There's quite a bit of variation between different distributed libraries in particular, and that's out of scope to deal with I'd say.

I'm curious if we'd use a pytest decorator/magic to auto-expand eligible tests

Yes, that's what I'd like to aim for. This has worked pretty well for the array API test suite; a single switch allows you to select any compatible libraries. I imagine we'd have that as a decorator for relevant tests, or some such thing.

I wonder if spatial will unfortunately be one of the least-benefiting modules

Probably. I imagine modules like stats and signal to be prime candidates early on.

What is the policy for handling mixed inputs when a function accepts more than one argument? This is probably covered in one of the array API documents or discussions, but I think we need it stated explicitly here, too.

Good point, I'll add it to the issue description. @asmeurer gave a good answer already. I'd add that numpy arrays + sequences will work unchanged; non-numpy arrays + sequences should error; and in general it's a bad idea to mix array objects coming from different libraries. Rather, get a clear exception there and have the user convert one kind of array to the other, to get the behavior they intended. That's how all libraries except NumPy work, and what NumPy does is an endless source of bugs and user confusion.

I expect the most common use-case is mixing $SOME_OTHER_ARRAY_TYPE and standard numpy arrays (e.g. multiplying a sparse matrix or sparse array by a numpy array, convolving some other array type with a kernel that is a numpy array, etc.).

The sparse-dense mix is the only one that in principle makes sense. I think that could be allowed in the future, but I'm not exactly sure under what circumstances. Right now all sparse array types forbid auto-densification, so where you can mix sparse-dense is quite restricted. PyTorch has the best story here, because it natively provides both dense and sparse tensors - but even there it's still a work in progress.

rgommers avatar Apr 14 '23 01:04 rgommers

So basically, would it be okay to slightly refactor the current SciPy implementation of a function using some NumPy method to some other NumPy method for the sake of achieving interoperability through this work?

Yes indeed, that is expected in many cases I'd say.

2. If such an np.<some_method> (which is still out of the scope of the array API) is used within a particular SciPy method/class, will that SciPy method still be marked as incompatible with this work and be completely ignored for now?

I'd say "it depends". If it's simple to reimplement np.xxx with other functions that are part of the standard, then that can be done (special-casing numpy arrays for performance even if that makes a big difference). If it's not simple to do that, then we have a gap. If it's a big gap (hopefully won't happen often) then we should propose it for inclusion in the standard, and once that proposal is accepted implement it in array-api-compat. If it's niche functionality, then we have to deal with that on a case-by-case basis.

3. Is there already an identified list of targeted submodules or methods in SciPy that can be made to work with array-api-compat and that we will focus on initially? What if a submodule is largely not compatible with Array API (and needs extra infra like dispatching with uarray etc.), but maybe a few methods/classes inside the submodule are compatible? Do we make only those methods/classes compatible in the largely incompatible submodule or completely ignore the whole submodule in such a case?

I'd prefer for coherent sets of functionality to be upgraded at once. To stay with Tyler's example of spatial: if 90% of functionality will only work with NumPy, then there's not much point using the array API standard. In those cases we can still use the central infrastructure I proposed for input validation though (e.g., reject object and masked arrays) to gain consistency - there's nothing blocking that.

rgommers avatar Apr 14 '23 01:04 rgommers

That's how all libraries except NumPy work, and what NumPy does is an endless source of bugs and user confusion.

Thanks for the proposal! I doubt it's urgent to figure out (see end), but I need to push back a little on the promotion/mixing story here.

The big issues are really not about mixing/promotion. They are about NumPy thinking that:

  • any arbitrary object that doesn't shout "no" should be promoted to a NumPy array. (Even nested lists/sequences that contain objects which do shout "no". That last part could be something to allow changing, even 🤔)
  • we/libraries often call np.asarray(), which coerces and doesn't give proper promotion a chance to begin with (but you would stop doing this here!). (Proper promotion would mean e.g. JAX decides, not NumPy. The __duckarray__ proposal would actually fix it, and maybe that could even still be relevant in some form as a simple improvement for __array_function__. But that is not important here.)

Unlike NumPy, most array implementers do not pretend that it's OK to promote almost everything to their array type.

But promotion/mixing is a different thing. Many do promote: JAX (as far as I can tell), dask.array, sparse arrays, masked arrays, quantities all promote/mix great with NumPy arrays. Also subclasses with their superclass, but if they don't customize the super class namespace, they are actually supported in the current state.

sparse/dense

I admit sparse/dense may be trickier. But I can see that for many things it either makes perfect sense or can be trusted to raise a good error. Some ops densify, some ops don't, and those where it's not clear raise. The main way this could end up problematic is if subtle code changes might somehow end up changing the result type. I doubt that is the case, but I don't know. My feeling is that this is something that sparse arrays need to figure out.

Is supporting limited promotion important enough to be needed? I don't know. I tend to think the answer is yes, and I somewhat suspect e.g. Dask users may well be annoyed if it's missing (I am assuming Dask has a good reason to support it very well, but am trying to gauge it better. Would post the result on the array-api issue if there is a clearer one).

Overall my angle/answer would be: fortunately, it should not be a big mess to add simple promotion/mixing support here. (I am confident about that, but I think it's important to clarify since it is not obvious.) So probably it need not be an urgent thing to figure this out. But it is an interesting question that I am sure will come up again, and I expect it may well prove important enough sooner or later.

seberg avatar Apr 14 '23 13:04 seberg

Thanks for your thoughts Sebastian!

(Even nested lists/sequences that contain objects which do shout "no". That last part could be something to allow changing, even 🤔)

yes please :) I think our NumPy 2.0 plate is pretty full already, so I'm hesitant to invest a lot of time now - but if there are easy wins there to avoid promote-to-object-array issues, let's try and take them.

Proper promotion would mean e.g. JAX decides, not NumPy

This is not really "promotion" in any proper sense. JAX and NumPy arrays are two different types of arrays that are unrelated by class hierarchy. There happens to be an ordering though, through __array_priority__. I have to admit I'm not certain how future-proof that is. Best practice is clearly to explicitly convert to the desired kind here, mixing them is not great. But it may work. What my proposal is trying to say is to reject things that aren't related, rather than blindly stuffing them into np.asarray. It should be more precise in what relations are allowed/supported - I had not thought of __array_priority__.

I've worked on quite a few PyTorch issues with the same problem. PyTorch tensors and NumPy arrays are unrelated, so things like x + t working but t + x erroring out (or vice versa) is quite confusing. It should just error.

But promotion/mixing is a different thing. Many do promote: JAX (as far as I can tell), dask.array, sparse arrays, masked arrays, quantities all promote/mix great with NumPy arrays.

Again, only when there is a clear relation. scipy.sparse and pydata/sparse mostly errors out when mixed with ndarray. Masked arrays do work because they're a subclass (modulo silently dropping the mask of course). Some quantities objects work well with numpy because they're a subclass (see, e.g. AstroPy's Quantity) or because they use things like __array_function__. Those cases will continue to work. In the absence of that well-defined relationship, things will not work. So you cannot and should not mix Quantity with JAX/PyTorch/CuPy/sparse.

Overall my angle/answer would be: Fortunately, it should not be a big mess to add simple promotion/mixing support here.

If this is well-defined, I agree. We just need to be very explicit what the mechanisms are, and they should be introspectable. I want to avoid "just try it and see what rolls out, YMMV".

rgommers avatar Apr 14 '23 14:04 rgommers

Collapsing everything into details, because... It doesn't matter much for the RFC, but I can't leave some of these things unanswered.

(We could get hung up and say I should use "implicit conversion" like the C standard, making the argument about implicit/explicit conversion – but Julia uses "promotion" the way I do as far as I understand, so I will change my habit when someone asks me to or points out why it's very wrong.)

but if there are easy wins there to avoid promote-to-object-array issues, let's try and take them.

This is not about promote-to-object-array issues. It's about JAX saying it's higher priority than NumPy, but NumPy ignoring that when you write numpy_arr + [jax_array]. Maybe also about a way to write np.asarray(jax_array) that raises an error because JAX says it's the higher priority.

This is not really "promotion" in any proper sense

Aha? JAX and NumPy arrays are two different types of arrays that are unrelated by class hierarchy. Just as: int and float are two different types of numbers that are unrelated by class hierarchy.

Unless you also say 3 + 3.0 = 6.0 is not promotion, in that case, I agree.

So yes, there is an (abstract) base class (array and number). There is a clear expected hierarchy everyone agrees on and expects (JAX > NumPy, Dask > NumPy and float > int). And neither of these conversions when mixing the two is perfectly "safe" in some sense, but implicit int -> float conversions are still universally useful, and numpy -> dask.array/JAX/masked array implicit conversions also seem useful (otherwise Dask would not do it, no?).

Sure "array" is a bit of a more complicated then "number".

There happens to be an ordering though, through __array_priority__

__array_priority__ is more of a remnant. The real thing is __array_ufunc__ which can just be None to tell NumPy: "No". It isn't relevant here.

I've worked on quite a few PyTorch issues with the same problem. PyTorch tensors and NumPy arrays are unrelated, so things like x + t working but t + x erroring out (or vice versa) is quite confusing. It should just error.

Fine, it's confusing and probably has things that need improving. But also, __array_ufunc__ = None should guarantee you an error. Yeah, it seems like it doesn't end up with a great error, but you could probably improve on that in your __add__ or by implementing a dummy __array_ufunc__.

But I don't even think that matters. If a type is comfortable with implementing __array_ufunc__ such as Dask (or even just the operators like JAX), then that type can mix with NumPy arrays fine in practice (not with np.<func>, but we know we won't use those here).

And there is no problem with allowing such a type to signal that it wants to do just that. That still allows Torch to signal that it doesn't want to mix!

So is it annoyingly hard to tell NumPy "no"? Yes, and maybe that needs to be improved. But frankly, I think it is misleading to use those difficulties as an argument that mixing is generally pretty much doomed to fail.

We have many examples of successful implementations that mix fine. Like dask, masked arrays, or quantities...

What my proposal is trying to say is to reject things that aren't related, rather than blindly stuffing them into np.asarray

But because you stop stuffing things blindly into np.asarray() you suddenly can start reasoning about their relation. The problem is that right now we break any relation by explicit conversion via np.asarray().

Take np.asarray() out of the equation, and Dask and NumPy already have a well defined relation and mix great!

Masked arrays do work because they're a subclass

No, I would be very surprised if Allan's masked array is a subclass and I am sure it has better behavior than np.ma and mixes great with NumPy arrays.

The important point is that it wants to work. That can just as well be by implementing __array_ufunc__ (or even only operators). Allowing interop doesn't mean that np.func(my_object) needs to work (although mylib.func(numpy_array) must).

If this is well-defined, I agree. We just need to be very explicit what the mechanisms are, and they should be introspectable. I want to avoid "just try it and see what rolls out, YMMV".

This is a reason, unlike the "it's a mess" argument, IMO. We do have quite a bit of prior art, since every other protocol defines a mechanism. It still needs care of course (there are small variations after all). For me the main point is conservative guidance similar to, and in addition to, NEP 18, though.

But at that point it's: we still need to figure that out because we don't want to get it wrong. And not "it's a bad idea".

seberg avatar Apr 14 '23 17:04 seberg

But at that point it's: we still need to figure that out because we don't want to get it wrong. And not "it's a bad idea".

Maybe we're not disagreeing all that much. I said briefly "it's a bad idea" because (a) it's apparently so complex that after all these years we indeed still haven't figured it out, and (b) the gains in many cases amount to a minor number of characters saved on explicit conversions in the user's code. So my personal assessment is that this isn't worth the effort. Others may disagree of course, and spend that effort.

I think the appropriate place for discussing this topic further is either on the numpy repo or on pydata/duck-array-discussion, which was/is an effort specifically to sort out this hierarchy.

It doesn't matter much for the RFC

I agree. We're not closing any doors in this RFC, it can in principle be enabled in a future extension.

rgommers avatar Apr 19 '23 05:04 rgommers

A few assorted questions/comments:

  1. Would it make sense to introduce the asarray replacement first and eagerly warn about conversions whose behavior would change under this RFC?
  2. For testing, is the idea that you run the test suite multiple times with different ARRAY_API_TEST_MODULE flags? My first thought was this should be a pytest fixture where xp is an explicit argument to the test (see the sketch after this list). This would allow multiple versions to be tested in a single pytest call. The behavior of this fixture could also be configurable via environment flags.
  3. Is there a policy for introducing new compiled code? Do we need to preserve the historical array-API-compatible version? Otherwise it would be a bc-breaking change for non-CPU arrays.
  4. I agree that mixing array libraries shouldn't be a priority, however I wonder how multiple array types from the same library would be handled, assuming they use the same namespace?
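
A minimal sketch of the fixture idea from point 2 (the SCIPY_ARRAY_API_BACKENDS variable name is made up for illustration):

import os
import pytest

# Hypothetical env var listing the namespaces to test against
_backends = os.environ.get("SCIPY_ARRAY_API_BACKENDS", "numpy").split(",")

@pytest.fixture(params=_backends)
def xp(request):
    # Each test requesting `xp` runs once per configured backend,
    # all within a single pytest invocation.
    if request.param == "numpy":
        import numpy
        return numpy
    # skip (rather than fail) when an optional backend isn't installed
    return pytest.importorskip(request.param)

def test_type_preserved(xp):
    # Example test taking xp explicitly.
    x = xp.asarray([1.0, 2.0, 3.0])
    assert type(x + x) is type(x)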

peterbell10 avatar Apr 20 '23 15:04 peterbell10

Thanks @peterbell10, all good questions.

  1. Yes, perhaps. As long as there's a way to opt in to the new behaviour easily to make those warnings go away (maybe with an environment variable?). It'd be annoying to have to change all your code away from, say, PyTorch CPU tensors as input by explicit conversion to NumPy arrays, and then back once the new behavior arrives. It depends a bit on timing - release cycles are pretty slow. Will have to think about this one a bit more.

  2. That was the idea. I do like fancier things too - don't know enough about fixtures, but I assume explicit xp is good practice. I do have in mind to run the pytorch tests only in 1-2 separate CI jobs, and not double the test time everywhere.

  3. Great point. This policy should be written down as part of this RFC (and then in the contributor guide). And yes, I do think that is "keep the pure Python path".

  4. We'll have to play with that. We may just simply retrieve xp from the first array input, and then let that library deal with rejecting mixed arrays. In general when you get two unknown array types that don't have a class hierarchy, it's not really possible to know whether they work together or not. Once you have a fixed library like pytorch, you know how to check, but for unknown libraries I'm not sure. This is the trickiest question to answer though, it'll require a bit of prototyping.
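
For the single-namespace part, a small sketch using array-api-compat's array_namespace helper, which already raises TypeError when the inputs come from different libraries:

from array_api_compat import array_namespace

def some_scipy_func(x, y):
    # Raises TypeError for mixed array libraries; otherwise returns
    # the (compat-wrapped) namespace shared by x and y.
    xp = array_namespace(x, y)
    return xp.sqrt(x * x + y * y)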

rgommers avatar Apr 20 '23 22:04 rgommers

Let me get a bit more specific now. I started studying the demo work here: https://quansight-labs.github.io/array-api-demo/GW_Demo_Array_API.html

scipy.signal.welch seemed to be a well-behaved target for prototyping based on that so I tried doing that from scratch on a feature branch locally, progressively allowing more welch() tests to accept CuPy arrays when an env variable is set.

Even for this "well-behaved"/reasonable target, I ran into problems pretty early on. For example, both in the original feature branch and in my branch, there doesn't seem to be an elegant solution for handling numpy.lib.stride_tricks.as_strided. The docs for that function don't even recommend using it, and yet CuPy (and apparently torch, from the Quansight example) do provide it, outside of the array API standard proper.

So, I guess my first real-world experience makes me wonder what our policy on special casing in these scenarios will be--ideally, I'd like to just remove the usage of as_strided() and substitute it with some other approach that doesn't require conditionals on the exact array type/namespace. While this is a rather specific blocker, if I encounter something like this even for a "well behaved" case, I can imagine quite a few headaches for the cases that are not well behaved.

Anyway, maybe I'll get an idea of what we want to do in general based on reactions to this specific case. I assume we could move much faster if we just accept out-of-array-API common shims by checking the array types like Anirudh and Ralf did here: https://github.com/AnirudhDagar/scipy/blob/array-api-demo/scipy/signal/spectral.py#L2007 in a sort of "perfection is the enemy of done" approach, though this is obviously not quite perfect.

Something like this is clearly out of scope for array-api-compat. If there are going to be a few examples of non-API standard but tricky to work around things, perhaps the recommended work-arounds/shims might be collected somewhere (or maybe scikit-learn has already done a replacement of this nature more recently).

tylerjereddy avatar Apr 25 '23 19:04 tylerjereddy

Thanks @tylerjereddy, that's a very good question. I think that indeed this is something that will come up regularly, and the specific cases at hand may determine how fast things can be implemented and whether the end result is more/less maintainable or more/less performant.

Every case will be a bit different, and in some a rewrite will help, in some it may lead to an extra code path or some such thing. Let's have a look at this case - this is the code in question:

    # Created strided array of data segments
    if nperseg == 1 and noverlap == 0:
        result = x[..., np.newaxis]
    else:
        # https://stackoverflow.com/a/5568169
        step = nperseg - noverlap
        shape = x.shape[:-1]+((x.shape[-1]-noverlap)//step, nperseg)
        strides = x.strides[:-1]+(step*x.strides[-1], x.strides[-1])
        result = np.lib.stride_tricks.as_strided(x, shape=shape,
                                                 strides=strides)

The as_strided usage is there to save memory. It's actually pretty difficult to understand what the code does exactly though, so it's good for performance but not great for maintainability. If you check what the code does, you find that it constructs a result array that is typically ~2x larger than the input array, sometimes more. In case of the example in the welch docstring, it's a (100_000,)-shaped input and an intermediate array of (194, 1024) - so 2x larger. However, the very next line in that function is:

result = detrend_func(result)

So now result is a full copy anyway, because the detrending is out-of-place. After that there's more function calls with similar overwriting of result. Hence, the actual peak memory used is very similar to the case where you wouldn't use as_strided but replace it with

result2 = np.empty(shape, dtype=x.dtype)
for ii in range(shape[0]):
    result2[ii, :] = x[ii*step:(ii*step + nperseg)]

Now, to complicate matters a little, when we use empty, we also want/need to add device support to support GPUs:

# This is temporary, once all libraries have device support, this check is no longer needed
if hasattr(x, 'device'):
    result = xp.empty(shape, dtype=x.dtype, device=x.device)
else:
    result = xp.empty(shape, dtype=x.dtype)

for ii in range(shape[0]):
    result[ii, :] = x[ii*step:(ii*step + nperseg)]

At that point, we can say we're happy with more readability at the cost of some performance (TBD if the for-loop matters; my guess is it will). Or, we just keep the special-casing for numpy using as_strided. At that point, we do have two code paths, but no performance loss and also easier-to-understand code.

If there are going to be a few examples of non-API standard but tricky to work around things, perhaps the recommended work-arounds/shims might be collected somewhere

Agreed, that will naturally emerge I think.

rgommers avatar Apr 25 '23 20:04 rgommers

# This is temporary, once all libraries have device support, this check is no longer needed
if hasattr(x, 'device'):

FWIW, array-api-compat has a device() helper function for exactly this.
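
For example (a sketch, reusing the variables from the snippet above):

from array_api_compat import device

# device(x) returns x.device where it exists, and a CPU stand-in for
# plain NumPy arrays (which have no .device attribute); xp is assumed
# to be the compat-wrapped namespace, so empty() accepts the device
# keyword.
result = xp.empty(shape, dtype=x.dtype, device=device(x))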

asmeurer avatar Apr 25 '23 20:04 asmeurer

Thanks @tylerjereddy; as Ralf said, I'm pretty sure these will have to be dealt with case by case, and hence in the previous comment above (Question 1) I wanted to know exactly what our policy on such cases will be. And I guess the answer is, again, dealing with them case by case. It would be very hard to know all such cases before starting to work on the transition.

Regarding the demo, at the time of its development, our goal was to get a prototype version using Array API out (which is not perfect). The way it's done in my feature branch, special casing, is not ideal, and in such cases, it would come down to the possibility of refactoring the code to achieve the same using methods compliant with Array API as shown above by Ralf.

@tylerjereddy there is this other demo that was developed by @aktech, and might be of interest too: https://labs.quansight.org/blog/making-gpus-accessible-to-pydata-ecosystem-via-array-api

If there are going to be a few examples of non-API standard but tricky to work around things, perhaps the recommended work-arounds/shims might be collected somewhere

This is a good idea, not only for SciPy; I expect similar issues/problems will arise with other array-consuming libraries (e.g. sklearn), and having a place to collect all such examples would make development more streamlined and consistent across the libraries. But again, this depends on how many such examples we will encounter.

AnirudhDagar avatar Apr 25 '23 22:04 AnirudhDagar

It sounds like we propose to require libraries to have a complete FFT API implementation--curiously, avoiding the usage of np.fft like this on my branch fixes a test failure:

diff --git a/scipy/signal/_spectral_py.py b/scipy/signal/_spectral_py.py
index befe9d3bf..ea1f6eb27 100644
--- a/scipy/signal/_spectral_py.py
+++ b/scipy/signal/_spectral_py.py
@@ -1986,10 +1986,10 @@ def _fft_helper(x, win, detrend_func, nperseg, noverlap, nfft, sides):
 
     # Perform the fft. Acts on last axis by default. Zero-pads automatically
     if sides == 'twosided':
-        func = xp.fft.fft
+        func = sp_fft.fft
     else:
         result = result.real
-        func = xp.fft.rfft
+        func = sp_fft.rfft
     result = func(result, n=nfft)

ARR_TST_BACKEND=numpy python dev.py test -j 10 -t scipy/signal/tests/test_spectral.py

======= FAILURES =======
_______ TestSTFT.test_roundtrip_scaling _______
[gw4] linux -- Python 3.10.6 /home/tyler/python_310_scipy_dev_work/bin/python
scipy/signal/tests/test_spectral.py:1614: in test_roundtrip_scaling
    assert_allclose(Zp[:129, :], Zp0)
        X          = array([   0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,
          0.+0.j,    0.+0.j,    0.+0.j,    0....+0.j,
          0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,    0.+0.j,
          0.+0.j,    0.+0.j,    0.+0.j])
        Zp         = array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         0.00000000e+00+0.00000000e+00j, .......,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         2.04124145e-01-5.01096650e-03j]])
        Zp0        = array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         0.00000000e+00+0.00000000e+00j, .......,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
        -2.04124145e-01+0.00000000e+00j]])
        Zp1        = array([[-7.25194643e-16+5.66558315e-17j, -7.25194643e-16+5.66558315e-17j,
        -7.25194643e-16+5.66558315e-17j, .......,
        -7.25194643e-16-5.66558315e-17j, -7.25194643e-16-5.66558315e-17j,
         2.04124145e-01-5.01096650e-03j]])
        Zs         = array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
         0.00000000e+00+0.00000000e+00j, .......,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
        -1.56250000e-02+0.00000000e+00j]])
        power_x    = 2.0
        psd_Zp     = array([2., 2., 2., 2., 2., 2., 2., 2., 2.])
        self       = <scipy.signal.tests.test_spectral.TestSTFT object at 0x7f0e72850b20>
        x          = array([ 2.,  0., -2., ...,  0., -2.,  0.])
        x1         = array([ 2.-7.07631847e-34j,  0.+0.00000000e+00j, -2.+1.51143410e-32j, ...,
        0.+0.00000000e+00j, -2.+7.36976637e-17j,  0.+0.00000000e+00j])
/usr/lib/python3.10/contextlib.py:79: in inner
    return func(*args, **kwds)
E   AssertionError: 
E   Not equal to tolerance rtol=1e-07, atol=0
E   
E   Mismatched elements: 496 / 1161 (42.7%)
E   Max absolute difference: 8.88611995e-16
E   Max relative difference: 15.1575437
E    x: array([[ 0.000000e+00+0.000000e+00j,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, ...,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, -2.041241e-01+0.000000e+00j],...
E    y: array([[ 0.000000e+00+0.000000e+00j,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, ...,  0.000000e+00+0.000000e+00j,
E            0.000000e+00+0.000000e+00j, -2.041241e-01+0.000000e+00j],...
        args       = (<function assert_allclose.<locals>.compare at 0x7f0e727f1120>, array([[ 0.00000000e+00+0.00000000e+00j,  0.00000000e+...,
         0.00000000e+00+0.00000000e+00j,  0.00000000e+00+0.00000000e+00j,
        -2.04124145e-01+0.00000000e+00j]]))
        func       = <function assert_array_compare at 0x7f0f195de200>
        kwds       = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=1e-07, atol=0', 'verbose': True}
        self       = <contextlib._GeneratorContextManager object at 0x7f0f195e5de0>


Also, for stuff like assert_allclose(), do we want to assume that both arguments have been copied back to the host? It looks like CuPy has its own assert_allclose() as well, though we probably can't rely on that for libs in general. There is a to_device() adapter in the compat library, but not a to_host(), I don't think. For CuPy there is stuff like .get() for an explicit pull-back to host, but it is intentionally designed not to copy DtoH automatically/implicitly and will error out if you try.

tylerjereddy avatar Apr 27 '23 23:04 tylerjereddy

Brief follow-up: that's reproducible on main as well, if you swap in the NumPy fft variants on those two lines. Does it merit its own issue, or is it nothing particularly surprising for the FFT gurus? I don't see anything searching for that test on our tracker.

tylerjereddy avatar Apr 27 '23 23:04 tylerjereddy

Brief follow-up: that's reproducible on main as well, if you swap in the NumPy fft variants on those two lines. Does it merit its own issue, or is it nothing particularly surprising for the FFT gurus?

There will be small differences in precision, because the scipy.fft implementation is C++ code that uses SIMD (and hence is faster), while the numpy.fft code is plain C. We'd like to update NumPy to use the same C++ code as SciPy, but that's work. If there's a test failure somewhere, that is business as usual and unrelated to this RFC I think.

rgommers avatar Apr 28 '23 13:04 rgommers

Also, for stuff like assert_allclose() do we want to assume that both arguments have been copied back to the host? It looks like CuPy has its own assert_allclose() as well, though we probably can't rely on that for libs in general. There is a to_device() adapter in the compat library, but not to_host() I don't think.

You should be able to use to_device('cpu') instead of to_host I'd think. Other than that: testing for specific libraries, especially GPU, requires a bit more thought. We do not require testing functions to be present within the array API standard, so I imagine we'll be using something like our own assert_allclose that switches explicitly to CuPy's assert_allclose when the input is a CuPy array.
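
A rough sketch of what such a helper could look like (the name xp_assert_allclose is a placeholder):

import numpy as np

def xp_assert_allclose(actual, desired, rtol=1e-7, atol=0):
    # CuPy arrays refuse implicit conversion via np.asarray, so use
    # CuPy's own tester (which handles the device-to-host transfer
    # explicitly); otherwise fall back to NumPy's assert_allclose.
    try:
        import cupy
    except ImportError:
        cupy = None
    if cupy is not None and isinstance(actual, cupy.ndarray):
        cupy.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol)
    else:
        np.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol)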

rgommers avatar Apr 28 '23 13:04 rgommers

container type in == container type out

I think this will clutter the SciPy code with endless container type conversions in every exported function or method. It will contribute to code rot and also make things slower.

Basically I think users should be responsible for their own container types.

sturlamolden avatar Apr 29 '23 02:04 sturlamolden

The diff for my draft branch for enabling CuPy/NumPy swapping for welch() is over at https://github.com/tylerjereddy/scipy/pull/70. I suspect the main thing of interest might be the testing shims to enforce container type in == container type out and to allow swapping between NumPy/CuPy in the tests with an env var. I found no evidence of performance improvements for simple uses with CuPy, but finding the right problem (and problem size) may be best left to the domain experts.

One thing I found myself wanting was a universal/portable to_device(x, "cpu") for testing purposes, as noted above. It sounds like it isn't clear what we'll do upstream for that just yet--array-api-compat temporarily merged my shim, but because this changes the array namespace when moving to host (NumPy) with CuPy but not, e.g., torch (which has genuine CPU and GPU tensors), things are a bit murky...

tylerjereddy avatar May 02 '23 18:05 tylerjereddy

I think this will clutter the SciPy code with endless container type conversions in every exported function or method. It will contribute to code rot and also make things slower.

I don't think that will be the case. It's possible there will be a few more type checks, but no more (or even fewer) type conversions. A few examples:

  • numpy.ndarray: unchanged, no conversions now or in the future
  • numpy.ndarray subclasses: now converted to ndarray, afterwards kept unchanged
  • cupy.ndarray and torch.Tensor (gpu): now not accepted, in the future accepted without conversion where possible
  • torch.Tensor (cpu): now converted to numpy.ndarray, in the future (a) unchanged where we have pure Python code and (b) converted once internally and then back in case we hit compiled code.
    • I'll note that it will be ensured that this happens once for case (b), and not a to-from conversion in multiple places when one scipy function internally calls another scipy function

I found no evidence of performance improvements for simple uses with CuPy, but finding the right problem (and problem size) may be best left to the domain experts.

Thanks for sharing your progress on trying this out Tyler! Regarding this performance expectation: typically CuPy will be slower than NumPy for small arrays (due to overhead of data transfer and starting a CUDA kernel) and there will be a cross-over point beyond which CuPy gets faster. Typically that's in the 1e4 to 1e5 elements range. And beyond 1e6 or 1e7 you'll see a roughly constant speedup, the magnitude of which will depend on the details but typically in the 5x-50x range.

rgommers avatar May 03 '23 07:05 rgommers

For this small script on that branch:

import sys
import time
import numpy as np
import cupy as cp
from scipy.signal import welch


def main(namespace):
    size = 70_000_000
    if namespace == "numpy":
        x = np.zeros(size)
    elif namespace == "cupy":
        x = cp.zeros(size)
    x[0] = 1
    x[8] = 1
    f, p = welch(x, nperseg=8)
    print("f:", f)
    print("p:", p)


if __name__ == "__main__":
    start = time.perf_counter()
    main(namespace=sys.argv[1])
    end = time.perf_counter()
    print("Elapsed Time (s):", end - start)
  • 7e6 points:
    • NumPy: 1.42 s
    • CuPy: 12.52 s
  • 7e7 points:
    • NumPy: 13.84 s
    • CuPy: 126.54 s

I don't know that we expect that particular incantation to get faster though. Could do some profiling or check the device copy activity with nsight-systems I suppose, but the array type pass-through/preservation tests do pass. I don't know that we're guaranteed to avoid performance issues with devices; it depends on the algorithm and on needing a genuine amount of work to do on the device.

Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz vs NVIDIA GeForce GTX 1080 Ti

tylerjereddy avatar May 03 '23 16:05 tylerjereddy

Hmm, I'm not sure how to interpret that; there may be an issue like an unexpected device transfer, or it's the for-loop.

rgommers avatar May 03 '23 16:05 rgommers

For 7e6 pts, with NumPy and line profiling on, 4.4 seconds (4.6 s total) is spent on _fft_helper, with 90% of the time on the for-loop shim around as_strided. With CuPy profiling, 17.1 seconds are spent in _fft_helper, with >90% on the for-loop shim. So I guess it appears that 1.7 million memory assignments of a CuPy array to a slice of a CuPy array are not competitive with the same slice/memory assignment on a NumPy host array; i.e., this is more expensive on the device: result[ii, :] = x[ii * step: (ii * step + nperseg)]. result and x are both class 'cupy.ndarray'.

GPUs have smaller memory caches near their streaming multiprocessors, right? Maybe that's why--because you need to move the memory farther from the "chip"?

tylerjereddy avatar May 03 '23 17:05 tylerjereddy

See what you get if you special-case CuPy and use cupy.lib.stride_tricks.as_strided?
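
Something along these lines, say (a sketch; passing the xp namespace in explicitly is an assumption):

def _as_strided(x, shape, strides, xp):
    # as_strided is outside the array API standard, so special-case
    # the libraries known to provide an in-house version.
    if xp.__name__.startswith("cupy"):
        from cupy.lib.stride_tricks import as_strided
    else:
        from numpy.lib.stride_tricks import as_strided
    return as_strided(x, shape=shape, strides=strides)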

rgommers avatar May 03 '23 17:05 rgommers

CuPy is decisively faster with its in-house as_strided (and tests still pass) for 7e7 points, about 10x faster than NumPy on the benchmark. So I guess the question becomes how high a price we're willing to pay for array API conformant pass-throughs vs. special casing. I'm guessing that order-of-magnitude+ slowdowns may cross some kind of practicality line.

tylerjereddy avatar May 03 '23 18:05 tylerjereddy

I'm guessing that order-of-magnitude+ slowdowns may cross some kind of practicality line.

Yes, that sounds right to me. Also because the special-casing is cheap, so if that makes it go from 10x slower to 10x faster, that's probably not a hard decision. I'd lean that way, and then after doing development with these kind of one-off decisions behind a global flag for a while, assess whether we like the overall look of the special cases and where the lines are for this sort of thing.

rgommers avatar May 03 '23 18:05 rgommers

To me Tyler's experience is quite common for code porting from CPU to GPU, way before Array API was a thing (e.g. CuPy offers a tutorial for writing CPU/GPU agnostic code), just ask any GPU Hackathon participants in the DOE space 🙂

So adding on top of Ralf's feedback, the most interesting question I'd ask is: if you find a 10x faster approach for CuPy, does it also help boost the existing NumPy code? (Perhaps due to vectorization or something else that NumPy also happens to be good at.) If so, then there's no need for special-casing; it just means the original code is slow everywhere.

leofang avatar May 03 '23 18:05 leofang