
Use Cython, Numba, or C/C++ for algorithmic code

mrocklin opened this issue on Mar 04 '18 · 93 comments

It seems likely that we'll want the ability to write fast numeric code in a low-level-ish language. There are a few options:

  1. Cython
  2. Numba
  3. C/C++, wrapped with a variety of other options

We should make a decision here before investing serious time in one path or the other.

There are, I think, a few categories of concerns:

  1. Speed
  2. Community adoptability (as a dependency)
  3. Barriers to attracting new developers
  4. Barriers to maintenance

mrocklin avatar Mar 04 '18 14:03 mrocklin

My personal thoughts

The default today seems to be Cython. This would be a fine choice. It seems to be fine for speed and community adoptability. I do think that it imposes a serious barrier to attracting new developers (I'd say 80% of the candidate pool will avoid Cython), and it will force a more complex build-and-deploy process. For example, we can no longer just point people to try things on master to get quick feedback. We'll get questions from people on how to install from source on Windows, etc.

C/C++ is, I think, a little cleaner than Cython in terms of attracting new developers. I think that C tends to be more often within people's comfort zone.

Numba is great for new developers and maintenance, but has issues in community adoptability (folks aren't accustomed to depending on it, thoughts here). Numba also has issues if we're going to be doing a lot with dynamic data structures and want to use std::map, std::vector and friends. Having some familiarity with both Numba and Cython, I personally also find the development cycle to be much faster with Numba than with Cython.

Suggestion

If I were doing this work I would stick to Numba until Numba became an issue (either due to a need for dynamic data structures or downstream libraries being unwilling to depend on it), and then I would switch to something else. The cost of starting with Numba seems to me to be zero. We don't have to update our release mechanism, and we don't have to write special documentation or field extra compilation help requests. The cost to switch to Cython if necessary is likely very small: we'll likely be writing exactly the same code we would write for Cython, just with less stuff around it.

Numba is also likely to be faster without us thinking as much, which I think is useful when experimenting with new algorithms, like in #125.
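
To make the "same code, less stuff around it" point concrete, here is a minimal sketch of the kind of loop-heavy kernel in question (the function name and the duplicate-summing task are illustrative, not from this codebase). Deleting the decorator leaves ordinary Python; adding Cython type annotations to the same loop would give the Cython version.

```python
import numpy as np
import numba

@numba.jit(nopython=True)
def sum_duplicates(linear_idx, data):
    # Assumes `linear_idx` is sorted; sums values sharing a coordinate.
    out_idx = np.empty_like(linear_idx)
    out_data = np.empty_like(data)
    n = 0
    for i in range(len(linear_idx)):
        if n > 0 and linear_idx[i] == out_idx[n - 1]:
            out_data[n - 1] += data[i]
        else:
            out_idx[n] = linear_idx[i]
            out_data[n] = data[i]
            n += 1
    return out_idx[:n], out_data[:n]
```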

However

However, I'm unlikely to do any of this work in the near future, and the preferences of those doing the work should have precedence.

mrocklin avatar Mar 04 '18 14:03 mrocklin

cc @rgommers @ogrisel @stefanv @shoyer for thoughts and hopefully critical feedback. In what system would you like to see low-level code in the pydata/sparse library written?

mrocklin avatar Mar 04 '18 14:03 mrocklin

In favor of Cython

  1. Speed is near that of native C++, with the benefits of Python.
  2. Community adoptability: We will only need it for development, not as a hard dependency, so it has zero impact.
  3. Barriers to attracting new developers: Only a few algorithms need to be in Cython/wrapped C++; the rest will be pure Python. Developers will have the choice of either C++ wrapped with Cython, or pure Cython.
  4. Wrapping logic is really trivial. BLAS and friends have wrappers in SciPy; we could use those.
  5. Making DOK faster is a priority for me for batch assignments (i.e. ndarray and SparseArray assignments); this needs std::unordered_map (a pure-Python sketch of this bottleneck follows this list).
  6. Switching can be a somewhat expensive process down the line.
  7. @shoyer's algorithm for elemwise discussed in #1 will come in useful for CSD. Since the passes in that algorithm are likely to be slow, Numba could be almost twice as slow as an std::vector-based implementation. Other uses are very likely to pop up sooner or later.
  8. Debugging is better. This can be critical with complex logic.
  9. Interleaving code is relatively painless.
  10. We could re-use some things such as for radix sort.
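
As a pure-Python sketch of the batch-assignment bottleneck from point 5 (the function is hypothetical, not the actual DOK code): every iteration hashes a tuple and hits a Python dict, which is exactly the work an std::unordered_map, or an accelerated dict, would speed up.

```python
def dok_batch_assign(dok, coords, values):
    # `dok` maps coordinate tuples to scalars; `coords` is a sequence of
    # index tuples (e.g. rows of an (n, ndim) array) and `values` the
    # matching scalars.
    for coord, value in zip(map(tuple, coords), values):
        if value == 0:
            dok.pop(coord, None)  # sparse stores drop explicit zeros
        else:
            dok[coord] = value
```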

In favor of Numba

  1. Pure Python. This could be useful for attracting devs.
  2. Faster performance overall.
  3. Iteration over N-dimensional arrays is possible without complex logic (see the sketch after this list). This will be useful for advanced indexing.
  4. GPU support (?)
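
A minimal sketch of point 3 (the function is illustrative): np.ndindex is supported in nopython mode, so one compiled kernel covers any number of dimensions without dimension-specialized loops.

```python
import numpy as np
import numba

@numba.jit(nopython=True)
def count_nonzero_nd(arr):
    # Iterates an array of any dimensionality with a single generic loop.
    n = 0
    for idx in np.ndindex(arr.shape):
        if arr[idx] != 0:
            n += 1
    return n
```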

My Concerns

I guess it comes down to exactly two questions for me: Will we have to switch at some point (which I would like to avoid outright), and will Numba devs be willing to add support for accelerated dict and list operations, and interleaving? If the answers to both these questions are in favor of Numba, then it's a no-brainer for me.

If, on the other hand, the answer to even one of these questions is "no"... then my vote is for a mixture of Cython and Cython-wrapped C++.

I guess my main worries with Numba come down to the following:

  • Not everything is possible in it. I can think of a few occasions I ran into a brick wall with Numba.
  • We might have to switch, which could be more expensive than going with Cython in the first place.
  • Community adoptability.

hameerabbasi avatar Mar 04 '18 14:03 hameerabbasi

Some thoughts:

  1. I would not expect accelerated list/dict implementations from Numba. This is probably out of scope for them.
  2. Where possible I would like to see us avoid dynamic data structures for performance reasons
  3. I would not expect std::map to be significantly faster than Python's dict. At best I would hope for 2-3x improvements, not 10x.
  4. I find debugging to be much nicer in numba: I just remove numba.jit (see the sketch after this list). I'd be curious to learn your debugging technique for Cython.
  5. Can you expand what you mean by interleaving?
  6. I think that the presence of Cython in the codebase is itself a barrier to adoption. People won't be able to run from master without passing through a compilation step.
  7. I'm surprised to hear that switching from Numba to Cython seems expensive to you. To me it seems like the only cost is the work saved in the first place. In my experience going the opposite direction (Cython -> Numba), code typically works after I strip out all of the annotations, rename the file from pyx to py, and put numba.jit on the function. For numeric for-loopy code they seem to be more or less equivalent to me.
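
A small sketch of the workflow from point 4 (dot1d is illustrative): setting the environment variable NUMBA_DISABLE_JIT=1 turns @numba.jit into a no-op, so the very same file runs as plain Python under pdb without editing anything.

```python
import numba

@numba.jit(nopython=True)
def dot1d(a, b):
    # With NUMBA_DISABLE_JIT=1 this runs as ordinary Python and can be
    # stepped through in a debugger; without it, it compiles as usual.
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s
```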

However, mostly this is me trying to convince you not to adopt Cython prematurely. At this point you're doing most of the work and should probably decide.

As a co-maintainer, though, I would appreciate it if, prior to actually writing algorithmic code in Cython, you first set up the build process, packaging support (we should consider conda-forge), and documentation. I don't expect this to be particularly difficult, and I don't think we'll need any external libraries to speak of, but it's best to understand the costs ahead of time and build in solutions in case you leave this project in the future.

mrocklin avatar Mar 04 '18 14:03 mrocklin

FWIW, while I'm probably biased towards numba due to professional association, I also have a lot of respect for Cython. As someone who has used both in different projects (numba for personal things, Cython for geopandas), and as someone who maintains a number of projects, the added cost of maintaining a Cython project in terms of person-hours is high, and that cost tends to fall on a few maintainers. These days I tend to optimize more for reducing and diffusing maintenance costs than for most other things.

I personally would probably stop spending as much personal time fielding maintenance questions or build/release issues if we go for Cython. This is something that the community would need to pick up long term.

mrocklin avatar Mar 04 '18 15:03 mrocklin

I would not expect std::map to be significantly faster than Python's dict. At best I would hope for 2-3x improvements, not 10x.

There is also std::unordered_map, which is 2x-3x faster than std::map and is the closest equivalent to dict; std::map is really a binary search tree.

I'd be curious to learn your debugging technique for Cython

GDB supports Python natively, and Cython has extensions for GDB. It's on the command line, though (I've tried and failed to find a good IDE solution), so it is a bit of a pain compared to your solution. Also, it locks you into a specific interpreter build, so a conda env is a no-no.

Can you expand what you mean by interleaving?

Interleaving pure Python and optimized code. Another problem would be calling unoptimized code from optimized code.

People won't be able to run from master without passing through a compilation step

  • We can have a unified setup.py.
  • All developers would need to do for initial setup from master is pip install -e . (with or without Cython installed).
    • If this is to work without Cython:
      • we need a few more tweaks to setup.py to either pip install Cython,
      • or check in generated C++ (preferred; see the sketch after this list).
        • This would make the release easier as well.
  • If they frequently work with C/C++/Cython, Cython also has JIT support.
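
For the preferred route, a hypothetical sketch of what the unified setup.py could look like ("sparse._algos" and the file names are made up for illustration): cythonize the .pyx when Cython is installed, otherwise compile the checked-in, pre-generated C++.

```python
# setup.py -- hypothetical sketch, not the project's actual build script.
from setuptools import setup, Extension

try:
    # Developers with Cython installed regenerate the C++ on the fly.
    from Cython.Build import cythonize
    extensions = cythonize(
        [Extension("sparse._algos", ["sparse/_algos.pyx"], language="c++")]
    )
except ImportError:
    # Users installing from master or an sdist compile the checked-in C++.
    extensions = [Extension("sparse._algos", ["sparse/_algos.cpp"], language="c++")]

setup(name="sparse", ext_modules=extensions)
```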

In my experiences going the opposite direction (Cython -> Numba) code typically works after I strip out all of the annotations, rename the file from pyx to py, and put numba.jit on the function.

There is also the cost of alienating developers we may have attracted who are used to Numba.

However, mostly this is me trying to convince you not to adopt Cython prematurely.

Of course, I understand the costs, and if the consensus is on Cython, then I would build these solutions in, along with docs, right at the very first check-in. I believe properly > quickly.

After your convincing and experience, I'm not too opposed to Numba, but I would certainly like to wait for the thoughts of others (particularly SciPy devs, and whether they would consider it a blocker) before committing to it.

At this point you're doing most of the work and should probably decide.

My vote is still slightly in favor of Cython, but I'm open to Numba as an option. I don't believe I should make the decision alone; I would rather wait for community consensus.

hameerabbasi avatar Mar 04 '18 16:03 hameerabbasi

I have started watching this project only recently, and although I was super happy to finally see a sparse multidimensional tensor lib that I could contribute to, I'm just an observer here. Funnily enough, I spent some time yesterday trying to implement specialized tensor products for DOK with numba, and found out that dicts are not a recognized type yet. I used ordered lists of tuples instead of dicts everywhere, which seems to be supported well, but @mrocklin must know the details better than I do.

Here is one argument in favour of numba, which I haven't seen listed above: I have found very useful in other projects the ability to generate code, based on the number of dimensions for instance, and jit-compile it into a fast, memory-savvy kernel. There are several instances where I found this approach to be much faster than using numpy's generic functions, and I wouldn't know how to do the same with Cython or C/C++ for that matter, as the dimension combinations are not known at compile time.
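
To illustrate the code-generation argument, a hedged sketch (make_add_at and the scatter-add kernel are hypothetical): build the source of a kernel specialized to a given number of dimensions, exec it, and jit-compile the result, so the loop nest is fixed at compile time.

```python
import numba

def make_add_at(ndim):
    # Generate an ndim-specific scatter-add kernel and compile it.
    index = ", ".join("coords[n, %d]" % k for k in range(ndim))
    src = (
        "def add_at(out, coords, vals):\n"
        "    for n in range(vals.shape[0]):\n"
        "        out[%s] += vals[n]\n" % index
    )
    namespace = {}
    exec(src, namespace)
    return numba.njit(namespace["add_at"])

# e.g. add_at_3d = make_add_at(3) behaves like a compiled 3-D scatter-add.
```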

albop avatar Mar 04 '18 16:03 albop

At this point, if SciPy devs came out and said they would be okay with depending on this project if Numba was used, I'd go for Numba. I just had a look at the docs and they're significantly better than Cython's (which alone makes Numba worth considering).

It'd be nice to have another contributor, @albop. :-)

hameerabbasi avatar Mar 04 '18 16:03 hameerabbasi

@hameerabbasi : I was considering saying something about me contributing to the project, but wouldn't like to disappoint if it takes me a bit too long to start. I'll try to create some issues first ;-) (with the usecase I have in mind...)

albop avatar Mar 04 '18 16:03 albop

@albop If you have any questions about where something is handled, or how to implement something, don't hesitate to raise an issue and cc me on it. :-)

hameerabbasi avatar Mar 04 '18 17:03 hameerabbasi

Numba actually does currently accelerate list and set. I don't know if accelerating dict is on their roadmap or not.

shoyer avatar Mar 04 '18 17:03 shoyer

C++ is fine as long as the wrapper tool is modern enough to make it easy to package as a wheel (e.g. Cython or pybind11, no boost python please). I am fine with numba as a dependency as well. I can't speak for scipy developers.

ogrisel avatar Mar 04 '18 17:03 ogrisel

Numba actually does currently accelerate list and set. I don't know if accelerating dict is on their roadmap or not.

They do, it's even been labelled a high priority. xref numba/numba#2096

This makes me much more comfortable using Numba.

hameerabbasi avatar Mar 04 '18 17:03 hameerabbasi

I can't speak for all SciPy developers, but if Numba continues to improve and be maintained (not sure of its status now with all the changes at Anaconda?) then at some point we'll likely adopt it. The concerns regarding build dependencies etc. are minor compared to all the C/C++/Fortran we have already. If you want me to start that discussion now on scipy-dev, I'd be happy to do so.

Re debugging: it depends on what you're debugging. If it's an issue with the logic of your code, then Numba is much better (just remove @jit). If it's a problem with Numba or Cython behavior, though, then Cython's debugging support is far better than Numba's. The latter case is probably less common, but not that uncommon, both due to changes in new releases and due to things that don't work identically to plain Python code in the first place. I'd say that as a contributor I would prefer Numba, and as a library maintainer I'd prefer Cython.

We recently got a similar question from Pythran, and I came up with a list of factors for the decision at https://mail.python.org/pipermail/scipy-dev/2018-January/022334.html. The ones I'm not sure about for Numba are:

  • portability (exotic stuff like AIX, Raspberry Pi, etc.)
  • maintenance status
  • templating support for multiple dtypes

rgommers avatar Mar 04 '18 18:03 rgommers

How will the sparse array object be represented in numba? The JIT class capability is currently very limited.

datnamer avatar Mar 04 '18 18:03 datnamer

@datnamer This is a discussion on how we will adopt Numba for our internal algorithms, to make our own library faster for users. It isn't much of a discussion about how to make our classes JIT-compatible.

hameerabbasi avatar Mar 04 '18 18:03 hameerabbasi

If you want me to start that discussion now on scipy-dev, I'd be happy to do so.

That'd be really appreciated. 😃

hameerabbasi avatar Mar 04 '18 18:03 hameerabbasi

cc @woodmd, since you're planning to contribute, you should get a say. Also, @nils-werner.

hameerabbasi avatar Mar 04 '18 19:03 hameerabbasi

As long as things go well, numba has an edge over Cython. But once they start to misbehave, I'm not sure a) how to identify that it is misbehaving in numba or b) how to debug numba and steer it in the right direction. We have good answers to these questions for Cython because we've been using it for so long all over the scientific Python ecosystem, but if you can find similar answers for numba it would be tempting to use that instead (the code will almost certainly be simpler to implement and debug).

stefanv avatar Mar 04 '18 22:03 stefanv

One question at large regarding Cython, which will be very useful later: how hard is it to make it work for general N-dimensional arrays? From what I can tell, only arrays with a fixed number of dimensions can be an input to a function.

I personally would probably stop spending as much personal time fielding maintenance questions or build/release issues if we go for Cython. This is something that the community would need to pick up long term.

Would you be willing to contribute code (for the pure Python parts) and do code reviews (for both parts)? FWIW, I only intend to move the bottlenecks (currently only indexing, sorting, matching) to Cython/Numba/C++, keeping everything else as-is. I don't intend to write the entire thing in another language; NumPy already does a lot of that for us. I also care a lot about maintainability and attracting devs, and I'm sorry if it came across otherwise. Maintainability and community > a few ms shaved off an operation.

If building wheels is an issue, I have access to all three major OSs on my personal machine, and can easily build wheels for all of them (although I believe CircleCI/AppVeyor/Travis do this already)

You're a big part of this project at this stage, losing that could be a fatal blow to it.

hameerabbasi avatar Mar 05 '18 12:03 hameerabbasi

Re nD data structures: All internal data of COO arrays are 1D or 2D, are they not?

nils-werner avatar Mar 05 '18 13:03 nils-werner

There are a few use cases for N-D iteration in our code that I can think of at this point:

  1. Batch DOK assignment.
  2. Advanced indexing.

hameerabbasi avatar Mar 05 '18 13:03 hameerabbasi

portability (exotic stuff like AIX, Raspberry Pi, etc.)

It depends on LLVM via llvmlite, which is quite portable, I think.

maintenance status

Last release was 25 days ago, code frequency seems to be good. Not sure about the future, though. @mrocklin might have insights on that.

hameerabbasi avatar Mar 05 '18 14:03 hameerabbasi

I've raised an issue highlighting debugging issues in Numba here: https://github.com/numba/numba/issues/2788

Engagement there would be welcome.

mrocklin avatar Mar 05 '18 16:03 mrocklin

I do not have any major concerns about the longevity of Numba. Folks probably shouldn't rely on my opinion here due to the conflict of interest. I'd be happy to chat in more depth with folks if desired.

mrocklin avatar Mar 05 '18 16:03 mrocklin

this optimization will allow to compile inner loops in nopython mode regardless of what code surrounds those inner loops.

This was what I was talking about when I spoke of interleaving, so that concern can be put aside. Not too long ago, this wasn't supported IIRC. It also means templates and custom dtypes are supported.
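
A minimal sketch of the interleaving pattern under discussion (the function names are illustrative): ordinary Python handles coercion and object-level work, and only the hot inner loop is compiled in nopython mode.

```python
import numpy as np
import numba

@numba.jit(nopython=True)
def _match_sorted(a, b):
    # Compiled inner loop: values present in both sorted index arrays.
    out = np.empty(min(a.shape[0], b.shape[0]), dtype=np.int64)
    i = 0
    j = 0
    n = 0
    while i < a.shape[0] and j < b.shape[0]:
        if a[i] == b[j]:
            out[n] = a[i]
            n += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out[:n]

def matched_coords(x, y):
    # Plain Python wrapper: validation and dtype handling stay uncompiled.
    a = np.asarray(x, dtype=np.int64)
    b = np.asarray(y, dtype=np.int64)
    return _match_sorted(a, b)
```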

hameerabbasi avatar Mar 05 '18 16:03 hameerabbasi

One question at large regarding Cython, which will be very useful later: How hard is it to make it work for general N-dimensional arrays? From what I can tell, only arrays of a specific dimension can be an input to a function.

Numba has vectorize and guvectorize (which are great!) but in my (outdated) experience it was still easy to run into performance problems when trying to write code that works for any number of dimensions.

Using Cython alone may make this a little trickier since you need to know the number of dimensions at compile time, but there's always the option of using C++ libraries like xtensor which supports dynamic numbers of dimensions.
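
For reference, a minimal guvectorize sketch (illustrative, not from any benchmark): the kernel is written for the core 1-D case, and NumPy broadcasting supplies the remaining dimensions, which is one way Numba sidesteps fixing the number of dimensions at compile time.

```python
import numpy as np
from numba import guvectorize

@guvectorize(["void(float64[:], float64[:])"], "(n)->()")
def row_sum(row, out):
    # Core 1-D kernel; scalar outputs are written through out[0].
    s = 0.0
    for i in range(row.shape[0]):
        s += row[i]
    out[0] = s

# row_sum(np.ones((3, 4))) -> array([4., 4., 4.])
```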

shoyer avatar Mar 05 '18 16:03 shoyer

Numba has vectorize and guvectorize (which are great!) but in my experience it's still easy to run into performance problems when trying to write code that works for any number of dimensions.

I see the last commit to that library was three years ago. Would you be kind enough to re-run those benchmarks? Numba may have changed significantly.

hameerabbasi avatar Mar 05 '18 17:03 hameerabbasi

I see the last commit to that library was three years ago. Would you be kind enough to re-run those benchmarks? Numba may have changed significantly.

Indeed, I'm sure things have improved since then -- certainly numba fixed many of the issues I raised a few years ago.

Unfortunately I don't have a benchmark suite available to run again. I had some notebooks, but those were lost on an old work computer.

It's still fair to say that writing N-dimensional code will be somewhat easier with Numba than Cython.

shoyer avatar Mar 05 '18 17:03 shoyer

I'm sold on Numba. The only things missing now are:

  1. Some support for @jitclass, like Cython's cdef classes, for fast dicts (since the dict needs to be kept in a class, not just in a function). See the sketch after this list.
  2. dict acceleration.
  3. SciPy dev opinions.
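
For what it's worth, later Numba releases grew numba.typed.Dict, which is roughly the accelerated dict being asked for here, and it can be declared as a @jitclass field. A minimal sketch under that assumption (DOK2D is hypothetical):

```python
from numba import types
from numba.experimental import jitclass
from numba.typed import Dict

spec = [("data", types.DictType(types.UniTuple(types.int64, 2), types.float64))]

@jitclass(spec)
class DOK2D:
    # A tiny 2-D dict-of-keys store with a typed dict as a member.
    def __init__(self):
        self.data = Dict.empty(
            key_type=types.UniTuple(types.int64, 2),
            value_type=types.float64,
        )

    def set(self, i, j, value):
        self.data[(i, j)] = value

    def get(self, i, j):
        return self.data.get((i, j), 0.0)
```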

hameerabbasi avatar Mar 05 '18 19:03 hameerabbasi