arrow icon indicating copy to clipboard operation
arrow copied to clipboard

Add Cython Support to Arrow

Open 13MK3 opened this issue 2 years ago • 5 comments

Pull Request Checklist

Thank you for taking the time to improve Arrow! Before submitting your pull request, please check all appropriate boxes:

  • [ ] 🧪 Added tests for changed code.
  • [x] 🛠️ All tests pass when run locally (run tox or make test to find out!).
  • [x] 🧹 All linting checks pass when run locally (run tox -e lint or make lint to find out!).
  • [ ] 📚 Updated documentation for changed code.
  • [x] ⏩ Code is up-to-date with the master branch.

If you have any questions about your code changes or any of the points above, please submit your questions along with the pull request and we will try our best to help!

Description of Changes

These changes were made in collaboration with @khanm3

  • Modified the setup file so that the entire Arrow library is compiled using Cython
  • Modified the tox.ini file to support code coverage with Cython
  • Added Cython 3.0.0a11 to the requirements.txt
  • Modified arrow.py to compile with Cython and pass the test suite
  • Modified an existing test case to work with Cython since the mocking was broken from Cython caching builtins
  • Added a Makefile target to compile with Cython

This pull request adds Cython support to Arrow as requested by #617.

Further Explanation

In arrow/arrow.py, we had to change the variable "locale" to "locale_cls" because the variable was being reused for 2 different types and this functionality was incompatible with Cython. Cython also had difficulties recognizing the type of _datetime.isoclanedar, so we had to specifically cast it to a tuple. Without this, it would not compile.

Even with the issues mentioned above, we chose to use Cython 3.0.0a11 (which is a pre-release version) instead of Cython 0.29.32 (the most recent stable version), since from our testing, Cython 3 is better at parsing modern Python features. In general, Cython tries to subsume the functionality of python, but as you can see there are still many quirks.

We were able to allow Cython to use the existing method of calculating code coverage by adding linetracing as a compiler directive. However, due to the quirks of Cython, the resulting line coverage results are slightly lower than 100%. We found that the lines that were no longer being covered were one of 2 cases:

  1. Inside an if statement that had 'pragma no cover' (so those also shouldn't count towards coverage)
  2. Parameters to functions that had a default value (which also shouldn't count towards coverage, but for some reason Cython thinks it should)

This could be ignored by adding more 'pragma no cover', but we're not sure if there is a better way.

Performance Analysis

To compare performance of Arrow with and without Cython, we ran just the Arrow part of this benchmark. This benchmark runs 1 million executions per benchmark and picked the min of 5 repeats. This was run in Python 3.8.10.

arrow_benchmark

From these results, you can see that compiling Arrow with Cython does cause a measureable improvement to performance, especially with parsing. Additionally, compiling Cython without linetracing is even more performant, but we currently compile with it so that code coverage can be calculated.

Future Work

Adding static typing (and necessarily converting those files to .pyx files) would bring an even bigger boost to performance. Parser.py should receive special attention, as this is where the largest amount of time is being spent. This is where adding Cython to Arrow could prove to be especially valuable, since the speed increase with static typing should be much more than that gained by simply compiling with Cython. Looking into different ways of calculating code coverage so linetracing is no longer needed, or providing an option to switch the compiler directive would also be a good way to improve performance in the future.

13MK3 avatar Nov 21 '22 03:11 13MK3

Very nice, i had a wip branch of this, but this looks like good progress.

krisfremen avatar Nov 21 '22 05:11 krisfremen

We realized the system tests are failing because the test systems aren't installing Cython before running setup.py. It seems that the continuous_integration.yml is using cached requirements and therefore not installing Cython even though we specified it in requirements.txt? We are not exactly sure and could use some help with this. Cython seems to install fine when we do "make build3x" locally.

13MK3 avatar Nov 21 '22 20:11 13MK3

tox handles setup.py dependencies a bit differently, which we install during the GH Actions.

We should probably extract those out of the workflow and in a separate requirements file.

krisfremen avatar Nov 23 '22 18:11 krisfremen

Codecov Report

Base: 100.00% // Head: 99.59% // Decreases project coverage by -0.40% :warning:

Coverage data is based on head (3d416be) compared to base (74a759b). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##            master    #1143      +/-   ##
===========================================
- Coverage   100.00%   99.59%   -0.41%     
===========================================
  Files            9        9              
  Lines         2319     2684     +365     
  Branches       492        0     -492     
===========================================
+ Hits          2319     2673     +354     
- Misses           0       11      +11     
Impacted Files Coverage Δ
arrow/arrow.py 99.54% <100.00%> (-0.46%) :arrow_down:
arrow/constants.py 88.23% <0.00%> (-11.77%) :arrow_down:
arrow/util.py 97.95% <0.00%> (-2.05%) :arrow_down:
arrow/formatter.py 99.03% <0.00%> (-0.97%) :arrow_down:
arrow/parser.py 99.43% <0.00%> (-0.57%) :arrow_down:
arrow/locales.py 99.85% <0.00%> (-0.15%) :arrow_down:
arrow/api.py 100.00% <0.00%> (ø)
arrow/factory.py 100.00% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Nov 23 '22 18:11 codecov[bot]

@13MK3 I see 3.8+ is passing, the rest are failing. Great work.

@anishnya @jadchaar @systemcatch how do we want to handle this going forward? do we want to go full cython or offer a pure python version as well?

krisfremen avatar Dec 11 '22 01:12 krisfremen