arrow
arrow copied to clipboard
Add Cython Support to Arrow
Pull Request Checklist
Thank you for taking the time to improve Arrow! Before submitting your pull request, please check all appropriate boxes:
- [ ] 🧪 Added tests for changed code.
- [x] 🛠️ All tests pass when run locally (run
tox
ormake test
to find out!). - [x] 🧹 All linting checks pass when run locally (run
tox -e lint
ormake lint
to find out!). - [ ] 📚 Updated documentation for changed code.
- [x] ⏩ Code is up-to-date with the
master
branch.
If you have any questions about your code changes or any of the points above, please submit your questions along with the pull request and we will try our best to help!
Description of Changes
These changes were made in collaboration with @khanm3
- Modified the setup file so that the entire Arrow library is compiled using Cython
- Modified the tox.ini file to support code coverage with Cython
- Added Cython 3.0.0a11 to the requirements.txt
- Modified arrow.py to compile with Cython and pass the test suite
- Modified an existing test case to work with Cython since the mocking was broken from Cython caching builtins
- Added a Makefile target to compile with Cython
This pull request adds Cython support to Arrow as requested by #617.
Further Explanation
In arrow/arrow.py, we had to change the variable "locale" to "locale_cls" because the variable was being reused for 2 different types and this functionality was incompatible with Cython. Cython also had difficulties recognizing the type of _datetime.isoclanedar, so we had to specifically cast it to a tuple. Without this, it would not compile.
Even with the issues mentioned above, we chose to use Cython 3.0.0a11 (which is a pre-release version) instead of Cython 0.29.32 (the most recent stable version), since from our testing, Cython 3 is better at parsing modern Python features. In general, Cython tries to subsume the functionality of python, but as you can see there are still many quirks.
We were able to allow Cython to use the existing method of calculating code coverage by adding linetracing as a compiler directive. However, due to the quirks of Cython, the resulting line coverage results are slightly lower than 100%. We found that the lines that were no longer being covered were one of 2 cases:
- Inside an if statement that had 'pragma no cover' (so those also shouldn't count towards coverage)
- Parameters to functions that had a default value (which also shouldn't count towards coverage, but for some reason Cython thinks it should)
This could be ignored by adding more 'pragma no cover', but we're not sure if there is a better way.
Performance Analysis
To compare performance of Arrow with and without Cython, we ran just the Arrow part of this benchmark. This benchmark runs 1 million executions per benchmark and picked the min of 5 repeats. This was run in Python 3.8.10.
From these results, you can see that compiling Arrow with Cython does cause a measureable improvement to performance, especially with parsing. Additionally, compiling Cython without linetracing is even more performant, but we currently compile with it so that code coverage can be calculated.
Future Work
Adding static typing (and necessarily converting those files to .pyx files) would bring an even bigger boost to performance. Parser.py should receive special attention, as this is where the largest amount of time is being spent. This is where adding Cython to Arrow could prove to be especially valuable, since the speed increase with static typing should be much more than that gained by simply compiling with Cython. Looking into different ways of calculating code coverage so linetracing is no longer needed, or providing an option to switch the compiler directive would also be a good way to improve performance in the future.
Very nice, i had a wip branch of this, but this looks like good progress.
We realized the system tests are failing because the test systems aren't installing Cython before running setup.py. It seems that the continuous_integration.yml is using cached requirements and therefore not installing Cython even though we specified it in requirements.txt? We are not exactly sure and could use some help with this. Cython seems to install fine when we do "make build3x" locally.
tox handles setup.py dependencies a bit differently, which we install during the GH Actions.
We should probably extract those out of the workflow and in a separate requirements file.
Codecov Report
Base: 100.00% // Head: 99.59% // Decreases project coverage by -0.40%
:warning:
Coverage data is based on head (
3d416be
) compared to base (74a759b
). Patch coverage: 100.00% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## master #1143 +/- ##
===========================================
- Coverage 100.00% 99.59% -0.41%
===========================================
Files 9 9
Lines 2319 2684 +365
Branches 492 0 -492
===========================================
+ Hits 2319 2673 +354
- Misses 0 11 +11
Impacted Files | Coverage Δ | |
---|---|---|
arrow/arrow.py | 99.54% <100.00%> (-0.46%) |
:arrow_down: |
arrow/constants.py | 88.23% <0.00%> (-11.77%) |
:arrow_down: |
arrow/util.py | 97.95% <0.00%> (-2.05%) |
:arrow_down: |
arrow/formatter.py | 99.03% <0.00%> (-0.97%) |
:arrow_down: |
arrow/parser.py | 99.43% <0.00%> (-0.57%) |
:arrow_down: |
arrow/locales.py | 99.85% <0.00%> (-0.15%) |
:arrow_down: |
arrow/api.py | 100.00% <0.00%> (ø) |
|
arrow/factory.py | 100.00% <0.00%> (ø) |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
@13MK3 I see 3.8+ is passing, the rest are failing. Great work.
@anishnya @jadchaar @systemcatch how do we want to handle this going forward? do we want to go full cython or offer a pure python version as well?