Annif
Switch to Poetry for dependency management
Closes #601.
Besides creating the pyproject.toml file with dependencies, these are done:
- Move pytest configuration from pytest.ini file to pyproject.toml section/table; delete pytest.ini
- Delete MANIFEST.in file (the same contents could have been explicitly marked to be included in pyproject.toml, but the contents seemed to be included automatically in the distribution given by poetry build; TODO: needs to be double-checked)
- Delete now obsolete setup.py
- Make bumpversion update version string in pyproject.toml
- Update README.md for usage of Poetry
- Add poetry.lock to .gitignore
- Use Poetry in GitHub Actions CI/CD pipeline
Points to note:
- There is going to be a convenient flag --all-extras to poetry install: https://github.com/python-poetry/poetry/issues/3413
- The release of Poetry 1.2.0 is close: https://github.com/python-poetry/poetry/issues/5586
For bumping the version for a release there is the poetry version command, which can bump the patch, minor and major parts, but it applies only to the version string in pyproject.toml. There are also Poetry plugins with different customizations (also for modifying version strings in any file, e.g. monim67/poetry-bumpversion), but the current approach with bumpversion seems well applicable. Importantly, bumpversion as used now handles the git tag creation and commit, which I did not find any Poetry tool/plugin doing, and it also allows bumping to *-dev versions. For now the bumpversion config should still be kept in setup.cfg; there is an open issue for pyproject.toml support.
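As a rough illustration of what bumping the patch, minor and major parts means, here is a minimal Python sketch. It is not how bumpversion or poetry version are implemented; the function name `bump` and the version numbers are hypothetical, and *-dev suffixes are deliberately not handled.

```python
def bump(version: str, part: str) -> str:
    """Return `version` (MAJOR.MINOR.PATCH) with `part` incremented.

    Illustrative only: lower-order parts are reset to zero, and
    pre-release suffixes such as "-dev" are not supported.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


# Hypothetical version numbers, chosen only for the example.
print(bump("1.2.3", "patch"))
print(bump("1.2.3", "minor"))
print(bump("1.2.3", "major"))
```

Updating the string is the easy part; the git commit and tag creation that bumpversion adds on top are outside this sketch.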
I am still trying to refine the GH Actions pipeline to make a system-wide installation (I thought this should work by setting POETRY_VIRTUALENVS_CREATE: false, but it does not) and to add caching. To verify that the release process works, I think a test release to test-PyPI could be done.
Codecov Report
Merging #605 (4a6e2b5) into master (da680d3) will not change coverage. The diff coverage is n/a.
@@ Coverage Diff @@
## master #605 +/- ##
=======================================
Coverage 99.61% 99.61%
=======================================
Files 87 87
Lines 6038 6038
=======================================
Hits 6015 6015
Misses 23 23
Force-pushed after some commit clean-up.
The poetry install command has a large time overhead (~15 s), so it seems best to call it only once to install all (regular and optional) dependencies.
The release job still needs to be checked with test-PyPI.
I checked with the branch test-pypi-release that the current setup works for publishing releases (using necessary alterations for using test-PyPI project as destination).
There was weird behaviour in resolving the flake8 version: when running poetry install, version 2.5.5 was installed, and then when running poetry update the version was upgraded to 4.0.1 (with the accompanying dependency updates). This happened also when the project included only autopep8 and flake8. Clearing the Poetry cache had no effect.
Pinning autopep8 did not change anything; only pinning the flake8 version to 4.* worked: now running poetry update just after install should not find anything to update.
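To make the effect of the 4.* pin concrete: a wildcard constraint accepts any version whose leading components match the part before the asterisk. The sketch below is a simplified stand-in written for illustration, not Poetry's actual resolver; the helper name `matches_wildcard` is hypothetical and it only handles plain dotted numeric versions.

```python
def matches_wildcard(version: str, constraint: str) -> bool:
    """Check a dotted version against a constraint such as "4.*".

    Simplified illustration: components are compared as strings, and
    everything after the first "*" is ignored.
    """
    version_parts = version.split(".")
    for index, wanted in enumerate(constraint.split(".")):
        if wanted == "*":
            return True  # wildcard accepts the rest of the version
        if index >= len(version_parts) or version_parts[index] != wanted:
            return False
    # No wildcard present: require an exact component-wise match.
    return len(version_parts) == len(constraint.split("."))


# The flake8 versions mentioned in this thread.
for candidate in ("2.5.5", "4.0.1", "5.0.4"):
    print(candidate, matches_wildcard(candidate, "4.*"))
```

Under this reading, 2.5.5 falls outside the constraint, so it can no longer be chosen at install time.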
~To do when merged: The Wiki page for installing dependencies for optional features should be updated.~ No need: the current pip commands can still be used.
The installation of Poetry itself is now instructed in README to be done by curling and running the installation script from https://install.python-poetry.org/, and in GH Actions the install-poetry action does the same.
The Poetry documentation also describes pipx and manual (with pip and venv) ways to install Poetry. They could be a bit more secure, but they are also less convenient.
Maybe the Poetry installation in GH Actions could be done without the ready-made action, by just curling the script and running an md5sum check on it?
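A checksum check of a downloaded script could look roughly like the following Python sketch. The function name `checksum_matches` is hypothetical, and the expected digest would have to come from a trusted source, which this thread does not specify. The sketch defaults to SHA-256; passing `algorithm="md5"` would mimic md5sum where MD5 is available.

```python
import hashlib
import hmac


def checksum_matches(data: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Return True if the digest of `data` equals `expected_hex`.

    `expected_hex` is an assumption of this sketch: it must be obtained
    from a trusted place, not from the same download as the script.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(digest, expected_hex.lower())


# Stand-in bytes instead of a real download, to keep the example offline.
script = b"print('installer placeholder')\n"
known_good = hashlib.sha256(script).hexdigest()
print(checksum_matches(script, known_good))
print(checksum_matches(script + b"tampered", known_good))
```

The check only adds protection if the script is executed after the comparison succeeds, and if the pinned digest is updated deliberately whenever the installer changes.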
I tested this briefly on two machines: a desktop with fixed Ethernet and a laptop with Wi-Fi. For both machines, the "Resolving dependencies" step in poetry install took a long time: 108 seconds for the desktop and 134 seconds for the laptop. In my understanding, this isn't something you'd do every day, so maybe it's not too worrisome. But it's quite a bit slower than the pip equivalent.
More worrying is that the install slowdown also affects CI jobs. The latest run with Poetry took 6 minutes 42 seconds, while a recent non-Poetry run took 3 minutes 55 seconds. So the CI run now takes almost three minutes longer! I wonder if there's anything that could still be done to speed up the installation step under CI?
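For reference, the difference between the two CI runs quoted above can be worked out with a few lines of Python. Only the two durations come from the thread; the variable names are made up for the example.

```python
from datetime import timedelta

# Durations of the two CI runs quoted above.
poetry_run = timedelta(minutes=6, seconds=42)
pip_run = timedelta(minutes=3, seconds=55)

slowdown = poetry_run - pip_run  # absolute difference
ratio = poetry_run / pip_run     # relative slowdown as a float

print(f"slowdown: {slowdown}")
print(f"ratio: {ratio:.2f}x")
```

That is 2 minutes 47 seconds, or a run roughly 1.7 times as long as before.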
However, slow install seems to be a known issue in Poetry (and there are many other similar but more recent issues). It's also the first question in the FAQ. According to the FAQ, the main reason for the slow operation is the lack of dependency metadata on PyPI, which leads to lots of extra work downloading and inspecting individual package versions from PyPI. This gives me two ideas for possible improvements of this PR:
- Specify the versions of packages we depend on more strictly. No * or >x.x allowed! There's another FAQ entry about this. Maybe it will reduce the work Poetry has to perform.
- It appears that Poetry will cache the information it has painstakingly collected from PyPI. For example, if I delete poetry.lock and run poetry install again, it only takes 6-7 seconds on both machines. Try to keep that cache under GitHub Actions!
I like the way Poetry shows outdated packages a lot:
$ poetry show --outdated
autopep8 1.6.0 1.7.0 A tool that automatically formats Python code to conform to the PEP 8 style guide
blis 0.7.8 0.9.1 The Blis BLAS-like linear algebra library, as a self-contained C-extension.
catalogue 2.0.8 2.1.0 Super lightweight function registries for your library
click 8.0.4 8.1.3 Composable command line interface toolkit
coverage 6.2 6.4.3 Code coverage measurement for Python
flake8 4.0.1 5.0.4 the modular source code checker: pep8 pyflakes and co
flatbuffers 1.12 2.0 The FlatBuffers serialization format for Python
gast 0.4.0 0.5.3 Python AST that abstracts the underlying Python version
google-auth-oauthlib 0.4.6 0.5.2 Google Authentication Library
mccabe 0.6.1 0.7.0 McCabe checker, plugin for flake8
protobuf 3.19.4 4.21.5 Protocol Buffers
pycodestyle 2.8.0 2.9.1 Python style guide checker
pydantic 1.8.2 1.9.2 Data validation and settings management using python 3.6 type hinting
pyflakes 2.4.0 2.5.0 passive checker of Python programs
scikit-learn 1.1.1 1.1.2 A set of python modules for machine learning and data mining
scipy 1.8.1 1.9.0 SciPy: Scientific Library for Python
simplemma 0.7.0 0.8.0 A simple multilingual lemmatizer for Python.
smart-open 5.2.1 6.0.0 Utils for streaming large files (S3, HDFS, GCS, Azure Blob Storage, gzip, bz2...)
spacy 3.3.1 3.4.1 Industrial-strength Natural Language Processing (NLP) in Python
tensorboard 2.9.1 2.10.0 TensorBoard lets you watch Tensors Flow
thinc 8.0.17 8.1.0 A refreshing functional take on deep learning, compatible with your favorite libraries
typer 0.4.2 0.6.1 Typer, build great CLIs. Easy to code. Based on Python type hints.
yake 0.4.5 0.4.8 Keyword extraction Python package
(the color coding gets lost in the copy&paste)
I tested re-running the CI jobs for this PR. First attempt took 5min 30s, second took 5min 35s. So not quite as slow as the first time, but still much slower than without Poetry. The "Install Python dependencies" step seems to take around 2 minutes for all three Python versions.
I see that there are quite a few Poetry-related actions in the GitHub Actions Marketplace: https://github.com/marketplace?type=actions&query=poetry+
Maybe one of the others would be faster or otherwise better than the current one?
- Specify the versions of packages we depend on more strictly. No * or >x.x allowed! There's another FAQ entry about this. Maybe it will reduce the work Poetry has to perform.
This did not help (at least not much). With strict pinning: Version solving took 93.671 seconds.
Previously, without strict pinning, this took ~100 seconds.
- It appears that Poetry will cache the information it has painstakingly collected from PyPI. For example, if I delete poetry.lock and run poetry install again, it only takes 6-7 seconds on both machines. Try to keep that cache under GitHub Actions!
This worked! Caching the directory ~/.cache/pypoetry/cache/repositories makes the dependency resolution finish in a few seconds, and the CI runs overall in less than 3 minutes.
It is a bit surprising that this was not cached already, as the virtual environment was, by the setup-python action feature. Well, the docs state that this cache applies to the virtual environment (only), but I missed that.
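One common way to keep such a cache valid is to key it on the contents of the dependency declaration, so that the cached directory is reused until the dependencies change. The Python sketch below only illustrates that idea. How the cache key is actually formed in this PR is not stated here, and the function name `cache_key` and the key prefix are hypothetical.

```python
import hashlib


def cache_key(dependency_spec: bytes, prefix: str = "poetry-repositories") -> str:
    """Derive a cache key from the bytes of a dependency file.

    Assumption of this sketch: the key should change exactly when the
    dependency declaration (e.g. pyproject.toml) changes.
    """
    digest = hashlib.sha256(dependency_spec).hexdigest()
    return f"{prefix}-{digest[:16]}"


# Inline stand-ins for two revisions of a pyproject.toml; in practice one
# might pass pathlib.Path("pyproject.toml").read_bytes() instead.
before = b'[tool.poetry.dependencies]\nflake8 = "*"\n'
after = b'[tool.poetry.dependencies]\nflake8 = "4.*"\n'

print(cache_key(before))
print(cache_key(before) == cache_key(before))
print(cache_key(before) == cache_key(after))
```

A content-derived key trades some cache hits for safety: any edit to the file, even an unrelated one, produces a new key and a cold resolution.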
I changed README.md to instruct using pipx instead of curl|bash for installing Poetry, as pipx might be the way to install Annif for end-users in the future.
Without layer caching, building the Docker image with Poetry is slow (6-7 mins in GH Actions), but it actually has not been much faster with pip. Layer caching could be implemented in GH Actions (an example), but that can be a different PR, and the Docker build occurs anyway only on pushes to master, not on every push to PR branches.