Annif icon indicating copy to clipboard operation
Annif copied to clipboard

Switch to Poetry for dependency management

Open juhoinkinen opened this issue 2 years ago • 3 comments

Closes #601.

Besides creating the pyproject.toml file with dependencies, these are done:

  • Move pytest configuration from pytest.ini file to pyproject.toml section/table; delete pytest.ini
  • Delete MANIFEST.in file (the same contents could have been explicitly marked to be included in pyproject.toml, but the contents seemed to be included in the distribution given by poetry build automatically, TODO: needs to be double-checked)
  • Delete now obsolete setup.py
  • Make bumpversion update version string in pyproject.toml
  • Update README.md for usage of Poetry
  • Add poetry.lock to .gitignore
  • Use Poetry in GitHub Actions CI/CD pipeline

Points to note:

  • There is going to be a convenient flag --all-extras to poetry install: https://github.com/python-poetry/poetry/issues/3413
  • The release of Poetry 1.2.0 is close: https://github.com/python-poetry/poetry/issues/5586

For bumping version for release there exists the poetry version command that can bump patch, minor and major parts, but that applies only to the version string in pyproject.toml. There are also poetry plugins with different customizations (also for modifying version strings in any file, e.g. monim67/poetry-bumpversion), but it seems that the current approach with bumpversion is well applicable. Importantly the now-used bumpversion handles the git tag creation and commit, which I did not find any Poetry tool/plugin to be doing, and also allows bumping to *-dev versions. For now the bumpversion config should still be kept in setup.cfg; there is an open issue for pyproject.toml support.

I still try to refine the GH Actions pipeline to make a system-wide installation (I thought this should be working with setting POETRY_VIRTUALENVS_CREATE: false, but it is not) and by adding caching. To verify the release process works, I think a test release to test-PyPI could be done.

juhoinkinen avatar Aug 11 '22 11:08 juhoinkinen

Codecov Report

Merging #605 (4a6e2b5) into master (da680d3) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #605   +/-   ##
=======================================
  Coverage   99.61%   99.61%           
=======================================
  Files          87       87           
  Lines        6038     6038           
=======================================
  Hits         6015     6015           
  Misses         23       23           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Aug 11 '22 12:08 codecov[bot]

Force-pushed after some commit clean-up.

The poetry install install commands has large time overhead (~15 s), so it seems best to call it only once to install all (regular and optional) dependencies.

Still needs to check the release job with test-pypi.

juhoinkinen avatar Aug 12 '22 13:08 juhoinkinen

I checked with the branch test-pypi-release that the current setup works for publishing releases (using necessary alterations for using test-PyPI project as destination).

juhoinkinen avatar Aug 15 '22 10:08 juhoinkinen

There was weird behaviour in resolving flake8 version: when running poetry install version 2.5.5 was installed, and then running poetry update the version was upgraded to 4.0.1 (with the accompanying dependency updates). This happened also when the project included only autopep8 and flake8. Clearing Poetry cache had no effect.

Pinning autopep8 did not change anything, only pinning flake8 version to 4.* worked: now running poetry update just after install should not find anything to update.

~To do when merged: The Wiki page for installing dependencies for optional features should be updated.~ No need: the current pip commands can still be used.

juhoinkinen avatar Aug 16 '22 10:08 juhoinkinen

The installation of Poetry itself is now instructed in README to be done by curling and running the installation script from https://install.python-poetry.org/, and in GH Actions the install-poetry action does the same.

Poetry documentation states also pipx and manual (with pip and venv) ways to install Poetry. They could be a bit more secure, but also less convenient ways.

Maybe the Poetry installation in GH Actions could be done without the ready-made action but with just curling the script and running md5sum check for it?

juhoinkinen avatar Aug 16 '22 11:08 juhoinkinen

I tested this briefly on two machines: a desktop with fixed Ethernet and a laptop with Wi-Fi. For both machines, the "Resolving dependencies" step in poetry install took a long time: 108 seconds for the desktop and 134 seconds for the laptop. In my understanding, this isn't something you'd do every day, so maybe it's not too worrisome. But it's quite a bit slower than the pip equivalent.

More worrying is that the install slowdown also affects CI jobs. The latest run with Poetry took 6 minutes 42 seconds, while a recent non-Poetry run took 3 minutes 55 seconds. So the CI run now takes almost three minutes longer! I wonder if there's anything that could still be done to speed up the installation step under CI?

However, slow install seems to be a known issue in Poetry (and there are many other similar but more recent issues). It's also the first question in the FAQ. According to the FAQ, the main reason for the slow operation is the lack of dependency metadata on PyPI, which leads to lots of extra work downloading and inspecting individual package versions from PyPI. This gives me two ideas for possible improvements of this PR:

  1. Specify the versions of packages we depend on more strictly. No * or >x.x allowed! There's another FAQ entry about this. Maybe it will reduce the work Poetry has to perform.
  2. It appears that Poetry will cache the information it has painstakingly collected from PyPI. For example, if I delete poetry.lock and run poetry install again, it only takes 6-7 seconds on both machines. Try to keep that cache under GitHub Actions!

I like the way Poetry shows outdated packages a lot:

$ poetry show --outdated
autopep8             1.6.0  1.7.0  A tool that automatically formats Python code to conform to the PEP 8 style guide
blis                 0.7.8  0.9.1  The Blis BLAS-like linear algebra library, as a self-contained C-extension.
catalogue            2.0.8  2.1.0  Super lightweight function registries for your library
click                8.0.4  8.1.3  Composable command line interface toolkit
coverage             6.2    6.4.3  Code coverage measurement for Python
flake8               4.0.1  5.0.4  the modular source code checker: pep8 pyflakes and co
flatbuffers          1.12   2.0    The FlatBuffers serialization format for Python
gast                 0.4.0  0.5.3  Python AST that abstracts the underlying Python version
google-auth-oauthlib 0.4.6  0.5.2  Google Authentication Library
mccabe               0.6.1  0.7.0  McCabe checker, plugin for flake8
protobuf             3.19.4 4.21.5 Protocol Buffers
pycodestyle          2.8.0  2.9.1  Python style guide checker
pydantic             1.8.2  1.9.2  Data validation and settings management using python 3.6 type hinting
pyflakes             2.4.0  2.5.0  passive checker of Python programs
scikit-learn         1.1.1  1.1.2  A set of python modules for machine learning and data mining
scipy                1.8.1  1.9.0  SciPy: Scientific Library for Python
simplemma            0.7.0  0.8.0  A simple multilingual lemmatizer for Python.
smart-open           5.2.1  6.0.0  Utils for streaming large files (S3, HDFS, GCS, Azure Blob Storage, gzip, bz2...)
spacy                3.3.1  3.4.1  Industrial-strength Natural Language Processing (NLP) in Python
tensorboard          2.9.1  2.10.0 TensorBoard lets you watch Tensors Flow
thinc                8.0.17 8.1.0  A refreshing functional take on deep learning, compatible with your favorite libraries
typer                0.4.2  0.6.1  Typer, build great CLIs. Easy to code. Based on Python type hints.
yake                 0.4.5  0.4.8  Keyword extraction Python package

(the color coding gets lost in the copy&paste)

osma avatar Aug 16 '22 13:08 osma

I tested re-running the CI jobs for this PR. First attempt took 5min 30s, second took 5min 35s. So not quite as slow as the first time, but still much slower than without Poetry. The "Install Python dependencies" step seems to take around 2 minutes for all three Python versions.

osma avatar Aug 16 '22 14:08 osma

I see that there are quite a few Poetry-related actions in the GitHub Actions Marketplace: https://github.com/marketplace?type=actions&query=poetry+

Maybe one of the others would be faster or otherwise better than the current one?

osma avatar Aug 17 '22 07:08 osma

  1. Specify the versions of packages we depend on more strictly. No * or >x.x allowed! There's another FAQ entry about this. Maybe it will reduce the work Poetry has to perform.

This did not help (at least much). With strict pinning: Version solving took 93.671 seconds. Previously, without strict pinning, this took ~100 seconds.

  1. It appears that Poetry will cache the information it has painstakingly collected from PyPI. For example, if I delete poetry.lock and run poetry install again, it only takes 6-7 seconds on both machines. Try to keep that cache under GitHub Actions!

This worked! Caching directory ~/.cache/pypoetry/cache/repositories makes the dependencies resolution to go in few seconds, and the CI runs overall in less than 3 minutes.

A bit surprising that this was not cached already, as the virtual environment was, by the setup-python action feature. Well, that cache is stated to apply to virtual environment (only) as stated in the docs, but I missed that.

juhoinkinen avatar Aug 17 '22 09:08 juhoinkinen

I changed README.md to instruct to use pipx instead curl|bash for installing Poetry, as pipx might be the way for Annif installations for end-users in the future.

Without layer caching building Docker image with Poetry is slow (6-7 mins in GH Actions), but actually has not been much faster with pip. The layer caching could be implemented in GH Actions (an example), but that can be a different PR, and the Docker build occurs anyway only on pushes to master, not to every push to PR branches.

juhoinkinen avatar Sep 06 '22 12:09 juhoinkinen

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

sonarqubecloud[bot] avatar Sep 13 '22 08:09 sonarqubecloud[bot]