Plans for simplemma 1.0 release?

Open osma opened this issue 2 years ago • 9 comments

Hi, we are soon going to release Annif 1.0. We are currently depending on simplemma 0.9.1, which is the last released version, released on January 20th.

Since then a lot has happened in the simplemma main branch, including many enhancements by @juanjoDiaz . I've also seen occasional references to a 1.0 release in comments on issues and PRs (for example in #64) and a few issues are tagged with milestone 1.0.

We would like to know whether it makes sense to wait for another simplemma release (which would enable some new features, thanks to the ability to better control memory usage and caches) or whether to release Annif 1.0 with the current simplemma 0.9.1 dependency. So my questions are:

  1. Is the next release going to be simplemma 1.0, or are you still going to do another 0.x release?
  2. Do you have any concrete plans for when the next release is going to happen?
  3. Is there anything we could do to help with getting out the new release?

Thanks in advance & sorry for asking difficult questions :)

osma avatar Aug 07 '23 11:08 osma

Hi @osma, thanks for your feedback, it's an important question indeed. When do you plan to release Annif 1.0?

As far as I know there are at least two (currently open) remaining PRs before releasing version 1. Do you have something else in mind @juanjoDiaz ?


Regarding your questions:

  1. It is not established the next version has to be 1.0.
  2. There are concrete plans but as we are not working full time on this project we are regularly behind schedule anyway.
  3. Thanks for asking! You could test the current repository version in your CI/CD environment and report bugs if something isn't working as usual. If you wish to use the new classes, you could check the documentation and see if you can work with them. (In your setup you can use simplemma @ git+https://github.com/adbar/simplemma as a dependency to get the latest state.)

Since the already documented functions should keep working as before, I see the following options. They are meant to let you (@osma) benefit from the latest changes without having to hassle everyone:

  • Issue a version 0.9.2 as of now to pass on the changes
  • Issue a beta version 1 (not exactly sure how but it's possible) that can be referenced by external users
  • Issue a version 1.0.0 even without the full docs and a 1.0.1 with the docs and optional additional improvements

adbar avatar Aug 08 '23 13:08 adbar

When do you plan to release Annif 1.0?

We don't have a set date, but would like to release by the end of August or so. The reason I asked about Simplemma was that we could plan for the release accordingly.

It is not established the next version has to be 1.0. [...] Issue a version 0.9.2 as of now to pass on the changes

I think there have been quite a few changes to the internals as well as to the external APIs. Of course in 0.x versions anything is permitted under SemVer, but I think 0.10.0 would be a more fitting version number if you want to make an intermediate release that is not yet 1.0.0.

You could test the current repository version in your CI/CD environment and report bugs if something isn't working as usual.

Yes, that's a good idea, let's try to do that!

If you wish to use the new classes you could check the documentation and see if you can work with it.

Yes, right. We have some outstanding work in a PR which adds language detection to the Annif REST API. It hasn't been merged yet because we weren't comfortable with the lack of control over memory usage in simplemma 0.9.1. If we can upgrade to a more recent simplemma release, then we could revive that PR and add an important feature to Annif. However, it doesn't seem like a good idea to depend on a git version of simplemma for anything other than testing, so this could only be merged after you have made a release that we can depend on.

osma avatar Aug 09 '23 13:08 osma

I agree that the repo version should only be used for tests. I'll wait for your feedback on core functionality and then we'll aim for v1, with or without full docs depending on the time at hand.

adbar avatar Aug 10 '23 10:08 adbar

I created a draft PR for Annif where I first upgraded the simplemma dependency to the latest github main branch and then did some further adjustments to the calling code, in order to benefit from the possibility of limiting the number of cached dictionaries that Simplemma keeps in memory from the default 8 to a smaller number (5 for now).

The good news

  1. When I upgraded the dependency (first commit in the PR), nothing broke. So the old code that relied on simplemma.lemmatize and in_target_language still kept working.
  2. I did a little bit of very unscientific benchmarking, and the new code seems to be noticeably faster. The runtime of my little benchmark suite (training and evaluating two different Annif models which use simplemma for Finnish language lemmatization) decreased from 1176 seconds to 1009 seconds, so it's over 15% faster - and mostly the code is busy doing other things than lemmatization, so simplemma itself has improved more than that. Memory usage remained the same, as did the quality of evaluation results.
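For reference, the two timings quoted above can be read in two ways, and the short calculation below works out both. It is only a sketch of the arithmetic: the two runtimes come from the comment, while the variable names are made up for illustration.

```python
# Runtimes reported in the comment above (seconds).
old_seconds = 1176  # benchmark with simplemma 0.9.1
new_seconds = 1009  # benchmark with the main branch

# Reading 1: how much shorter the run became, relative to the old runtime.
runtime_reduction = (old_seconds - new_seconds) / old_seconds

# Reading 2: how much more work fits into the same time (speedup minus one).
throughput_gain = old_seconds / new_seconds - 1

print(f"runtime reduction: {runtime_reduction:.1%}")
print(f"throughput gain:   {throughput_gain:.1%}")
```

The "over 15% faster" figure matches the second reading; measured as a reduction in wall-clock time, the difference is a little over 14%.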

The not-so-good news

I appreciate the flexibility offered by the new Lemmatizer and LanguageDetector classes - especially the ability to limit the number of dictionaries kept in memory, because in a long-running process such as Annif running in server mode, loading many dictionaries into memory can take up several GBs of RAM for no good reason.

But it wasn't very easy to make use of this feature, because you cannot use the simple function API anymore (which is the only one documented in the simplemma README!) but instead have to set up custom class instances and also ensure that all access to language dictionaries goes through a single (Default)DictionaryFactory that has the desired max_cache_size setting. I ended up implementing a new annif.simplemma_util module which instantiates a dictionary factory, strategy and lemmatizer, and provides a function for creating language detectors. The rest of the code in Annif will then have to make use of this module instead of calling simplemma functions directly (not necessarily a bad thing).
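To illustrate the idea without depending on simplemma itself, here is a minimal, self-contained sketch of the pattern described above: one shared factory that keeps at most a fixed number of language dictionaries in memory, and a wrapper function that always goes through it. Everything in it (BoundedDictionaryFactory, fake_loader, the module-level lemmatize function) is hypothetical and stands in for the real classes; it is not the simplemma API and not the actual annif.simplemma_util code.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class BoundedDictionaryFactory:
    """Hypothetical stand-in for a dictionary factory with a bounded cache.

    Keeps at most `max_cache_size` dictionaries and evicts the least
    recently used one when a new language has to be loaded.
    """

    def __init__(self, loader: Callable[[str], Dict[str, str]], max_cache_size: int = 5):
        self._loader = loader
        self._max_cache_size = max_cache_size
        self._cache: "OrderedDict[str, Dict[str, str]]" = OrderedDict()

    def get_dictionary(self, lang: str) -> Dict[str, str]:
        if lang in self._cache:
            # Mark as most recently used.
            self._cache.move_to_end(lang)
            return self._cache[lang]
        dictionary = self._loader(lang)
        self._cache[lang] = dictionary
        if len(self._cache) > self._max_cache_size:
            # Drop the least recently used dictionary to free memory.
            self._cache.popitem(last=False)
        return dictionary

    def cached_languages(self) -> List[str]:
        return list(self._cache)


def fake_loader(lang: str) -> Dict[str, str]:
    """Illustrative loader; a real one would read a large dictionary file."""
    return {"cats": "cat"} if lang == "en" else {}


# A single shared factory: all dictionary access must go through it,
# otherwise the cache limit would not be enforced.
factory = BoundedDictionaryFactory(fake_loader, max_cache_size=2)


def lemmatize(token: str, lang: str) -> str:
    """Wrapper the rest of the application calls instead of the library."""
    return factory.get_dictionary(lang).get(token, token)


for lang in ("fi", "sv", "en"):
    lemmatize("cats", lang)
print(factory.cached_languages())
```

The point of the single module-level factory is that the limit only holds if no other code path loads dictionaries on its own, which is why the calling code has to be routed through one wrapper module.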

I guess this is simply the price you have to pay for advanced usage, but I suppose there could have been another way of controlling the max_cache_size, through some sort of global setting perhaps (e.g. simplemma.set_max_cache_size(5) before calling any simplemma functions). Global state like that is of course ugly and should generally be avoided.

Verdict

From my perspective, there were no show-stoppers for making a new release. It would be nice to have documentation for advanced usage of the classes, as I now had to figure out how to use DictionaryFactory, DefaultStrategy, Lemmatizer and LanguageDetector by examining the simplemma source code. But that is not a reason for delaying a release IMHO. If it helps, I could also provide a PR that adds a small "advanced usage" example into the simplemma top level README based on what I learned when implementing the annif.simplemma_util wrapper module.

osma avatar Aug 10 '23 14:08 osma

Hi @osma, thank you very much for taking the time to review the changes! Nice to read that everything works as expected, it was hard work.

We can talk later with @juanjoDiaz about the global setting. Something like set_max_cache_size makes sense, but I'm not sure the purpose of classes and customization was to add further configuration layers to the existing functions...

Yes, it would be great if you add advanced examples to the current README, I believe PR #109 describes the classes and functions but it doesn't provide concrete examples on how to use them.

adbar avatar Aug 10 '23 16:08 adbar

Yes, it would be great if you add advanced examples to the current README, I believe PR https://github.com/adbar/simplemma/pull/109 describes the classes and functions but it doesn't provide concrete examples on how to use them.

I opened PR #113 which adds a section like this to the README.

osma avatar Aug 11 '23 07:08 osma

FWIW, I was unsure about the benchmark results reported above, because I wasn't very rigorous in setting up the test on my laptop. So I repeated the test on an otherwise idle server. The results were similar - the main branch is clearly faster than 0.9.1, but the difference was much smaller this time, around 4% in overall execution time (which, as noted, mostly consists of other kinds of processing than lemmatization). So Simplemma has become faster, but not as dramatically as I first thought.

osma avatar Aug 11 '23 09:08 osma

Thank you for the PR!

Concerning the benchmark I can confirm that there could be a small improvement but nothing much as the main logic hasn't really changed. If you use Python 3.11 you'll also see an improvement in execution speed, simply because this Python version is faster.

adbar avatar Aug 11 '23 11:08 adbar

I used the same Python version, and all the other packages were the same as well. But now I realized that I had forgotten to check one aspect: the version of the subject vocabulary used for the experiment. It turns out I had used different versions of YSO for the two tests. I re-ran the benchmark using the same version of the vocabulary, and even the 4% difference disappeared - instead there was a very slight increase, but less than 1%, so nothing to worry about.

osma avatar Aug 11 '23 11:08 osma

I am now seriously planning the next release before the summer, or even earlier if everything goes well.

The remaining issues will (hopefully) be addressed further along the way. From a functional point of view, nothing stands in the way anymore once the issues labeled with v1.0 as milestone are cleared.

adbar avatar May 22 '24 14:05 adbar