Add README section on advanced usage via classes
As discussed in https://github.com/adbar/simplemma/issues/110#issuecomment-1673306133, this PR adds a section to the top-level README with examples of advanced usage via the Simplemma classes. I used the cache-limiting use case I had in mind as the example, but I tried to explain it as a pattern that can also be applied to other customization requirements. Any comments are welcome; I'm happy to adjust as necessary.
While working on this, I discovered some problems that I reported as separate issues #111 and #112.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 81.12%. Comparing base (fa1d964) to head (5c5a34c). Report is 5 commits behind head on main.
Additional details and impacted files
| Coverage Diff | main | #113 | +/- |
|---|---|---|---|
| Coverage | 96.62% | 81.12% | -15.50% |
| Files | 33 | 35 | +2 |
| Lines | 651 | 779 | +128 |
| Hits | 629 | 632 | +3 |
| Misses | 22 | 147 | +125 |
One aspect of the class API I'm wondering about is the difference between Lemmatizer and LanguageDetector. Lemmatizer is language-agnostic: the same instance can be used with many languages or combinations of languages, just by passing a different lang argument to the lemmatize() method. LanguageDetector, by contrast, is given a language (or tuple of languages) when it is constructed, so the same instance cannot be reused if you happen to need a different set of languages.
Neither way is wrong, but it seems like these could perhaps be harmonized, either by making Lemmatizer language-specific or by making LanguageDetector language-agnostic. Maybe @juanjoDiaz can explain the current situation and say whether it makes sense to keep it as it is or to try to unify these.
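To make the difference concrete, here is a minimal sketch using toy stand-in classes. `ToyLemmatizer`, `ToyLanguageDetector` and `TOY_LEMMAS` are hypothetical names invented for illustration and are not Simplemma's actual implementation; only the shape of the two interfaces (language passed per call versus language fixed at construction) is meant to mirror what is described above.

```python
from typing import Dict, Tuple

# Hypothetical toy dictionaries, for illustration only.
TOY_LEMMAS: Dict[str, Dict[str, str]] = {
    "en": {"cats": "cat", "houses": "house"},
    "de": {"katzen": "katze", "häuser": "haus"},
}


class ToyLemmatizer:
    """Language-agnostic: the language is chosen on every call."""

    def lemmatize(self, token: str, lang: str) -> str:
        # Fall back to the token itself when it is not in the dictionary.
        return TOY_LEMMAS[lang].get(token.lower(), token)


class ToyLanguageDetector:
    """Language-specific: the candidate languages are fixed at construction."""

    def __init__(self, lang: Tuple[str, ...]) -> None:
        self._lang = lang

    def main_language(self, text: str) -> str:
        # Score each candidate language by how many tokens it recognizes.
        tokens = text.lower().split()
        scores = {
            code: sum(token in TOY_LEMMAS[code] for token in tokens)
            for code in self._lang
        }
        return max(scores, key=lambda code: scores[code])


# One lemmatizer instance serves several languages.
lemmatizer = ToyLemmatizer()
english = lemmatizer.lemmatize("cats", lang="en")
german = lemmatizer.lemmatize("Katzen", lang="de")

# A detector instance is tied to the languages it was built with;
# a different language set would need a new instance.
detector_en_de = ToyLanguageDetector(("en", "de"))
detected = detector_en_de.main_language("Katzen Häuser")
```

In this sketch, harmonizing the two would mean either moving `lang` into `ToyLemmatizer.__init__` or turning it into a per-call argument of `ToyLanguageDetector.main_language`.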
Thanks for the added docs, and good point above. You could actually open an issue regarding the harmonization of Lemmatizer and LanguageDetector. It's not a priority right now though, so we can add a corresponding sentence to the docs if you feel users might fail to understand the current difference.