Adrien Barbaresi
Performance degrades in German for non-greedy searches, and the affixes don't improve the greedy mode. They also slow things down a bit. So I would be against it...
I see a difference when I add `"de"` to `AFFIX_LANGS` and lemmatize with `greedy=False`; the rest is the same. I checked again: your examples rather hint at a failing decomposition...
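For context, a minimal sketch of how the two modes can be compared side by side (assuming a recent simplemma release where `lemmatize()` takes `lang` and `greedy` keyword arguments; the tokens below are placeholders, not your examples):

```python
# Minimal comparison of non-greedy vs. greedy lemmatization in German.
from simplemma import lemmatize

for token in ("Katzen", "Bundesländer"):  # placeholder tokens
    print(token,
          lemmatize(token, lang="de", greedy=False),
          lemmatize(token, lang="de", greedy=True))
```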
My best guess is that it's simply because the decomposition strategy is missing; there can be several ways to break words apart. That being said, it could be worth it...
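To illustrate the ambiguity, a hypothetical sketch (this is not simplemma's actual decomposition code; it only assumes the public `is_known()` helper from recent releases):

```python
# Hypothetical illustration: splitting a compound at every position can
# yield several candidate analyses, hence the need for a strategy to pick
# one. NOT simplemma's internal logic; is_known() checks dictionary
# membership in recent simplemma versions.
from simplemma import is_known

def candidate_splits(word, lang="de", min_len=3):
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        # German noun heads are capitalized, so normalize the tail too
        if is_known(head, lang=lang) and is_known(tail.capitalize(), lang=lang):
            yield head, tail.capitalize()

print(list(candidate_splits("Haustür")))  # e.g. [("Haus", "Tür")]
```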
Because it degrades performance on the benchmark, at least in my experiments. Currently the script only evaluates accuracy; it could look different with an F-score, but I'm not sure. In...
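A sketch of how the two metrics could diverge on a toy gold list, under the (debatable) reading that returning the token unchanged counts as a non-answer; the gold pairs are made up, not benchmark data:

```python
# Toy evaluation: accuracy vs. an F-score where an unchanged token is
# treated as "no answer". Placeholder gold pairs only.
from simplemma import lemmatize

gold = [("Katzen", "Katze"), ("Häuser", "Haus"), ("läuft", "laufen")]

correct = answered = answered_correct = 0
for token, lemma in gold:
    pred = lemmatize(token, lang="de", greedy=False)
    correct += int(pred == lemma)
    if pred != token:  # the lemmatizer actually changed something
        answered += 1
        answered_correct += int(pred == lemma)

accuracy = correct / len(gold)
precision = answered_correct / answered if answered else 0.0
recall = answered_correct / len(gold)
f_score = (2 * precision * recall / (precision + recall)
           if precision + recall else 0.0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f-score={f_score:.2f}")
```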
Yes, evaluating performance on rare words is an issue. We can implement the additional strategy and let the users decide if they want to use it.
Yes, it's indeed not possible in the current state of the package.
Thanks for your feedback. This is tricky because the "updated" pattern is only present in the free text, nowhere in the markup.
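A hypothetical minimal case (assuming this is about htmldate's `find_date()`; the HTML is invented to show the situation):

```python
# The date only exists as free text, with no date-bearing attribute or
# element, so markup-based patterns have nothing to match. Invented example.
from htmldate import find_date

html_doc = ('<html><body><div class="news-detail">'
            '<p>Updated on 12 January 2023 by the editorial team.</p>'
            '</div></body></html>')
print(find_date(html_doc, extensive_search=True))
```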
Something has to be added to the extractors, otherwise the `div` element will not be processed (e.g. a class containing "news" or "detail").
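A hedged sketch of the kind of selector that would need to be added to the extractors' target list (the class names are assumptions drawn from this example, not from the actual codebase):

```python
# Only divs whose class hints at article content would then be inspected.
from lxml import html

tree = html.fromstring('<html><body><div class="news-detail">'
                       '<p>Updated on 12 January 2023</p></div></body></html>')
candidates = tree.xpath('//div[contains(@class, "news") or contains(@class, "detail")]')
print([elem.text_content() for elem in candidates])
```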
I meant that someone needs to add a precise XPath target; using `//div[contains(text())]` or simply `//div//text()` would be bad for accuracy, because random dates in a text are often irrelevant...
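For illustration, a blanket text query sweeps up every date-like string, while a precise target does not (invented HTML):

```python
# Contrast: //div//text() also returns noise from comments, whereas a
# precise class-based target only returns the relevant metadata line.
from lxml import html

doc = html.fromstring(
    '<html><body>'
    '<div class="post-meta">Updated on 12 January 2023</div>'
    '<div class="comment">Great read, reminds me of 3 May 1998!</div>'
    '</body></html>'
)
print(doc.xpath('//div//text()'))                      # both strings, noise included
print(doc.xpath('//div[@class="post-meta"]//text()'))  # only the relevant one
```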
I just edited your comment to replace the URL with the raw data, but I still cannot reproduce the bug with XML output. Are you using any particular options?
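A minimal sketch of a reproduction attempt (assuming trafilatura's `extract()` is the function in question, with default settings otherwise):

```python
# Reproduction attempt with XML output and otherwise default options.
import trafilatura

raw_html = "..."  # placeholder for the raw data from the edited comment
print(trafilatura.extract(raw_html, output_format="xml"))
```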