Adrien Barbaresi
Performance degrades in German for non-greedy searches, and the affixes don't improve the greedy mode. They also slow things down a bit. So I would be against it...
I see a difference when I add `"de"` to `AFFIX_LANGS` and lemmatize with `greedy=False`; the rest is the same. I checked again: your examples rather hint at a failing decomposition...
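For context, a minimal sketch of how the two modes can be compared side by side (assuming a recent simplemma release where `lemmatize()` takes `lang` and `greedy` keyword arguments; the tokens below are placeholders, not your examples):

```python
# Minimal comparison of non-greedy vs. greedy lemmatization in German.
from simplemma import lemmatize

for token in ("Katzen", "Bundesländer"):  # placeholder tokens
    print(token,
          lemmatize(token, lang="de", greedy=False),
          lemmatize(token, lang="de", greedy=True))
```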
My best guess is that it's simply because the decomposition strategy is missing; there can be several ways to break words apart. That being said, it could be worth it...
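To illustrate the ambiguity, a hypothetical sketch (this is not simplemma's actual decomposition code; it only assumes the public `is_known()` helper from recent releases):

```python
# Hypothetical illustration: splitting a compound at every position can
# yield several candidate analyses, hence the need for a strategy to pick
# one. NOT simplemma's internal logic; is_known() checks dictionary
# membership in recent simplemma versions.
from simplemma import is_known

def candidate_splits(word, lang="de", min_len=3):
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        # German noun heads are capitalized, so normalize the tail too
        if is_known(head, lang=lang) and is_known(tail.capitalize(), lang=lang):
            yield head, tail.capitalize()

print(list(candidate_splits("Haustür")))  # e.g. [("Haus", "Tür")]
```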
Because it degrades performance on the benchmark, at least in my experiments. Currently the script only evaluates accuracy; it could look different with an F-score, but I'm not sure. In...
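A sketch of how the two metrics could diverge on a toy gold list, under the (debatable) reading that returning the token unchanged counts as a non-answer; the gold pairs are made up, not benchmark data:

```python
# Toy evaluation: accuracy vs. an F-score where an unchanged token is
# treated as "no answer". Placeholder gold pairs only.
from simplemma import lemmatize

gold = [("Katzen", "Katze"), ("Häuser", "Haus"), ("läuft", "laufen")]

correct = answered = answered_correct = 0
for token, lemma in gold:
    pred = lemmatize(token, lang="de", greedy=False)
    correct += int(pred == lemma)
    if pred != token:  # the lemmatizer actually changed something
        answered += 1
        answered_correct += int(pred == lemma)

accuracy = correct / len(gold)
precision = answered_correct / answered if answered else 0.0
recall = answered_correct / len(gold)
f_score = (2 * precision * recall / (precision + recall)
           if precision + recall else 0.0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f-score={f_score:.2f}")
```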
Yes, evaluating performance on rare words is an issue. We can implement the additional strategy and let the users decide if they want to use it.
Yes, it's indeed not possible in the current state of the package.
Thanks for your feedback. This is tricky because the "updated" pattern is only present in the free text, nowhere in the markup.
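A hypothetical minimal case (assuming this is about htmldate's `find_date()`; the HTML is invented to show the situation):

```python
# The date only exists as free text, with no date-bearing attribute or
# element, so markup-based patterns have nothing to match. Invented example.
from htmldate import find_date

html_doc = ('<html><body><div class="news-detail">'
            '<p>Updated on 12 January 2023 by the editorial team.</p>'
            '</div></body></html>')
print(find_date(html_doc, extensive_search=True))
```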
Something has to be added to the extractors, otherwise the `div` element will not be processed (e.g. a class containing "news" or "detail").
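A hedged sketch of the kind of selector that would need to be added to the extractors' target list (the class names are assumptions drawn from this example, not from the actual codebase):

```python
# Only divs whose class hints at article content would then be inspected.
from lxml import html

tree = html.fromstring('<html><body><div class="news-detail">'
                       '<p>Updated on 12 January 2023</p></div></body></html>')
candidates = tree.xpath('//div[contains(@class, "news") or contains(@class, "detail")]')
print([elem.text_content() for elem in candidates])
```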
I meant that someone needs to add a precise XPath target; using `//div[contains(text())]` or simply `//div//text()` would be bad for accuracy, because random dates in a text are often irrelevant...
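For illustration, a blanket text query sweeps up every date-like string, while a precise target does not (invented HTML):

```python
# Contrast: //div//text() also returns noise from comments, whereas a
# precise class-based target only returns the relevant metadata line.
from lxml import html

doc = html.fromstring(
    '<html><body>'
    '<div class="post-meta">Updated on 12 January 2023</div>'
    '<div class="comment">Great read, reminds me of 3 May 1998!</div>'
    '</body></html>'
)
print(doc.xpath('//div//text()'))                      # both strings, noise included
print(doc.xpath('//div[@class="post-meta"]//text()'))  # only the relevant one
```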
I just edited your comment to replace the URL with the raw data, but I still cannot reproduce the bug with XML output. Are you using any particular options?
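A minimal sketch of a reproduction attempt (assuming trafilatura's `extract()` is the function in question, with default settings otherwise):

```python
# Reproduction attempt with XML output and otherwise default options.
import trafilatura

raw_html = "..."  # placeholder for the raw data from the edited comment
print(trafilatura.extract(raw_html, output_format="xml"))
```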