Daniël de Kok
One issue you might be running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines: https://spacy.io/api/dependencyparser#assigned-attributes If you have your...
> I got a bit different, but similar issue This is a different question. Could you open a topic on the discussion forum?
> Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the parser in the nlp.pipeline. But I am facing an...
Thank you for reporting this issue. The tokenizer for Malayalam is currently incomplete, which causes these tokenization issues. A PR with improvements is welcome!
One thing that I am not very sure of: maybe in case 1, `U` should also be a zero-cost transition?
> Thanks for creating this repo @LukeMathWalker and for opening these issues @danieldk ! Should we comment there (to keep a history), and then edit the issue description with a...
I have developed two dependency parsers:

* [dpar](https://github.com/danieldk/dpar) is a transition-based dependency parser that uses a Chen & Manning-like feed-forward neural network. It's robust (we used it to annotate...

* sticker updates: sticker supports transformers, pretrained models are available for German/Dutch. Has switched to maintenance mode.
* [sticker2](https://github.com/stickeritis/sticker2): we started sticker2 as a successor to sticker:
  * Uses libtorch...
In `conllx-utils` we have a utility (`conllx-cleanup`) that first normalizes Unicode and then rewrites some non-ASCII Unicode punctuation marks to ASCII: https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs This helps particularly if the training corpora...
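To illustrate the two steps described for `conllx-cleanup` (normalize, then map punctuation to ASCII), here is a minimal Python sketch. It is not the Rust code from `conllx-utils`: the function name `cleanup_form`, the choice of NFC as the normalization form, and the contents of the mapping table are all assumptions made for the example, and the real utility may normalize differently and cover a different set of characters.

```python
import unicodedata

# Hypothetical mapping for illustration; the table used by conllx-cleanup
# may contain different characters and replacements.
PUNCT_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2013": "-",    # en dash
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(PUNCT_TO_ASCII)


def cleanup_form(form: str, normalization: str = "NFC") -> str:
    """Normalize a token, then rewrite selected punctuation to ASCII.

    The normalization form is an assumption; NFC composes base characters
    and combining marks but leaves the punctuation above untouched, so the
    mapping step is what rewrites it.
    """
    normalized = unicodedata.normalize(normalization, form)
    return normalized.translate(_TABLE)


if __name__ == "__main__":
    print(cleanup_form("\u201eHallo\u201c"))
    print(cleanup_form("Seite 3\u20135\u2026"))
```

Applying the same function to every token of both the training corpus and the input at prediction time keeps the two consistent, so the model never sees a quote or dash variant at prediction time that it did not see during training.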
Excellent question! This is currently not possible and would be hard to add without breaking compatibility with existing models. We are currently working towards a stable 1.0 version, which should...