Daniël de Kok comments

Results: 135 comments by Daniël de Kok

`Spacy` has inconsistency when dividing sentences

One issue you might be running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines: https://spacy.io/api/dependencyparser#assigned-attributes

If you have your...

`Spacy` has inconsistency when dividing sentences

> I got a bit different, but similar issue

This is a different question. Could you open a topic on the discussion forum?

`Spacy` has inconsistency when dividing sentences

> Thank you for your response on the issue. I tried your suggestion and moved the custom segmentation function after the parser in the nlp.pipeline. But I am facing an...

Sentence-terminal periods not tokenized properly in Malayalam text

Thank you for reporting this issue. The tokenizer for Malayalam is currently incomplete, which causes these tokenization issues. A PR with improvements is welcome!

NER: Ensure zero-cost sequence with sentence split in entity

One thing that I am not very sure of: maybe in case 1, `U` should also be a zero-cost transition?

General organisation

> Thanks for creating this repo @LukeMathWalker and for opening these issues @danieldk ! Should we comment there (to keep a history), and then edit the issue description with a...

Existing work: Dependency parsing

I have developed two dependency parsers:

* [dpar](https://github.com/danieldk/dpar) is a transition-based dependency parser that uses a Chen & Manning-like feed-forward neural network. It's robust (we used it to annotate...

Existing work: Dependency parsing

* sticker updates: sticker supports transformers, pretrained models are available for German/Dutch. Has switched to maintenance mode.
* [sticker2](https://github.com/stickeritis/sticker2): we started sticker2 as a successor to sticker:
  * Uses libtorch...

Existing work: Text normalization

In `conllx-utils` we have a utility (`conllx-cleanup`) that first normalizes Unicode and then rewrites some non-ASCII Unicode punctuation marks to ASCII:

https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs

https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs

This helps particularly if the training corpora...
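The two-step cleanup described in this comment (normalize first, then map punctuation to ASCII) can be illustrated with a minimal Python sketch. This is not the `conllx-cleanup` implementation, which is written in Rust: the function name `cleanup`, the choice of NFC as the default normalization form, and the contents of the `PUNCT_TO_ASCII` table are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical mapping for illustration only; the table used by
# conllx-cleanup may contain different or additional characters.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def cleanup(text: str, form: str = "NFC") -> str:
    """Normalize Unicode, then rewrite selected punctuation to ASCII.

    The normalization form is an assumption; NFC is used here so that
    combining sequences are composed before the punctuation mapping runs.
    """
    normalized = unicodedata.normalize(form, text)
    return normalized.translate(PUNCT_TO_ASCII)


if __name__ == "__main__":
    print(cleanup("\u201eHallo\u201c \u2013 sagte er\u2026"))
```

Normalizing before the mapping means the table only has to list one encoding of each character, and letters outside ASCII (such as `é`) are left untouched because only punctuation is listed.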

Support of custom features

Excellent question! This is currently not possible and would be hard to add without breaking compatibility with existing models. We are currently working towards a stable 1.0 version, which should...