Andreas van Cranenburgh

Results: 23 comments by Andreas van Cranenburgh

Not yet, looks interesting. Are such head rules available for English/PTB as well?

I wrote a conversion script from XML to CoNLL 2012: https://gist.github.com/andreasvc/6bf9e10b2e6956ce32fb777e7efe99cb

Would there be interest in including [syntok](https://github.com/fnl/syntok/) as the default sentence splitter and word tokenizer? It is pure Python.

I saw it; it sounds more complicated, requiring big data files etc. syntok is a simple, self-contained multilingual regex sentence splitter and tokenizer, which keeps track of the original string...
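
To illustrate what a regex tokenizer that "keeps track of the original string" could look like, here is a minimal sketch using only the Python standard library. It is not syntok's implementation or API; the names `Token`, `tokenize`, and `split_sentences` are hypothetical, and the sentence-boundary rule (split after `.`, `!`, or `?`) is a deliberately naive assumption that will mis-split abbreviations such as "e.g.".

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical token type: the text plus its position in the input."""
    value: str   # the token text
    offset: int  # character offset of the token in the original string


# A word (optionally with internal apostrophes or hyphens), or any single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_RE = re.compile(r"\w+(?:['\-]\w+)*|[^\w\s]")

# Naive assumption: these tokens always end a sentence.
SENTENCE_END = {".", "!", "?"}


def tokenize(text: str) -> List[Token]:
    """Split text into tokens, recording each token's start offset."""
    return [Token(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]


def split_sentences(text: str) -> List[List[Token]]:
    """Group tokens into sentences, closing one after each end mark."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for token in tokenize(text):
        current.append(token)
        if token.value in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final end mark
        sentences.append(current)
    return sentences
```

Because every token carries its offset, the original string can be sliced to recover the exact source span, for example `text[tok.offset:tok.offset + len(tok.value)]`, which is what makes this kind of tokenizer usable for annotation tasks that must map back to the raw text.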

@goodmami that sounds good. Is it multilingual? You can get _decent_ tokenization from a language-independent tokenizer, but good results require some language-specific rules/data. And from your assessment it does sound...

My aim with proposing to incorporate syntok is to have a simple default tokenizer/splitter which is (a little) better than the current anglocentric (TreebankTokenizer) or unsupervised (PunktSentenceTokenizer) default. The question...

I just noticed the issue title says "for English", so talking about language independence and multilinguality is a bit off-topic...

Figured it out, easier than expected: https://github.com/andreasvc/sdsl-lite/commit/cd04d2f32c86e44d7dcd5bd4ee348cc9e9435cd0
Still wondering if it should be the default though.

Indeed, that's what I did in https://github.com/andreasvc/sdsl-lite/commit/cd04d2f32c86e44d7dcd5bd4ee348cc9e9435cd0
Do you want me to make a pull request, or is a static build preferable as the default?

This one just bit me again, v.1.15.0-beta. I've been waiting for 15 minutes ... is there still hope to get my unsaved work back? I searched for a single letter,...