Tokenizer icon indicating copy to clipboard operation
Tokenizer copied to clipboard

A tokenizer for Icelandic text

Results 8 Tokenizer issues
Sort by recently updated
recently updated
newest added

We could use having better code coverage of tests. There is an easy way to get the coverage: There is a pytest plugin called pytest-cov that generates coverage reports. ```...

The tokenizer should support superscripted citation characters. This will also help with GreynirCorrect, which I assume will be heavily used to read student essays and academic papers.

abbrev.py:40: DeprecationWarning: pkg_resources is deprecated as an API

Using the newest version of Tokenizer, 3.4.2: ``` from tokenizer import correct_spaces >>> correct_spaces('Þarna voru t.d. tveir hundar , m.a. hundurinn hans Jóns .') # Expected output: 'Þarna voru t.d....

Improved handling for colon-separated times and durations in `correct_spaces`. Added tests for this too. Previously it added spaces after all colons resulting in wrong time formats, e.g. "kl. 9:40" ->...

A. @vthorsteinsson I see you added OrderedDict (and OrderedSet) in late 2019, when 3.6 was around without dict then not ordered by default. If you only support 3.7 and higher,...

Using the newest version of Tokenizer, 3.4.5: ``` >>> from tokenizer import split_into_sentences, detokenize, tokenize, correct_spaces # En dash and detokenize >>> sent = 'Hamarinn dugir – og meira en...