Tokenizer issues

Support colon-separated duration?

E.g.:

Not enough test coverage

We could use better test coverage. There is an easy way to measure it: the pytest plugin pytest-cov generates coverage reports. ```...

peturorri

Support for citation characters

The tokenizer should support superscripted citation characters. This will also help with GreynirCorrect, which I assume will be heavily used to read student essays and academic papers.
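
As a rough illustration of what this request could mean in practice, the sketch below peels trailing superscript digits off a word and returns them as a citation number. It assumes that citations appear as Unicode superscript digits attached to the end of a word; `split_citation` is a hypothetical helper and not part of the Tokenizer API.

```python
import re

# Unicode superscript digits 0-9 (assumed to be the citation characters).
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
_TO_ASCII = str.maketrans(SUPERSCRIPT_DIGITS, "0123456789")
_TRAILING = re.compile(f"^(.*?)([{SUPERSCRIPT_DIGITS}]+)$")


def split_citation(word):
    """Return (word_without_citation, citation_number or None).

    Hypothetical helper: only trailing superscript digits are treated
    as a citation marker, and only when some text precedes them.
    """
    m = _TRAILING.match(word)
    if m is None or not m.group(1):
        return word, None
    return m.group(1), int(m.group(2).translate(_TO_ASCII))
```

A real implementation would also have to tell citations apart from units such as m², which this sketch would misread as a citation.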

sveinbjornt

pkg_resources is deprecated

abbrev.py:40: DeprecationWarning: pkg_resources is deprecated as an API
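
One common way to silence this warning is to load bundled data files through the standard library's `importlib.resources` (the `files()` API, available from Python 3.9) instead of `pkg_resources`. The sketch below assumes that the flagged line reads a data file shipped with the package, which the issue excerpt does not confirm. `read_package_text` is a hypothetical helper, and the stdlib `email` package stands in for the real package so that the example runs without Tokenizer installed.

```python
from importlib.resources import files


def read_package_text(package, resource, encoding="utf-8"):
    """Read a text resource bundled inside an importable package."""
    return files(package).joinpath(resource).read_text(encoding=encoding)


# Stand-in example: read a file that ships with the stdlib `email` package.
# In Tokenizer, the package and resource names would be the ones that
# abbrev.py actually loads (not shown in the issue excerpt).
source = read_package_text("email", "__init__.py")
```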

sveinbjornt

correct_spaces incorrectly inserts spaces into abbreviations


Using the newest version of Tokenizer, 3.4.2: ``` from tokenizer import correct_spaces >>> correct_spaces('Þarna voru t.d. tveir hundar , m.a. hundurinn hans Jóns .') # Expected output: 'Þarna voru t.d....

atlijas

Fix/colon time correct spaces


Improved handling for colon-separated times and durations in `correct_spaces`. Added tests for this too. Previously it added spaces after all colons, resulting in wrong time formats, e.g. "kl. 9:40" ->...
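
The idea can be sketched with a regular expression that adds a space after a colon unless the colon sits between two digits. This is an illustration of the technique only, not the code from the pull request; `space_after_colon` is a hypothetical name.

```python
import re

# Match a colon that is directly followed by a non-space character,
# except when the colon sits between two digits (9:40, 1:02:33).
_COLON = re.compile(r"(?<!\d):(?=\S)|:(?=[^\s\d])")


def space_after_colon(text):
    """Insert a space after colons, leaving digit:digit sequences intact."""
    return _COLON.sub(": ", text)
```

A production version would need further exceptions (URLs, for instance), which this sketch does not handle.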

gardarjuto

OrderedDict not needed, and question and comment

A. @vthorsteinsson I see you added OrderedDict (and OrderedSet) in late 2019, when Python 3.6 was still around and dict was not yet guaranteed to be ordered. If you only support 3.7 and higher,...
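
For context, a plain `dict` preserves insertion order as a language guarantee from Python 3.7 onward, and `dict.fromkeys` can stand in for a simple insertion-ordered set. The sketch below assumes that the OrderedSet mentioned above is an insertion-ordered set; `ordered_unique` is a hypothetical helper, not a Tokenizer function.

```python
def ordered_unique(items):
    """Drop duplicates while keeping first-seen order (Python 3.7+)."""
    return list(dict.fromkeys(items))


# A plain dict iterates in insertion order.
d = {}
d["b"] = 1
d["a"] = 2
d["c"] = 3
keys = list(d)
```

`collections.OrderedDict` still differs in a few ways, such as order-sensitive equality and `move_to_end`, so a replacement is only safe where those features are not used.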

PallHaraldsson

detokenize and correct_spaces problem with hyphens and En dashes

Using the newest version of Tokenizer, 3.4.5: ``` >>> from tokenizer import split_into_sentences, detokenize, tokenize, correct_spaces # En dash and detokenize >>> sent = 'Hamarinn dugir – og meira en...

atlijas
