zemberek-nlp
Show modifications made by TurkishSentenceNormalizer
Hi,
The TurkishSentenceNormalizer.normalize(String string) method takes a string and returns the normalized string. For my purposes, I run the tokenizer on the normalized string, but I need to know which substring of the original input each token came from. It would help if normalize() could also return, for example, a mapping from each character of the normalized string to its source substring in the original string.
For example:
tbrklr dimi
is normalized and then tokenized into [tebrikler], [değil], [mi]. It would be good to know that the first token originates from the substring tbrklr, the second from the substring dimi, and the third also from the substring dimi (since there is a normalization step that splits the word dimi into two tokens).
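To make the request concrete, here is a minimal sketch of what such a result could look like for the example above. The class TokenOrigin and its methods are hypothetical names used only for illustration, they are not part of Zemberek, and the offsets are written by hand rather than produced by the normalizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Zemberek: pairs a normalized token with
// the half-open [start, end) span of the original input it came from.
final class TokenOrigin {
  final String token;
  final int start; // inclusive offset into the original string
  final int end; // exclusive offset into the original string

  TokenOrigin(String token, int start, int end) {
    this.token = token;
    this.start = start;
    this.end = end;
  }

  // Returns the original substring this token was derived from.
  String originalText(String original) {
    return original.substring(start, end);
  }

  // Hand-built alignment for "tbrklr dimi"; a real implementation would
  // have to compute these spans inside the normalizer.
  static List<TokenOrigin> example() {
    List<TokenOrigin> result = new ArrayList<>();
    result.add(new TokenOrigin("tebrikler", 0, 6));
    result.add(new TokenOrigin("değil", 7, 11));
    result.add(new TokenOrigin("mi", 7, 11)); // same source span as "değil"
    return result;
  }

  public static void main(String[] args) {
    String original = "tbrklr dimi";
    for (TokenOrigin t : example()) {
      System.out.println(t.token + " <- " + t.originalText(original));
    }
  }
}
```

Two tokens are allowed to share one source span here, which covers the case where normalization splits a single word into several tokens.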
This functionality does not exist yet. Implementing this may not be trivial but I will see what I can do.
I will try to provide a pull request for this soon.
I tried to implement this in PR #224. Please have a look.
Thanks, I will have a look soon.
@mrmutator I have a couple of questions:
- Could you add some unit tests so that the different use cases are easily visible? (It is always good to have tests anyway.)
- This implementation creates a pair of ints (a range) for each character in the output. I presume these ranges would repeat a lot; e.g. in your example, all characters in [tebrikler] would point to the same range. So maybe the mapping should be per token instead of per character? Or maybe some kind of disjoint-set structure would help?
- Could you pass your code through a formatter? We use the Google format (explained here: https://github.com/ahmetaa/zemberek-nlp/wiki/Zemberek-For-Developers#changing-code-style).
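As a rough illustration of the second question, the repeated per-character ranges could be collapsed into runs. The sketch below uses plain run-length grouping, which is a simpler stand-in for the disjoint-set idea rather than an implementation of it. RangeCompressor, Segment and the per-character layout are assumptions made for this example and do not reflect the actual code in the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Zemberek code: collapses per-character ranges
// into one segment per run of identical ranges.
final class RangeCompressor {

  // A run of normalized characters [normStart, normEnd) that all map to
  // the same original span [origStart, origEnd).
  static final class Segment {
    final int normStart;
    final int normEnd;
    final int origStart;
    final int origEnd;

    Segment(int normStart, int normEnd, int origStart, int origEnd) {
      this.normStart = normStart;
      this.normEnd = normEnd;
      this.origStart = origStart;
      this.origEnd = origEnd;
    }
  }

  // perChar[i] = {origStart, origEnd} for the i-th normalized character.
  static List<Segment> compress(int[][] perChar) {
    List<Segment> segments = new ArrayList<>();
    int runStart = 0;
    for (int i = 1; i <= perChar.length; i++) {
      boolean runEnds =
          i == perChar.length
              || perChar[i][0] != perChar[runStart][0]
              || perChar[i][1] != perChar[runStart][1];
      if (runEnds) {
        segments.add(new Segment(runStart, i, perChar[runStart][0], perChar[runStart][1]));
        runStart = i;
      }
    }
    return segments;
  }

  // Assumed per-character ranges for "tebrikler değil mi" (18 characters)
  // normalized from "tbrklr dimi".
  static int[][] examplePerChar() {
    int[][] perChar = new int[18][];
    for (int i = 0; i < 9; i++) {
      perChar[i] = new int[] {0, 6}; // "tebrikler" <- "tbrklr"
    }
    perChar[9] = new int[] {6, 7}; // the space between the words
    for (int i = 10; i < 18; i++) {
      perChar[i] = new int[] {7, 11}; // "değil mi" <- "dimi"
    }
    return perChar;
  }

  public static void main(String[] args) {
    for (Segment s : compress(examplePerChar())) {
      System.out.println(
          "normalized [" + s.normStart + "," + s.normEnd + ") <- original ["
              + s.origStart + "," + s.origEnd + ")");
    }
  }
}
```

Under these assumptions, the 18 per-character pairs shrink to 3 segments. Because the segments are sorted by normStart, a lookup by normalized offset could use binary search.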