zemberek-nlp

Show modifications made by TurkishSentenceNormalizer

Open mrmutator opened this issue 5 years ago • 5 comments

Hi,

The TurkishSentenceNormalizer.normalize(String string) method takes a string and returns the normalized string as a result. For my purposes, I run the tokenizer on the normalized string, but I need to know the original substring of each token from before the normalization. So it would be good if the normalize() method could, for example, return a mapping from each character of the normalized string to its substring in the original string.

For example:

"tbrklr dimi" is normalized and then tokenized into [tebrikler], [değil], [mi], so it would be good to know that the first token originates from the substring tbrklr, the second from the substring dimi, and the third also from the substring dimi (since a normalization step splits the word dimi into two tokens).
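To make the request concrete, here is a minimal sketch of the kind of result I have in mind. The class names AlignedToken and AlignmentExample are hypothetical and not part of zemberek-nlp, and the offsets are written by hand for this one example rather than computed by the normalizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder: a normalized token plus the [start, end) range it came from in the input. */
final class AlignedToken {
  final String token;
  final int start; // inclusive offset into the original string
  final int end;   // exclusive offset into the original string

  AlignedToken(String token, int start, int end) {
    this.token = token;
    this.start = start;
    this.end = end;
  }

  /** Returns the original substring this token was derived from. */
  String source(String original) {
    return original.substring(start, end);
  }
}

final class AlignmentExample {
  /** Hand-written alignment for the input "tbrklr dimi" (illustration only). */
  static List<AlignedToken> example() {
    List<AlignedToken> out = new ArrayList<>();
    out.add(new AlignedToken("tebrikler", 0, 6)); // from "tbrklr"
    out.add(new AlignedToken("değil", 7, 11));    // from "dimi"
    out.add(new AlignedToken("mi", 7, 11));       // also from "dimi"
    return out;
  }

  public static void main(String[] args) {
    String original = "tbrklr dimi";
    for (AlignedToken t : example()) {
      System.out.println(t.token + " <- " + t.source(original));
    }
  }
}
```

Note that two tokens may share the same source range, as [değil] and [mi] do here.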

mrmutator May 08 '19 11:05

This functionality does not exist yet. Implementing this may not be trivial but I will see what I can do.

ahmetaa May 14 '19 11:05

I will try to provide a pull request for this soon.

mrmutator May 16 '19 13:05

I tried to implement this in PR #224. Please have a look.

mrmutator Jun 11 '19 14:06

Thanks, I will have a look soon.

mdakin Jun 13 '19 08:06

@mrmutator I have a few questions:

  1. Could you add some unit tests so that the different use cases are easily visible? (It is always good to have tests anyway.)
  2. This implementation creates a pair of ints (a range) for each character in the output. I presume many of these ranges would repeat; e.g. in your example, all characters in [tebrikler] would point to the same range. So maybe the mapping should be per token instead of per character? Or maybe some kind of disjoint-set structure would help?
  3. Could you pass your code through a formatter? We use Google format (explained here: https://github.com/ahmetaa/zemberek-nlp/wiki/Zemberek-For-Developers#changing-code-style).
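To illustrate point 2, here is a rough sketch of one way to collapse repeated per-character ranges into runs. It assumes the per-character mapping is an int[][] of {srcStart, srcEnd} pairs, which may not match what the PR actually does; RangeRun and RangeCompressor are hypothetical names, not existing zemberek classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical run: output characters [outStart, outEnd) all map to source range [srcStart, srcEnd). */
final class RangeRun {
  final int outStart;
  final int outEnd;
  final int srcStart;
  final int srcEnd;

  RangeRun(int outStart, int outEnd, int srcStart, int srcEnd) {
    this.outStart = outStart;
    this.outEnd = outEnd;
    this.srcStart = srcStart;
    this.srcEnd = srcEnd;
  }
}

final class RangeCompressor {
  /**
   * Collapses consecutive output characters that point to the same source range.
   * perChar[i] is assumed to be {srcStart, srcEnd} for output character i.
   */
  static List<RangeRun> compress(int[][] perChar) {
    List<RangeRun> runs = new ArrayList<>();
    int i = 0;
    while (i < perChar.length) {
      int j = i + 1;
      // Extend the run while the next character maps to the identical range.
      while (j < perChar.length
          && perChar[j][0] == perChar[i][0]
          && perChar[j][1] == perChar[i][1]) {
        j++;
      }
      runs.add(new RangeRun(i, j, perChar[i][0], perChar[i][1]));
      i = j;
    }
    return runs;
  }
}
```

For the "tbrklr dimi" example, if the separating space maps to its own range, the 18 per-character pairs for "tebrikler değil mi" would shrink to three runs. The output boundaries of those runs need not line up with token boundaries, so a per-token mapping would still be a separate design choice.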

mdakin Jun 14 '19 11:06