
Tokenizer text recovery problem

Open shabie opened this issue 3 years ago • 1 comment

I am trying to recover the original text, but it is not possible since token.original_spelling for a token such as :( does not contain the original number of spaces.

Here is a motivating example:

import somajo
tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
paragraph = ["Angebotener Hersteller/Typ:   (vom Bieter einzutragen)  Im \
              Einheitspreis sind alle erforderlichen \
              Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)

This prints

Angebotener  -->  None
Hersteller  -->  None
/  -->  None
Typ  -->  None
:(  -->  : (
vom  -->  None
Bieter  -->  None
einzutragen  -->  None
)  -->  None
Im  -->  None
Einheitspreis  -->  None
sind  -->  None
alle  -->  None
erforderlichen  -->  None
Schutzmaßnahmen  -->  None
bei  -->  None
Errichtung  -->  None
des  -->  None
Brandschutzes  -->  None
einzukalkulieren  -->  None
.  -->  None

It would be great if this could somehow be resolved. Thanks!

shabie avatar Mar 23 '21 21:03 shabie

It is currently not possible to perfectly reconstruct the input text from the output tokens as SoMaJo will normalize any whitespace to a single space and will discard things like control characters (see also issue #17).
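For illustration (this is not SoMaJo's code, just the general idea), collapsing runs of whitespace is a lossy operation: two different inputs can map to the same normalized form, so the original spacing cannot be recovered afterwards.

```python
import re

# Collapsing runs of whitespace to a single space is lossy:
# two different inputs map to the same normalized string.
def normalize(text):
    return re.sub(r"\s+", " ", text)

print(normalize("Typ:   ("))  # -> "Typ: ("
assert normalize("Typ:   (") == normalize("Typ: (")
```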

How to best proceed from here depends on what you want to achieve. Do you want to be able to perfectly detokenize any text or do you want to address the particular tokenization error in your example, i.e. that colon and paren are erroneously merged into a single token? The former would require a lot more work than the latter.

Detokenization, alternative 1: SoMaJo could try to keep all the information that is necessary to reconstruct the original input. This might be feasible for whitespace. However, being able to do the same thing for some of the nasty characters that SoMaJo removes (control characters, soft hyphen, zero-width space, etc.) would require deeper changes.
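As a rough sketch of what alternative 1 might look like, each token could carry the exact whitespace that followed it in the input. This is a hypothetical structure, not SoMaJo's actual API:

```python
from dataclasses import dataclass

# Hypothetical token that remembers its trailing whitespace; with this
# information, whitespace-lossless detokenization becomes a simple join.
@dataclass
class SpanToken:
    text: str
    trailing_ws: str  # the exact whitespace that followed this token

def detokenize(tokens):
    return "".join(tok.text + tok.trailing_ws for tok in tokens)

tokens = [SpanToken("Typ", ""), SpanToken(":", "   "), SpanToken("(", "")]
print(detokenize(tokens))  # -> "Typ:   ("
```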

Detokenization, alternative 2: You could solve the problem externally. The detokenize function from issue #17 almost solves the problem. It should be easy to capture the remaining differences between the detokenized text and the original input with some string alignment algorithm and to add the additional information to the tokens.
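As a sketch of that alignment step, Python's standard difflib can locate the characters that exist in the original input but are missing from a whitespace-normalized detokenization (the detokenized string is hard-coded here for illustration):

```python
import difflib

original = "Hersteller/Typ:   (vom Bieter einzutragen)"
detokenized = "Hersteller/Typ: (vom Bieter einzutragen)"

# Align the two strings and report the spans that only exist
# in the original input, e.g. the extra spaces after the colon.
matcher = difflib.SequenceMatcher(a=detokenized, b=original, autojunk=False)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op in ("insert", "replace"):
        print(f"position {i1} in detokenized text: lost {original[j1:j2]!r}")
```

The recovered spans could then be attached to the corresponding tokens as additional whitespace information.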

Addressing the tokenization error: Emoticons that contain an erroneous space should be quite rare. If you do not need to recognize them (for example because regular sequences of colon, space and paren are much more frequent in your data), you can try to deactivate that feature of the tokenizer. Unfortunately, there is no API for doing that, but a small hack does the trick: set the regular expression that recognizes emoticons with a space to something that never matches, e.g. r"$^" (end of string followed by beginning of string). Here is how you could do that:

import somajo
import regex as re

tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
tokenizer._tokenizer.space_emoticon = re.compile(r"$^")
paragraph = ["Angebotener Hersteller/Typ:   (vom Bieter einzutragen)  Im \
              Einheitspreis sind alle erforderlichen \
              Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)

And here is the output:

Angebotener  -->  None
Hersteller  -->  None
/  -->  None
Typ  -->  None
:  -->  None
(  -->  None
vom  -->  None
Bieter  -->  None
einzutragen  -->  None
)  -->  None
Im  -->  None
Einheitspreis  -->  None
sind  -->  None
alle  -->  None
erforderlichen  -->  None
Schutzmaßnahmen  -->  None
bei  -->  None
Errichtung  -->  None
des  -->  None
Brandschutzes  -->  None
einzukalkulieren  -->  None
.  -->  None

tsproisl avatar Mar 24 '21 19:03 tsproisl