
False Positives with URLs

Open max-otto opened this issue 4 years ago • 2 comments

I just wanted to make you aware that framework names such as 'VB.NET' or 'ASP.Net' are classified as URLs after tokenization and are thus not split (which is probably good). The same happens with some abbreviations such as 'L/S/R' and SAP versions such as 'R/3'. Unfortunately, this can't be prevented by adding them to 'single_token_abbreviations_de.txt', since abbreviations are checked after URLs. ('R/3' is even included in 'single_token_abbreviations_de.txt'.)

max-otto avatar Sep 01 '20 14:09 max-otto

The two names vb.net and asp.net are indeed working URLs (though only one is registered by Microsoft). While they are probably used much more frequently as proper names, recognizing them as URLs is technically correct. In either case, they should not be split.

L/S/R and R/3 puzzled me at first. The explanation is that they are recognized as Reddit links. Reddit links take the form "/r/subreddit" or "/u/user". The leading slash is often omitted and the German Reddit community also uses "l" instead of "r".

If the token's class (URL vs. abbreviation) is important for your use case, you could either try to correct this in a postprocessing step or, in the case of Reddit links, try to get rid of reddit_links. Reddit links should only rarely occur outside Reddit posts, so a very quick'n'dirty hack would be:

import re
from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
tokenizer._tokenizer.reddit_links = re.compile(r"\s{10}")

By the time the regex for reddit_links is applied, there are only single spaces in the text, so the modified regex, which requires ten consecutive whitespace characters, will never match.

Of course, a cleaner solution would be to either have an option for enabling/disabling the recognition of Reddit links or, even better, to have an option for user-specified special cases that are processed relatively early.
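
The postprocessing route mentioned above could look roughly like the following sketch. It is only an illustration: the Token dataclass is a stand-in with the two attributes used here (text and token_class), KNOWN_ABBREVIATIONS and reclassify are hypothetical names, and the class labels "URL" and "abbreviation" are assumed to match what the tokenizer emits.

```python
from dataclasses import dataclass


# Stand-in for the tokenizer's token objects (assumption: only these two
# attributes are needed for the correction step).
@dataclass
class Token:
    text: str
    token_class: str


# Hypothetical, user-maintained set of strings that should count as abbreviations.
KNOWN_ABBREVIATIONS = {"L/S/R", "R/3"}


def reclassify(tokens, known=KNOWN_ABBREVIATIONS):
    """Relabel tokens tagged as URLs whose text is a known abbreviation."""
    for token in tokens:
        if token.token_class == "URL" and token.text in known:
            token.token_class = "abbreviation"
    return tokens


tokens = [Token("SAP", "regular"), Token("R/3", "URL"), Token("vb.net", "URL")]
for token in reclassify(tokens):
    print(token.text, token.token_class)
```

Tokens such as vb.net keep their URL class, since they are not in the hypothetical abbreviation set.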

tsproisl avatar Sep 09 '20 10:09 tsproisl

You are obviously right concerning the first two. You might consider changing the regex so that it no longer matches 'r/l' or 'l/r' literally, because in a technical context this often means "rechts/links" or "links/rechts" (right/left, left/right). But I don't know how this would be handled in a competitive scenario.
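
To illustrate the idea, here is a minimal sketch using a simplified, hypothetical Reddit-link pattern. It is not SoMaJo's actual regex; it only shows how a negative lookahead could exclude the literal 'r/l' and 'l/r' cases.

```python
import re

# Hypothetical, simplified Reddit-link pattern; NOT SoMaJo's actual regex.
# Optional leading slash, then r, l or u, a slash, and a name.
# The negative lookahead rejects a name that consists only of "r" or "l",
# so "r/l" and "l/r" are no longer treated as links.
reddit_link = re.compile(r"(?<!\w)/?[rlu]/(?![rl]\b)\w+", re.IGNORECASE)

for candidate in ["r/de", "/u/user", "r/linux", "l/r", "r/l", "R/3"]:
    print(candidate, bool(reddit_link.fullmatch(candidate)))
```

Note that 'R/3' would still match this pattern; the lookahead only addresses the 'r/l' and 'l/r' cases.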

I'm already doing a lot of preprocessing by replacing substrings that I don't want split and reintroducing them afterwards, pretty much like you did in the pre-2.0 versions.
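
A minimal sketch of this kind of replace-and-restore preprocessing, under some assumptions: PROTECTED, protect and restore are hypothetical names, the placeholder format is assumed to be something the tokenizer leaves in one piece, and str.split stands in for the real tokenizer.

```python
# Hypothetical list of strings that must survive tokenization unchanged.
PROTECTED = ["VB.NET", "L/S/R", "R/3"]


def protect(text, protected=PROTECTED):
    """Swap each protected substring for a placeholder; return text and mapping."""
    mapping = {}
    for i, item in enumerate(protected):
        # Purely alphanumeric placeholder; the trailing X keeps e.g.
        # PROTECTED1X from being a substring of PROTECTED10X.
        placeholder = f"PROTECTED{i}X"
        if item in text:
            text = text.replace(item, placeholder)
            mapping[placeholder] = item
    return text, mapping


def restore(tokens, mapping):
    """Put the original substrings back into the token strings."""
    restored = []
    for token in tokens:
        for placeholder, original in mapping.items():
            token = token.replace(placeholder, original)
        restored.append(token)
    return restored


masked, mapping = protect("We use VB.NET and SAP R/3")
tokens = masked.split()  # stand-in for the real tokenizer
print(restore(tokens, mapping))
```

Restoring by substring replacement rather than exact lookup also covers cases where the tokenizer keeps a placeholder attached to neighbouring characters.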

max-otto avatar Sep 17 '20 13:09 max-otto