SoMaJo
False Positives with URLs
I just wanted to make you aware that framework names such as 'VB.NET' or 'ASP.Net' are classified as URLs by the tokenizer and are therefore not split (which is probably good). The same happens with some abbreviations such as 'L/S/R' and with SAP versions such as R/3. Unfortunately, this can't be prevented by adding them to 'single_token_abbreviations_de.txt', since that list is checked after URLs (R/3 is even already included in 'single_token_abbreviations_de.txt').
The two names vb.net and asp.net are indeed working URLs (though only one is registered by Microsoft). While they are probably used much more frequently as proper names, recognizing them as URLs is technically correct. In either case, they should not be split.
L/S/R and R/3 puzzled me at first. The explanation is that they are recognized as Reddit links. Reddit links take the form "/r/subreddit" or "/u/user". The leading slash is often omitted and the German Reddit community also uses "l" instead of "r".
If the token's class (URL vs. abbreviation) is important for your use case, you could either try to correct this in a postprocessing step or, in the case of Reddit links, try to get rid of reddit_links. Reddit links should only rarely occur outside Reddit posts, so a very quick'n'dirty hack would be:
import re
from somajo import SoMaJo
tokenizer = SoMaJo("de_CMC")
tokenizer._tokenizer.reddit_links = re.compile(r"\s{10}")
When the regex for reddit_links is applied, the text contains only single spaces, so the modified regex (which requires ten consecutive whitespace characters) will never match.
Of course, a cleaner solution would be to either have an option for enabling/disabling the recognition of Reddit links or, even better, to have an option for user-specified special cases that are processed relatively early.
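For the postprocessing route, something along these lines might be enough. This is only a rough sketch: the Tok class is a hypothetical stand-in for a token object with text and token_class attributes, and the class names "URL" and "abbreviation" as well as the KNOWN_NON_URLS set are assumptions for illustration, so adjust them to whatever your tokenizer actually emits.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a token with a text and a class label."""
    text: str
    token_class: str


# Assumed list of strings that should not be labelled as URLs.
KNOWN_NON_URLS = {"vb.net", "asp.net", "l/s/r", "r/3"}


def relabel(tokens, known=KNOWN_NON_URLS, new_class="abbreviation"):
    """Change the class of known false-positive URLs; leave other tokens alone."""
    for tok in tokens:
        if tok.token_class == "URL" and tok.text.lower() in known:
            tok.token_class = new_class
    return tokens


tokens = [Tok("R/3", "URL"), Tok("example.com", "URL"), Tok("VB.NET", "URL")]
relabelled = relabel(tokens)
```

The lookup is case-insensitive, so 'VB.NET' and 'vb.net' are treated alike, while genuine URLs that are not in the set keep their class.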
You are obviously right concerning the first two. You might consider changing the regex so that it no longer matches a literal 'r/l' or 'l/r', because in a technical context this often means "rechts/links" or "links/rechts" (right/left, left/right). But I don't know how this would be handled in a competitive scenario.
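To illustrate what I mean, here is a sketch with a simplified stand-in pattern. It is not SoMaJo's actual reddit_links regex, just an assumed approximation of the "/r/subreddit" form described above, extended with a negative lookahead that rejects a bare 'r' or 'l' after the slash.

```python
import re

# Simplified, assumed approximation of a Reddit-link pattern (optional leading
# slash, then r, l or u, a slash, and a name). The lookahead (?![rl]\b) rejects
# names that consist of a single "r" or "l", i.e. "r/l" and "l/r".
reddit_like = re.compile(r"(?<!\w)/?[rlu]/(?![rl]\b)\w+(?:/\w+)*", re.IGNORECASE)

for candidate in ["r/l", "l/r", "r/de", "/r/linux", "R/3", "L/S/R"]:
    print(candidate, bool(reddit_like.fullmatch(candidate)))
```

With this pattern 'r/l' and 'l/r' are no longer matched, while 'r/de' and '/r/linux' still are. Note that 'R/3' and 'L/S/R' would still be matched, so this only addresses the rechts/links case.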
I'm already doing a lot of preprocessing by replacing substrings that I don't want to be split and reintroducing them afterwards, pretty much like you did in the pre-2.0 versions.
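In case it is useful to others, the replace-and-restore idea looks roughly like this. It is a minimal sketch, not my production code: the function names and the placeholder format are made up for illustration, it assumes the tokenizer keeps a purely alphanumeric placeholder as a single token, and str.split stands in for the real tokenizer.

```python
def protect(text, phrases):
    """Replace each protected phrase with an alphanumeric placeholder."""
    mapping = {}
    # Longest phrases first, so a short phrase cannot break up a longer one.
    for i, phrase in enumerate(sorted(phrases, key=len, reverse=True)):
        if phrase in text:
            placeholder = f"PLACEHOLDER{i}X"
            text = text.replace(phrase, placeholder)
            mapping[placeholder] = phrase
    return text, mapping


def restore(tokens, mapping):
    """Put the original phrases back in place of their placeholders."""
    return [mapping.get(tok, tok) for tok in tokens]


protected, mapping = protect("Wir nutzen R/3 und L/S/R .", ["R/3", "L/S/R"])
tokens = protected.split()  # stand-in for the real tokenizer call
restored = restore(tokens, mapping)
```

The matching here is case-sensitive and exact; the placeholders must of course not occur in the input text themselves.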