LEMLAT3 Treatment of punctuation

Open Stormur opened this issue 6 years ago • 1 comment

I have noticed that punctuation marks other than the hyphen (-) are not analyzed by LEMLAT at all: they do not even appear as unknown wordforms in the unk file, which is where "-" ends up. However, when feeding LEMLAT a list of wordforms, one per line, it should be possible to retrieve every input wordform in one of the two output files.
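To make the request concrete, here is a minimal Python sketch of the behaviour I would expect: every non-empty input line ends up in exactly one of two lists. This is not LEMLAT code; `partition_wordforms`, `is_known` and `TOY_LEXICON` are hypothetical names used only for illustration, and the toy lexicon stands in for the real analysis.

```python
from __future__ import annotations

from typing import Callable, Iterable


def partition_wordforms(
    wordforms: Iterable[str],
    is_known: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Put every non-empty input line into exactly one of two lists."""
    analyzed: list[str] = []
    unknown: list[str] = []
    for form in wordforms:
        form = form.strip()
        if not form:
            continue  # skip blank lines only; punctuation is kept
        if is_known(form):
            analyzed.append(form)
        else:
            unknown.append(form)
    return analyzed, unknown


# Hypothetical stand-in for the analyzer's lexicon lookup.
TOY_LEXICON = {"rosa", "amat"}

inputs = ["rosa", ",", "-", "amat", "."]
analyzed, unknown = partition_wordforms(inputs, TOY_LEXICON.__contains__)

# Nothing is dropped: the two outputs together account for every input.
assert len(analyzed) + len(unknown) == len(inputs)
```

The point is only the invariant in the last line: punctuation marks may well be unknown, but they should still be retrievable from one of the outputs.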

Also, LEMLAT automatically splits a string wherever a ' appears, creating two wordforms that are then analyzed separately, and the inline output message does not mention this. There should be an option to change this behaviour so that LEMLAT analyzes each token as it is. Since the same splitting happens with ".", this is very relevant for the treatment of abbreviations, which are very often tokenized as "T." or "F." to distinguish them from occurrences of the isolated letters "T" or "F".
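The difference between the two behaviours can be sketched in Python as follows. Both functions are hypothetical illustrations, not LEMLAT's implementation: the first only approximates the splitting I observe (I am assuming the separator itself is discarded), and the second is the "token as it is" option I am asking for.

```python
import re


def split_on_apostrophe_and_period(line: str) -> list[str]:
    """Approximate the observed behaviour: break the line at ' and .

    Assumption: the separator character is dropped from the output.
    """
    return [part for part in re.split(r"['.]", line.strip()) if part]


def keep_as_is(line: str) -> list[str]:
    """Requested option: one input line is exactly one token."""
    token = line.strip()
    return [token] if token else []


for line in ["T.", "F.", "T"]:
    print(line, split_on_apostrophe_and_period(line), keep_as_is(line))
```

With splitting, the abbreviation "T." and the isolated letter "T" both reduce to "T" and can no longer be told apart; with the as-is option they remain distinct tokens.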

Feb 20 '19 13:02 Stormur
