Problem with Polish letters such as ąśćęóżźł in the named entity recognition interface

Open kiwimic opened this issue 5 years ago • 7 comments

I'm having a problem with Polish letters. universal-data-tool (web version and Windows app) splits a word wherever it contains a Polish letter (UTF-8) and treats each piece as a distinct word. In the example from the .png screenshot, "potrzebuję" and "różyczkę" are single words.

    "\U0104",        #Ą
    "\U0106",        #Ć
    "\U0118",        #Ę
    "\U0141",        #Ł
    "\U015A",        #Ś
    "\U0143",        #Ń
    "\U00D3",        #Ó
    "\U0179",        #Ź
    "\U017B",        #Ż
    "\U0105",        #ą
    "\U0107",        #ć
    "\U0119",        #ę
    "\U0142",        #ł
    "\U015B",        #ś
    "\U0144",        #ń
    "\U00F3",        #ó
    "\U017A",        #ź
    "\U017C"),      #ż,

kiwimic avatar Oct 23 '20 13:10 kiwimic

This should be easy to fix: there is a regex for splitting words in react-nlp-annotate (I think). We could even make it customizable.

Thanks for reporting. Do you have a sample string or regex for splitting? (so we can write a test?)

seveibar avatar Oct 23 '20 17:10 seveibar

As a result, this sentence should be split at spaces and commas:

"Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. "

I looked at react-nlp-annotate and found this function with this regex. I don't have experience with JS, so I could not test this myself, but in R a simple [\w] catches all Polish letters (and also numeric values, so for words only I use \U0104 etc.):

stringToSequence = (doc: string, sepRe: RegExp = /[a-zA-ZÀ-ÿ]+/g)

RegExp could be like [a-zA-ZÀ-ÿ \U0104\U0106\U0118\U0141\U015A\U0143\U00D3\U0179\U017B\U0105\U0107\U0119\U0142\U015B\U0144\U00F3\U017A\U017C]

These are the Unicode code points for all Polish special letters, both lower and upper case.
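
To illustrate the difference, here is a minimal JavaScript sketch. It assumes the character class is used to match whole words; `extractWords` is a hypothetical helper written for this comment, not a function from react-nlp-annotate. Note that JavaScript spells these escapes `\uXXXX` (lowercase `u`), unlike the `\U` form used in R.

```javascript
// Default character class quoted above from react-nlp-annotate.
const defaultWordRe = /[a-zA-ZÀ-ÿ]+/g;

// Assumption: the same class extended with the Polish letters listed above.
// (Ó and ó already fall inside À-ÿ; they are kept to mirror the list.)
const polishWordRe =
  /[a-zA-ZÀ-ÿ\u0104\u0106\u0118\u0141\u015A\u0143\u00D3\u0179\u017B\u0105\u0107\u0119\u0142\u015B\u0144\u00F3\u017A\u017C]+/g;

// Hypothetical helper (not part of react-nlp-annotate): list the matched words.
function extractWords(text, wordRe) {
  return text.match(wordRe) || [];
}

const sample = "potrzebuję różyczkę";

// The default class stops at ę and ż, so the words are cut into pieces.
console.log(extractWords(sample, defaultWordRe));
// → [ 'potrzebuj', 'ró', 'yczk' ]

// The extended class keeps both words whole.
console.log(extractWords(sample, polishWordRe));
// → [ 'potrzebuję', 'różyczkę' ]
```

The range À-ÿ only covers U+00C0 to U+00FF, which is why ó survives with the default class while ę and ż do not.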

kiwimic avatar Oct 23 '20 19:10 kiwimic

Great! We should have this fixed easily :)

The relevant file containing the regex is string-to-sequence.js (the relevant snippet was pasted by @kiwimic above).

We'll need to be able to pass a regex as a prop into that library to support custom regexes, but as @kiwimic suggested, we should be able to just paste in his regex codes and it will automatically work for Polish.

The full process for getting this into the UDT would be...

  1. Open a PR to react-nlp-annotate adding the Polish characters. Upon merging, it'll automatically publish a new npm module! You can test that it's working with yarn storybook by creating a story with Polish characters: put some text in one of the *.story.js files.
  2. Add the new react-nlp-annotate version to the UniversalDataTool with yarn add react-nlp-annotate and open a PR to this repo. The new UDT is published on merge!

I want to give it a couple of days for someone else to take a stab at this, to increase the :bus: factor!

seveibar avatar Oct 23 '20 21:10 seveibar

I've updated the text_entity_recognition specification to allow for a custom word-splitting regex. Of course, that's out of scope for just fixing the Polish letters, but it's relevant to the issue of testing and custom sequence splitting.

seveibar avatar Oct 23 '20 22:10 seveibar

See also: https://github.com/UniversalDataTool/universal-data-tool/issues/366#issuecomment-720637447

seveibar avatar Nov 02 '20 18:11 seveibar

@kiwimic this can now be fixed by putting [a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+ in the wordSplitRegex property of the dataset via #373
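
The doubled backslashes matter because the pattern is stored as a JSON string: JSON parsing turns each `\\u0104` into the six characters `\u0104`, which the `RegExp` constructor then reads as the letter Ą. A rough sketch of that round trip follows. Only the `wordSplitRegex` property name comes from this thread; the surrounding JSON shape is a hypothetical stand-in, and matching words with `String.prototype.match` is an assumption about how the pattern is applied.

```javascript
// Hypothetical JSON fragment. Only the wordSplitRegex property name comes from
// this thread; where it sits in a real dataset file is not shown here.
// String.raw keeps the doubled backslashes exactly as they appear in the file.
const datasetJson = String.raw`{
  "wordSplitRegex": "[a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+"
}`;

// JSON.parse collapses each \\ into a single backslash.
const { wordSplitRegex } = JSON.parse(datasetJson);

// Assumption: the pattern is compiled with the global flag and used to match words.
const wordRe = new RegExp(wordSplitRegex, "g");

const sentence =
  "Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. ";

// Spaces, the comma, and the period are not in the class, so they separate words.
const words = sentence.match(wordRe) || [];
console.log(words.length); // → 11
console.log(words);
```

With a single backslash in the JSON file, `\u0104` would instead be decoded by the JSON parser itself; that also yields a working pattern, but the doubled form keeps the stored string pure ASCII apart from the À-ÿ range.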

It's a bit annoying this needs to be done in the JSON setup, so I've created an issue to address getting it into the regular configuration: #374

seveibar avatar Nov 12 '20 17:11 seveibar

I think it's reasonable to add the Polish letters to the default regex of react-nlp-annotate as well. I'll reopen the issue to address that feature.

seveibar avatar Nov 12 '20 17:11 seveibar