universal-data-tool
Problem with Polish letters (such as ąśćęóżźł) in the named entity recognition interface
I'm having a problem with Polish letters. universal-data-tool (both the web version and the Windows app) splits a word wherever it contains a Polish letter (UTF-8) and treats each piece as a distinct word. In the example from the .png screenshot, "potrzebuję" and "różyczkę" are single words.
"\U0104", #Ą
"\U0106", #Ć
"\U0118", #Ę
"\U0141", #Ł
"\U015A", #Ś
"\U0143", #Ń
"\U00D3", #Ó
"\U0179", #Ź
"\U017B", #Ż
"\U0105", #ą
"\U0107", #ć
"\U0119", #ę
"\U0142", #ł
"\U015B", #ś
"\U0144", #ń
"\U00F3", #ó
"\U017A", #ź
"\U017C"), #ż,
This should be easy to fix: there is a regex for splitting words in react-nlp-annotate (I think). We could even make it customizable.
Thanks for reporting. Do you have a sample string or regex for splitting? (so we can write a test?)
As a result, this sentence should be split on spaces and commas:
"Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. "
I looked through react-nlp-annotate and found this function with this regex. I don't have experience with JS, so I could not test this myself, but in R a simple [\w] catches all Polish letters (and also digits, so to match words only I use \U0104 etc.).
stringToSequence = (doc: string, sepRe: RegExp = /[a-zA-ZÀ-ÿ]+/g)
RegExp could be like [a-zA-ZÀ-ÿ \U0104\U0106\U0118\U0141\U015A\U0143\U00D3\U0179\U017B\U0105\U0107\U0119\U0142\U015B\U0144\U00F3\U017A\U017C]
These are the Unicode code points for all Polish special letters, both lower and upper case.
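To make the proposal concrete, here is a minimal sketch of how the extended character class behaves on the sample sentence. It assumes plain JavaScript, where the Unicode escape is a lowercase `\u` (the uppercase `\U` form is R syntax), and it uses a hypothetical helper `splitWords` rather than the library's own `stringToSequence`. The `\p{L}` variant at the end is an alternative that was not proposed in this thread; it matches any Unicode letter and needs the `u` flag.

```javascript
// Sketch only: splitWords is a hypothetical helper, not react-nlp-annotate's API.

// Current default from string-to-sequence.js, as quoted above.
const DEFAULT_RE = /[a-zA-ZÀ-ÿ]+/g;

// Default plus the Polish code points listed above (lowercase \u in JavaScript).
// Ó (\u00D3) and ó (\u00F3) already fall inside À-ÿ, so they are redundant but harmless.
const POLISH_RE =
  /[a-zA-ZÀ-ÿ\u0104\u0106\u0118\u0141\u015A\u0143\u00D3\u0179\u017B\u0105\u0107\u0119\u0142\u015B\u0144\u00F3\u017A\u017C]+/g;

// Alternative (an assumption, not from the thread): any Unicode letter.
const ANY_LETTER_RE = /\p{L}+/gu;

// Return every run of characters matched by wordRe.
function splitWords(doc, wordRe) {
  return doc.match(wordRe) || [];
}

const sample =
  "Chrząszcz brzmi w trzcinie w Szczebrzeszynie, w szczękach chrząszcza trzeszczy miąższ. ";

const before = splitWords(sample, DEFAULT_RE); // breaks at ą, ę, ż: "Chrz", "szcz", ...
const after = splitWords(sample, POLISH_RE); // whole words: "Chrząszcz", "brzmi", ...
const anyLetter = splitWords(sample, ANY_LETTER_RE);

console.log(before);
console.log(after);
console.log(anyLetter);
```

The sample sentence could serve as the basis for the test requested above: with the extended class it should yield 11 words, the first being "Chrząszcz" and the last "miąższ".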
Great! We should have this fixed easily :)
The relevant file containing the Regex is: string-to-sequence.js. (the relevant snippet was pasted by @kiwimic above)
We'll need to be able to pass a regex as a prop into that library to do custom regexes, but as @kiwimic suggested, we should be able to just paste in his regex codes and it will automatically work for Polish.
The full process for getting this into the UDT would be...
- Open a PR to react-nlp-annotate adding the Polish characters; upon merging, it'll automatically publish a new npm module! You can test that it's working with `yarn storybook` and creating a story with Polish characters by putting some text in one of the `*.story.js` files.
- Add the new react-nlp-annotate version to the UniversalDataTool with `yarn add react-nlp-annotate` and open a PR to this repo. The new UDT is published on merge!
I want to give someone else a couple of days to take a stab at this, to increase the :bus: factor!
I've updated the text_entity_recognition specification to allow for a custom word splitting regex. Of course, that's out of scope for just fixing the polish signs, but relevant to the issue of testing and custom sequence splitting.
See also: https://github.com/UniversalDataTool/universal-data-tool/issues/366#issuecomment-720637447
@kiwimic this can now be fixed by putting [a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+ in the wordSplitRegex property of the dataset via #373
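The doubled backslashes matter because the regex is stored as a JSON string. The sketch below shows the decoding step under a few assumptions: the dataset fragment is hypothetical (in particular, placing `wordSplitRegex` under an `interface` object is a guess, not something this thread confirms), and how the library itself turns the string into a RegExp is not shown here, so `new RegExp(..., "g")` is only one plausible way to do it.

```javascript
// Hypothetical dataset fragment. The position of wordSplitRegex inside the
// real UDT JSON is an assumption made for this illustration.
// String.raw keeps the backslashes exactly as they would appear in a .json file.
const datasetJson = String.raw`{
  "interface": {
    "type": "text_entity_recognition",
    "wordSplitRegex": "[a-zA-ZÀ-ÿ\\u0104\\u0106\\u0118\\u0141\\u015A\\u0143\\u00D3\\u0179\\u017B\\u0105\\u0107\\u0119\\u0142\\u015B\\u0144\\u00F3\\u017A\\u017C]+"
  }
}`;

// JSON.parse turns each "\\u0104" into the six characters \u0104 ...
const dataset = JSON.parse(datasetJson);
const pattern = dataset.interface.wordSplitRegex;

// ... which the RegExp constructor then reads as a Unicode escape.
const wordRe = new RegExp(pattern, "g");

const words = "potrzebuję różyczkę".match(wordRe) || [];
console.log(words);
```

A single backslash in the JSON would also work, since JSON.parse would then decode `\u0104` directly to the letter Ą, which the character class matches just the same.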


It's a bit annoying this needs to be done in the JSON setup, so I've created an issue to address getting it into the regular configuration: #374
I think it's reasonable to add the Polish letters to the default regex of react-nlp-annotate as well, so I'll reopen the issue to address that feature.