
Python issues: Splitting long text by folia2txt and FLAT in the custom software

Open: osherenko opened this issue 3 years ago • 1 comment

  1. I've installed folia-utils and used the "folia2txt -s ..." from CLI to split a long string into sentences. Unfortunately, when I split the Old Slavonic text "Искони бе Слово и Слово бе отъ Бога. и Богъ бе слово." into sentences, I get the wrong result, with the text returned unsplit: Искони бе Слово и Слово бе отъ Бога. и Богъ бе слово. If I split an English text, it works just fine.
  2. Is it possible to run FLAT not as a tab in an internet browser, but as a PySide widget? BTW, I can't import folia2html from the foliatools package in my Python script as I did with foliatools.folia2txt, foliatools.foliafreqlist, foliatools.foliatree. Nevertheless, I can run it from the CLI by "python.exe foliatools\folia2txt.py -s myannotation.xml"

osherenko, Nov 15 '22 14:11

  1. I've installed folia-utils and used the "folia2txt -s ..." from CLI to split a long string into sentences.

folia2txt -s is not a proper sentence splitter; it simply assumes that each line of a text file is already its own sentence!

For an actual tokeniser and sentence splitter with rich FoLiA support, consider ucto (https://github.com/LanguageMachines/ucto). It has no specific rules for Old Church Slavonic, but you can use the generic ruleset (named generic) or the Russian one (tokconfig-rus).
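
To illustrate why a line-per-sentence assumption returns the text unchanged, here is a minimal sketch in plain Python. It is not the folia2txt or ucto code; both function names are hypothetical, and the punctuation-based split is deliberately naive (a real tokeniser such as ucto also has to deal with abbreviations, quotes and similar cases).

```python
import re

# The example text from the issue, all on a single line.
TEXT = "Искони бе Слово и Слово бе отъ Бога. и Богъ бе слово."


def sentences_by_line(text):
    """Treat every non-empty line as one sentence (hypothetical helper)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def sentences_by_punctuation(text):
    """Naively split after '.', '!' or '?' followed by whitespace (illustration only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# One line in, one "sentence" out: the text comes back unsplit.
print(sentences_by_line(TEXT))

# Splitting on sentence-final punctuation yields two sentences.
print(sentences_by_punctuation(TEXT))
```

With single-line input, the first function can only ever return one item, whatever the language of the text.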

  2. Is it possible to run FLAT not as a tab in an internet browser, but as a PySide widget?

I hadn't heard of PySide widgets until now, so I don't know. I suppose that if there is a Qt widget that embeds a whole web browser, then yes.

BTW, I can't import folia2html from the foliatools package in my Python script as I did with foliatools.folia2txt, foliatools.foliafreqlist, foliatools.foliatree. Nevertheless, I can run it from the CLI by "python.exe foliatools\folia2txt.py -s myannotation.xml"

Hmm, I see. That should probably be improved, yes.

proycon, Nov 15 '22 14:11