
file2brl generates a wrongly hyphenated word in a Hungarian Eurobraille document

Open hammera opened this issue 5 years ago • 2 comments

Hi List,

In 2017 Norbert and I found an interesting situation when using file2brl with the following parameters:

file2brl -f hu.cfg -t test.htm test.brf

If anybody would like to try to reproduce or fix this issue, I am attaching four files, including:

test.htm: the small source HTML document, cut down to the affected part.

test.brf: the wrongly generated Hungarian grade 1 braille document; line 29 contains the wrong Hungarian hyphenation.

hu.cfg: my Hungarian language specific preferences for file2brl.

On Linux, anybody can reproduce this issue by copying the hu.cfg file into the /usr/share/liblouisutdml/lbu_files directory and running the following command:

file2brl -f hu.cfg -t test.htm test.brf

On line 29 of the generated test.brf document, the file2brl utility hyphenates the word "bekezdés" wrongly. The hyphen character lands at character position 32 of line 29.

With Liblouis I verified where the word bekezdés can be hyphenated in Hungarian. The correct hyphenation is: be-kez-dés

Because the lou_checkhyphens utility cannot test the word bekezdés (the word contains an accented character), I wrote a small Python script to easily test any word in Hungarian. The code is as follows:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-

    import sys

    import louis

    TABLES = ['hu-hu-g1.ctb', 'hyph_hu_HU.dic']


    def apply_mask(word, hyphen_mask):
        # A '1' in the mask puts a hyphen before the matching character.
        return "".join("-" + ch if bit == '1' else ch
                       for ch, bit in zip(word, hyphen_mask))


    def hyphenate_word(word):
        try:
            return apply_mask(word, louis.hyphenate(TABLES, word, 0))
        except RuntimeError:
            # Fallback when hyphenating the whole word fails: split at the
            # existing hyphens, hyphenate each part, and rejoin with '-'.
            parts = word.split('-')
            return '-'.join(apply_mask(part, louis.hyphenate(TABLES, part, 0))
                            for part in parts)


    word = sys.argv[1]
    print('normal word: ' + word)
    print('hyphenated word: ' + hyphenate_word(word))

If I run the command python3 hyphenate.py bekezdés, I get the following correct output:

    normal word: bekezdés
    hyphenated word: be-kez-dés

I am attaching this small test program too.

The built-in hyphenate function of Liblouis confirms that the generated break "beke-" is not valid. On line 29, the first correct break that fits within the maximum line length of 32 characters is "be-", and the rest of the word, "kezdés", needs to go on the next line. After manual correction, the correct braille output for the affected text, in Eurobraille format and Hungarian grade 1 braille, is the following: "5qveg. $vajon e2 beh02"sos be- ke2d1s le5-e?"
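The expected behaviour described above (take the last permitted break that still fits in the cells left on the line) can be sketched in Python. This is only an illustration of the rule, not how file2brl works internally. The function name split_for_line is made up for this example, the mask "00100100" is written by hand from be-kez-dés using the same convention as the script above ('1' means a hyphen may go before that character), and the sketch assumes one braille cell per print character, which holds for bekezdés / beke2d1s but not in general.

```python
def split_for_line(word, mask, room):
    """Split word at the last allowed break whose head plus '-' fits in room cells.

    mask is a string of '0'/'1'; a '1' at index i allows a hyphen before word[i].
    Returns (head, tail), or None if no allowed break fits.
    """
    best = None
    for i, bit in enumerate(mask[:len(word)]):
        # i characters of the word plus one cell for the hyphen must fit.
        if i > 0 and bit == '1' and i + 1 <= room:
            best = i
    if best is None:
        return None
    return word[:best], word[best:]


# Text already on line 29 before the word, taken from the corrected output.
prefix = '5qveg. $vajon e2 beh02"sos '
room = 32 - len(prefix)  # cells left on a 32-cell line

print(split_for_line("bekezdés", "00100100", room))
```

With the 5 cells that remain on the line, the only allowed break that fits is after "be"; "beke-" would also fit in 5 cells, but the mask does not allow a break there.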

How can this situation be prevented during automatic braille conversion? For example, is it possible to blacklist this wrong hyphenation, given that Liblouis itself generates a good hyphenation mask for this word? In small texts this type of error is easy to correct, but in a large document, when the purpose is a printable braille book, it is a very tedious task for the proofreaders. There is a big chance that more issues of this type occur in a large text.
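Until the cause is fixed, the proofreading of a large document could at least be narrowed down to the lines that end in a hyphen. The following Python sketch is a review aid only and does not decide whether a break is valid. The function name line_end_hyphens is made up for this example, and it assumes the BRF file is plain text with one braille line per text line. Each hit still has to be checked by a person or against the Liblouis hyphenation mask, because a line can also end in a hyphen that belongs to the word itself.

```python
def line_end_hyphens(brf_text):
    """List (line_number, head, tail) for every line that ends with a hyphen.

    head is the word fragment before the hyphen, tail is the first word of
    the next line. Line numbers are 1-based.
    """
    lines = brf_text.split("\n")
    found = []
    for index in range(len(lines) - 1):
        current = lines[index].rstrip()
        if not current.endswith("-"):
            continue
        # Skip a dash that stands alone instead of ending a word fragment.
        if len(current) < 2 or current[-2].isspace():
            continue
        head_words = current[:-1].split()
        next_words = lines[index + 1].split()
        if head_words and next_words:
            found.append((index + 1, head_words[-1], next_words[0]))
    return found


# Sample taken from the manually corrected output quoted in this issue.
sample = '5qveg. $vajon e2 beh02"sos be-\nke2d1s le5-e?\n'
for number, head, tail in line_end_hyphens(sample):
    print(number, head + "-", tail)
```

For a real document the text would be read from the file, for example with open("test.brf", encoding="ascii", errors="replace"), and the reported fragments compared with the hyphenation that Liblouis gives for the whole word.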

I am attaching the affected files.

Attila

hammera, Aug 24 '18 14:08