liblouis
en-ueb-g2 back translate: standing alone, semicolon and preceding period
Hi,
I believe I'm seeing two issues with back translation in UEB.
- A following semicolon prevents wordsigns. For example, "⠏⠆" is back translated to "p;" rather than "people;".
- A preceding period causes use of a wordsign. For example, "⠑⠲⠛⠲" is back translated to "e.go." rather than "e.g.".
In both cases, forward translation is correct.
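To make the two examples easier to compare with the table rules quoted later in this thread, here is a small sketch that decodes Unicode braille cells into liblouis-style dot numbers. It relies only on the layout of the Unicode Braille Patterns block (U+2800 plus one bit per dot); the function names `cell_to_dots` and `braille_to_dots` are made up for illustration and are not part of liblouis.

```python
def cell_to_dots(cell: str) -> str:
    """Return the dot numbers of one Unicode braille cell, e.g. '⠏' -> '1234'."""
    offset = ord(cell) - 0x2800
    if not 0 <= offset <= 0xFF:
        raise ValueError(f"not a braille pattern: {cell!r}")
    # Bit n (counting from 0) of the offset is set when dot n+1 is raised.
    dots = "".join(str(dot) for dot in range(1, 9) if offset & (1 << (dot - 1)))
    return dots or "0"  # '0' denotes a blank cell


def braille_to_dots(text: str) -> str:
    """Decode a string of braille cells into dash-separated dot numbers."""
    return "-".join(cell_to_dots(cell) for cell in text)


print(braille_to_dots("⠏⠆"))    # 1234-23
print(braille_to_dots("⠑⠲⠛⠲"))  # 15-256-1245-256
```

In the table excerpts below, dots 1234 is the wordsign for "people", dots 1245 the wordsign for "go", dots 23 the semicolon, and dots 256 the period.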
Hello Michael. Thanks for your report. Yes, it is known that UEB has a lot of back-translation issues. However, it's not something that I can fix. The best chance of getting improvements is to contact the authors of the UEB table so that they become more aware of how urgent these issues are.
Hi Bert. Thanks for your reply. Is the preferred method of contacting table maintainers to use the email addresses in the table file? (Sorry, I'm not familiar with the workflow here, so I just made the issue.)
Indeed. In addition to the people mentioned in the table there is also James Bowden who has recently done work on the UEB table. You can find his details in the AUTHORS file.
Hi Bert, Michael, I have constructed the attached test file, using a fairly minimal set of rules extracted from the UEB tables. On the surface there seems nothing wrong with the rules and I am unsure why certain rules in my test cause the bugs reported. Attached test is really YAML, but the system wants me to rename as TXT.
A third case where back-translation goes wrong: a letter followed by a UEB close bracket (parenthesis), dots 5-345.
The rules which cause the problem are:
- sufword be 23
- always ar
- decpoint ? unsure
I do not know why these rules cause the problem, it could point to a bug in the engine.
Please do not hesitate to ask if you have any questions about the test file.
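One thing the three suspect rules have in common is that each uses a dot pattern that also appears as a cell of a punctuation sign in the same table. The sketch below only tabulates that overlap, using dot patterns copied from the rules quoted in this thread. It does not model the liblouis engine, and whether the overlap is the actual cause is an assumption, not something established here. The names `PUNCTUATION`, `SUSPECT_RULES` and `shared_cells` are hypothetical.

```python
# Dot patterns copied from the rules quoted in this thread.
PUNCTUATION = {";": "23", ")": "5-345", ".": "256"}
SUSPECT_RULES = {"sufword be": "23", "always ar": "345", "decpoint .": "256"}


def shared_cells(punctuation, rules):
    """List (rule, sign, dots) triples where a rule's dots equal one cell of a sign."""
    hits = []
    for rule, dots in rules.items():
        for sign, sign_dots in punctuation.items():
            # A multi-cell sign such as '5-345' is split into its cells.
            if dots in sign_dots.split("-"):
                hits.append((rule, sign, dots))
    return hits


for rule, sign, dots in shared_cells(PUNCTUATION, SUSPECT_RULES):
    print(f"{rule!r} shares dots {dots} with {sign!r}")
```

Each suspect rule pairs with exactly the punctuation sign named in the failing cases: "be" with the semicolon, "ar" with the second cell of the close parenthesis, and the decimal point with the period.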
The attached YAML file with xfail added:
table: |
# The following is a fairly minimal set of rules extracted from
# en-ueb-g2.ctb, to demonstrate the problems of issue #892
# Note the rules marked ~~~~ which seem to be the cause
include unicode.dis
include spaces.uti
include latinLetterDef8Dots.uti
# From en-ueb-chardefs.uti:
punctuation ! 235
nofor postpunc ! 235
match %a ! %a 56-235
punctuation " 6-2356
# The "?" symbol is mostly handled below,
# but the pattern needs to be defined before prepunc and postpunc can be used.
nofor punctuation ? 236
nofor punctuation " 356
nofor prepunc " 236
nofor postpunc " 356
match %[^_~]%<* " %[_.$]*%[a#] 236
match %[a#]%[_.$]* " %>*%[^_~] 356
punctuation ' 3
punctuation ( 5-126
punctuation ) 5-345
punctuation , 2
match %a , %a 56-2
punctuation - 36
hyphen - 36
punctuation . 256
match %a . %# 256-34569 force correct position of numeric indicator
noback pass2 @3456-256-34569 @256-3456 Clear up extra indicator after the match line
decpoint . 256
punctuation : 25
postpunc : 25
match %a : %a 56-25
#TODO: this is unnecessarily necessary
punctuation ; 23
# ~~~~ Unsure why the noback rule here is needed.
noback punctuation ; 56
match %a ; %a 56-23
# requires grade one indicator when by itself
punctuation ? 56-236
postpunc ? 236
punctuation [ 46-126
punctuation ] 46-345
punctuation { 456-126
punctuation } 456-345
punctuation \x2010 36 ‐
punctuation \x2011 36 ‑
noback punctuation \x2013 6-36 – backtranslate as \x2014
punctuation \x2014 6-36 — Rules of UEB, App.3
punctuation \x2015 5-6-36 ―
noback punctuation \x2018 6-236 ‘
noback punctuation \x2019 6-356 ’
match %a \x2019 %a 3 # single quote between letters is really apostrophe
punctuation \x201c 236 “
punctuation \x201d 356 ”
punctuation \x2026 256-256-256 … ellipsis
# from en-ueb-g1.ctb:
capsletter 6
begcapsword 6-6
endcapsword 6-3
lencapsphrase 3
begcapsphrase 6-6-6
endcapsphrase after 6-3
# from en-ueb-g2.ctb:
seqdelimiter -—
seqdelimiter ‐ \x2010
seqdelimiter ‑ \x2011
seqdelimiter – \x2013
seqdelimiter — \x2014
seqdelimiter ― \x2015
seqbeforechars ([{"“'‘
seqafterchars )]}"”'’.,;:.!?…
seqafterpattern 'd
seqafterpattern 'll
seqafterpattern 're
seqafterpattern 's
seqafterpattern 't
seqafterpattern 've
seqafterpattern ’d
seqafterpattern ’ll
seqafterpattern ’re
seqafterpattern ’s
seqafterpattern ’t
seqafterpattern ’ve
#TODO: all caps words (see lou_translateString.c:inSequence()
seqafterpattern 'D
seqafterpattern 'LL
seqafterpattern 'RE
seqafterpattern 'S
seqafterpattern 'T
seqafterpattern 'VE
seqafterexpression '([DSTdst]|ll|[rv]e|LL|[RV]E)
seqafterpattern ’D
seqafterpattern ’LL
seqafterpattern ’RE
seqafterpattern ’S
seqafterpattern ’T
seqafterpattern ’VE
seqafterexpression ’([DSTdst]|ll|[rv]e|LL|[RV]E)
match %[^_~]%<* as (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1356
match %[^_~]%<* but (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 12
match %[^_~]%<* can (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 14
match %[^_~]%<* do (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 145
match %[^_~]%<* every (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 15
match %[^_~]%<* from (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 124
match %[^_~]%<* go (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1245
match %[^_~]%<* have (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 125
match %[^_~]%<* it (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1346
match %[^_~]%<* just (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 245
match %[^_~]%<* knowledge (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 13
match %[^_~]%<* like (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 123
match %[^_~]%<* more (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 134
match %[^_~]%<* not (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1345
match %[^_~]%<* people (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1234
match %[^_~]%<* quite (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 12345
match %[^_~]%<* rather (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1235
match %[^_~]%<* so (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 234
match %[^_~]%<* that (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 2345
match %[^_~]%<* us (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 136
match %[^_~]%<* very (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1236
match %[^_~]%<* will (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 2456
match %[^_~]%<* you (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 13456
# ~~~~ issue #892: why do these rules fire in strings like e.g. ?
nofor word as 1356
nofor word but 12
nofor word can 14
nofor word do 145
nofor word every 15
nofor word from 124
nofor word go 1245
nofor word have 125
nofor word it 1346
nofor word just 245
nofor word knowledge 13
nofor word like 123
nofor word more 134
nofor word not 1345
nofor word people 1234
nofor word quite 12345
nofor word rather 1235
nofor word so 234
nofor word that 2345
nofor word us 136
nofor word very 1236
nofor word will 2456
nofor word you 13456
nofor word but's 12-3-234
nofor word but’s 12-3-234
nofor word can's 14-3-234
nofor word can’s 14-3-234
nofor word can't 14-3-2345
nofor word can’t 14-3-2345
nofor word do's 145-3-234
nofor word do’s 145-3-234
nofor word go's 1245-3-234
nofor word go’s 1245-3-234
nofor word have's 125-3-234
nofor word have’s 125-3-234
nofor word it'd 1346-3-145
nofor word it’d 1346-3-145
nofor word it'll 1346-3-123-123
nofor word it’ll 1346-3-123-123
nofor word it's 1346-3-234
nofor word it’s 1346-3-234
nofor word knowledge's 13-3-234
nofor word knowledge’s 13-3-234
nofor word like's 123-3-234
nofor word like’s 123-3-234
nofor word more's 134-3-234
nofor word more’s 134-3-234
nofor word people's 1234-3-234
nofor word people’s 1234-3-234
nofor word so's 234-3-234
nofor word so’s 234-3-234
nofor word that'd 2345-3-145
nofor word that’d 2345-3-145
nofor word that'll 2345-3-123-123
nofor word that’ll 2345-3-123-123
nofor word that're 2345-3-1235-15
nofor word that’re 2345-3-1235-15
nofor word that's 2345-3-234
nofor word that’s 2345-3-234
nofor word that've 2345-3-1236-15
nofor word that’ve 2345-3-1236-15
nofor word will's 2456-3-234
nofor word will’s 2456-3-234
nofor word will've 2456-3-1236-15
nofor word will’ve 2456-3-1236-15
nofor word you'd 13456-3-145
nofor word you’d 13456-3-145
nofor word you'll 13456-3-123-123
nofor word you’ll 13456-3-123-123
nofor word you're 13456-3-1235-15
nofor word you’re 13456-3-1235-15
nofor word you's 13456-3-234
nofor word you’s 13456-3-234
nofor word you've 13456-3-1236-15
nofor word you’ve 13456-3-1236-15
contraction b
contraction c
contraction d
contraction e
contraction f
contraction g
contraction h
contraction j
contraction k
contraction l
contraction m
contraction n
contraction p
contraction q
contraction r
contraction s
contraction t
contraction u
contraction v
contraction w
contraction x
contraction y
contraction z
contraction B
contraction C
contraction D
contraction E
contraction F
contraction G
contraction H
contraction J
contraction K
contraction L
contraction M
contraction N
contraction P
contraction Q
contraction R
contraction S
contraction T
contraction U
contraction V
contraction W
contraction X
contraction Y
contraction Z
# ~~~~ The following rule apparently causes the word rules to fail when followed by dot 5-345 (parenthesis)
always ar 345
match (%[^_~]%<*) ar (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?(%>*%[^_~]) =
always gh 126
match (%[^_~]%<*) gh ('([DSTdst]|ll|[rv]e|LL|[RV]E))?(%>*%[^_~]) =
match %[^_]|%[^_~]%<*[([{] be %[^_]|[)}\\]]%>*%[^_~] 23
empmatchafter match %[^_~]%<* be [Gg]![GSgs] 23 beg*
empmatchafter match %[^_~]%<* be [BFHJMOPQWXZbfhjmopqwxz] 23
# ~~~~ The following rule causes word rules to fail when followed by dots 23 (semicolon)
nofor sufword be 23
# End of test table
flags: {testmode: bothDirections}
tests:
- ['go', ⠛]
- ['go,', ⠛⠂]
- ['go;', ⠛⠆, xfail: "back-translation fails"]
- ['go:', ⠛⠒]
- ['go.', ⠛⠲]
- ['go!', ⠛⠖]
- ['go?', ⠛⠦]
- ['("go,")', ⠐⠣⠦⠛⠂⠴⠐⠜, xfail: "back-translation fails"]
- ['not', ⠝]
- ['not,', ⠝⠂]
- ['not;', ⠝⠆, xfail: "back-translation fails"]
- ['not:', ⠝⠒]
- ['not.', ⠝⠲]
- ['not!', ⠝⠖]
- ['not?', ⠝⠦]
- ['("not,")', ⠐⠣⠦⠝⠂⠴⠐⠜, xfail: "back-translation fails"]
- ['people', ⠏]
- ['people,', ⠏⠂]
- ['people;', ⠏⠆, xfail: "back-translation fails"] # Original test in #892
- ['people:', ⠏⠒]
- ['people.', ⠏⠲]
- ['people!', ⠏⠖]
- ['people?', ⠏⠦]
- ['("people,")', ⠐⠣⠦⠏⠂⠴⠐⠜, xfail: "back-translation fails"]
- ['be', ⠆]
- ['(be', ⠐⠣⠆]
- ['be)', ⠆⠐⠜]
- ['beg', ⠃⠑⠛]
- ['began', ⠆⠛⠁⠝]
- ['C.S.', ⠠⠉⠲⠠⠎⠲, xfail: "back-translation fails"]
- ['e.g.', ⠑⠲⠛⠲, xfail: "back-translation fails"] # original test in #892
- ['i.e.', ⠊⠲⠑⠲, xfail: "back-translation fails"]
- ['n.b.', ⠝⠲⠃⠲, xfail: "back-translation fails"]
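When reading test vectors like the ones above, it can help to go the other way and build the expected braille from the dot numbers in the table. The following is a minimal sketch, assuming only dots 1 to 8 (plus `0` for a blank cell) are used; liblouis virtual dots such as `9` in `256-34569` are deliberately rejected. The function name `dots_to_braille` is hypothetical.

```python
def dots_to_braille(spec: str) -> str:
    """Convert liblouis-style dot numbers, e.g. '5-126', to Unicode braille cells."""
    cells = []
    for group in spec.split("-"):
        offset = 0
        for digit in group:
            if digit not in "012345678":
                raise ValueError(f"unsupported dot {digit!r} in {spec!r}")
            if digit != "0":  # '0' stands for a blank cell
                offset |= 1 << (int(digit) - 1)
        cells.append(chr(0x2800 + offset))
    return "".join(cells)


# '("go,")' from the table rules: ( = 5-126, opening " = 236, go = 1245,
# , = 2, closing " = 356, ) = 5-345
print(dots_to_braille("5-126-236-1245-2-356-5-345"))  # ⠐⠣⠦⠛⠂⠴⠐⠜
```

The result matches the braille side of the `'("go,")'` test, which confirms that the expected forward translation follows directly from the punctuation and `go` rules in the table.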
Here is another, more stripped down YAML file from James that aims to identify the bug(s) in Liblouis' back-translation code (whereas the previous YAML file is a more complete overview of back-translation issues in the UEB table):
table: |
  # The following is a minimal set of rules extracted from
  # en-ueb-g2.ctb, to demonstrate the problems of issue #892
  # Note the rules marked ~~~~ which seem to be the cause
  include unicode.dis
  include spaces.uti
  include latinLetterDef8Dots.uti
  punctuation . 256
  punctuation ) 5-345
  punctuation ; 23
  # from en-ueb-g1.ctb:
  capsletter 6
  begcapsword 6-6
  endcapsword 6-3
  lencapsphrase 3
  begcapsphrase 6-6-6
  endcapsphrase after 6-3
  # from en-ueb-g2.ctb:
  word every 15
  word go 1245
  word people 1234
  contraction e
  contraction g
  contraction p
  contraction E
  contraction G
  contraction P
  # ~~~~ The following rule apparently causes the word rules to fail when followed by dot 5-345 (parenthesis)
  always ar 345
  # ~~~~ The following rule causes word rules to fail when followed by dots 23 (semicolon)
  sufword be 23
  # End of test table
flags: {testmode: bothDirections}
tests:
  - ['people', ⠏]
  - ['people;', ⠏⠆, xfail: "back-translation fails"] # Original test in #892
  - ['people.', ⠏⠲]
  - ['people)', ⠏⠐⠜, xfail: "back-translation fails"]
  - ['e.g.', ⠑⠲⠛⠲, xfail: "back-translation fails"] # original test in #892
Many thanks @bertfrees for the stripped down tests. I believe my PR #1509 fixes the first problem. The fix was to define the semicolon correctly, replacing the odd rules under the "this is unnecessarily necessary" comment, whose purpose no one seemed to know. Hopefully this PR fixes back-translation of both wordsigns and shortforms followed by a semicolon:
- ['people;', ⠏⠆] # should now pass
- ['paid;', ⠏⠙⠆] # should also now pass
However, the other issues point to problems in the engine. Please let me know if any further changes are needed in the table. The AR contraction is a correct definition in the table and should not be affecting back translation in these tests.