
en-ueb-g2 back translate: standing alone, semicolon and preceding period

Open seally1186 opened this issue 5 years ago • 7 comments

Hi,

I believe I'm seeing two issues with back translation in UEB.

  • A following semicolon prevents wordsigns. For example, "⠏⠆" is back translated to "p;" rather than "people;".
  • A preceding period causes use of a wordsign. For example, "⠑⠲⠛⠲" is back translated to "e.go." rather than "e.g.".

In both cases, forward translation is correct.
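
For anyone comparing the cells above with the dot numbers used in the table files, here is a small Python sketch. It is not part of liblouis, and the function names are made up for illustration. It relies only on the layout of the Unicode braille block, where bit n of the offset from U+2800 stands for dot n + 1.

```python
def cell_to_dots(cell: str) -> str:
    """Return the dot numbers of one Unicode braille cell, e.g. dots 2 and 3 -> '23'."""
    offset = ord(cell) - 0x2800
    if not 0 <= offset <= 0xFF:
        raise ValueError(f"not a braille pattern: {cell!r}")
    # Bit n of the offset (counting from 0) corresponds to dot n + 1.
    dots = "".join(str(dot) for dot in range(1, 9) if offset & (1 << (dot - 1)))
    # Table files write the blank cell as dot 0.
    return dots or "0"


def braille_to_dots(text: str) -> str:
    """Join the dot numbers of each cell with '-', as the table files write them."""
    return "-".join(cell_to_dots(cell) for cell in text)


print(braille_to_dots("⠏⠆"))    # 1234-23
print(braille_to_dots("⠑⠲⠛⠲"))  # 15-256-1245-256
```

So the first example is dots 1234 followed by dots 23, and the second is dots 15, 256, 1245, 256.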

seally1186 avatar Feb 06 '20 16:02 seally1186

Hello Michael. Thanks for your report. Yes, it is known that there are a lot of back-translation issues with UEB. However, it's not something that I can fix. The best chance of getting improvements is to contact the authors of the UEB table so that they become more aware of the urgency of these issues.

bertfrees avatar Feb 06 '20 16:02 bertfrees

Hi Bert. Thanks for your reply. Is the preferred method of contacting table maintainers to use the email addresses in the table file? (Sorry, I'm not familiar with the workflow here, so I just opened the issue.)

seally1186 avatar Feb 07 '20 22:02 seally1186

Indeed. In addition to the people mentioned in the table there is also James Bowden who has recently done work on the UEB table. You can find his details in the AUTHORS file.

bertfrees avatar Feb 09 '20 23:02 bertfrees

Hi Bert, Michael, I have constructed the attached test file, using a fairly minimal set of rules extracted from the UEB tables. On the surface there seems to be nothing wrong with the rules, and I am unsure why certain rules in my test cause the bugs reported. The attached test is really YAML, but the system wants me to rename it as TXT.

There is a third case where back-translation goes wrong: when a letter is followed by a UEB close bracket (parenthesis), dots 5-345.

The rules which cause the problem are:

  1. sufword be 23
  2. always ar
  3. decpoint ? unsure

I do not know why these rules cause the problem. It could point to a bug in the engine.
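
To make the overlap between these rules and the punctuation signs easier to see, here is a small Python sketch. It is only a bookkeeping aid: the `RULES` list is hand-copied from the test table, the function names are invented for illustration, and nothing here models how the liblouis engine actually chooses between rules.

```python
from collections import defaultdict

# (opcode, characters, dots), hand-copied from the rules discussed in this issue.
RULES = [
    ("punctuation", ";", "23"),
    ("sufword", "be", "23"),
    ("punctuation", ")", "5-345"),
    ("always", "ar", "345"),
    ("punctuation", ".", "256"),
    ("decpoint", ".", "256"),
]


def rules_by_cell(rules):
    """Map each single cell (e.g. '345') to the rules whose dots contain it."""
    index = defaultdict(list)
    for opcode, chars, dots in rules:
        for cell in dots.split("-"):
            index[cell].append(f"{opcode} {chars} {dots}")
    return index


def shared_cells(rules):
    """Keep only the cells that appear in more than one rule."""
    return {cell: users for cell, users in rules_by_cell(rules).items() if len(users) > 1}


for cell, users in sorted(shared_cells(RULES).items()):
    print(cell, "->", users)
```

Each of the three failing cases involves a cell that more than one rule can match during back-translation. That is consistent with the observations above, but it does not by itself explain why the word rules stop firing.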

Please do not hesitate to ask if you have any questions about the test file.

semicolon_test1.txt

jrbowden avatar Jun 11 '21 14:06 jrbowden

The attached YAML file with xfail added:
table: |
  # The following is a fairly minimal set of rules extracted from
  # en-ueb-g2.ctb, to demonstrate the problems of issue #892
  # Note the rules marked ~~~~ which seem to be the cause
  include unicode.dis
  include spaces.uti
  include latinLetterDef8Dots.uti
  
  # From en-ueb-chardefs.uti:
  punctuation ! 235
  nofor postpunc ! 235
  match %a ! %a 56-235
  
  punctuation " 6-2356
  # The "?" symbol is mostly handled below,
  # but the pattern needs to be defined before prepunc and postpunc can be used.
  nofor punctuation ? 236
  nofor punctuation " 356
  nofor prepunc " 236
  nofor postpunc " 356
  match %[^_~]%<* " %[_.$]*%[a#] 236
  match %[a#]%[_.$]* " %>*%[^_~] 356
  
  punctuation ' 3
  punctuation ( 5-126
  punctuation ) 5-345
  punctuation , 2
  match %a , %a 56-2
  punctuation - 36
  hyphen - 36
  punctuation . 256
  match %a . %# 256-34569  force correct position of numeric indicator
  noback pass2 @3456-256-34569 @256-3456    Clear up extra indicator after the match line
  decpoint . 256
  punctuation : 25
  postpunc : 25
  match %a : %a 56-25
  #TODO:  this is unnecessarily necessary
  punctuation ; 23
  
  # ~~~~ Unsure why the noback rule here is needed.
  noback punctuation ; 56
  
  match %a ; %a 56-23
  # requires grade one indicator when by itself
  punctuation ? 56-236
  postpunc ? 236
  punctuation [ 46-126
  punctuation ] 46-345
  punctuation { 456-126
  punctuation } 456-345
  
  punctuation \x2010 36 ‐
  punctuation \x2011 36 ‑
  noback punctuation \x2013 6-36 – backtranslate as \x2014
  punctuation \x2014 6-36 —   Rules of UEB, App.3
  punctuation \x2015 5-6-36 ―
  noback punctuation \x2018 6-236 ‘
  noback punctuation \x2019 6-356 ’
  match %a \x2019 %a 3 # single quote between letters is really  apostrophe
  punctuation \x201c 236 “
  punctuation \x201d 356 ”
  punctuation \x2026 256-256-256 …   ellipsis
  
  # from en-ueb-g1.ctb:
  capsletter 6
  begcapsword 6-6
  endcapsword 6-3
  lencapsphrase 3
  begcapsphrase 6-6-6
  endcapsphrase after 6-3

  # from en-ueb-g2.ctb:
  seqdelimiter -—
  seqdelimiter ‐   \x2010
  seqdelimiter ‑   \x2011
  seqdelimiter –   \x2013
  seqdelimiter —   \x2014
  seqdelimiter ―   \x2015
  
  seqbeforechars ([{"“'‘
  seqafterchars  )]}"”'’.,;:.!?…
  seqafterpattern 'd
  seqafterpattern 'll
  seqafterpattern 're
  seqafterpattern 's
  seqafterpattern 't
  seqafterpattern 've
  seqafterpattern ’d
  seqafterpattern ’ll
  seqafterpattern ’re
  seqafterpattern ’s
  seqafterpattern ’t
  seqafterpattern ’ve
  #TODO:  all caps words (see lou_translateString.c:inSequence()
  seqafterpattern 'D
  seqafterpattern 'LL
  seqafterpattern 'RE
  seqafterpattern 'S
  seqafterpattern 'T
  seqafterpattern 'VE
  seqafterexpression '([DSTdst]|ll|[rv]e|LL|[RV]E)
  seqafterpattern ’D
  seqafterpattern ’LL
  seqafterpattern ’RE
  seqafterpattern ’S
  seqafterpattern ’T
  seqafterpattern ’VE
  seqafterexpression ’([DSTdst]|ll|[rv]e|LL|[RV]E)
  
  match %[^_~]%<* as (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1356
  match %[^_~]%<* but (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 12
  match %[^_~]%<* can (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 14
  match %[^_~]%<* do (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 145
  match %[^_~]%<* every (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 15
  match %[^_~]%<* from (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 124
  match %[^_~]%<* go (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1245
  match %[^_~]%<* have (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 125
  match %[^_~]%<* it (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1346
  match %[^_~]%<* just (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 245
  match %[^_~]%<* knowledge (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 13
  match %[^_~]%<* like (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 123
  match %[^_~]%<* more (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 134
  match %[^_~]%<* not (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1345
  match %[^_~]%<* people (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1234
  match %[^_~]%<* quite (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 12345
  match %[^_~]%<* rather (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1235
  match %[^_~]%<* so (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 234
  match %[^_~]%<* that (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 2345
  match %[^_~]%<* us (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 136
  match %[^_~]%<* very (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 1236
  match %[^_~]%<* will (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 2456
  match %[^_~]%<* you (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?%>*%[^_~] 13456
  
  # ~~~~ issue #892: why do these rules fire in strings like e.g. ?
  nofor word as  1356
  nofor word but  12
  nofor word can  14
  nofor word do  145
  nofor word every  15
  nofor word from  124
  nofor word go  1245
  nofor word have  125
  nofor word it  1346
  nofor word just  245
  nofor word knowledge  13
  nofor word like  123
  nofor word more  134
  nofor word not  1345
  nofor word people  1234
  nofor word quite  12345
  nofor word rather  1235
  nofor word so  234
  nofor word that  2345
  nofor word us  136
  nofor word very  1236
  nofor word will  2456
  nofor word you  13456
  
  nofor word but's 12-3-234
  nofor word but’s 12-3-234
  nofor word can's 14-3-234
  nofor word can’s 14-3-234
  nofor word can't 14-3-2345
  nofor word can’t 14-3-2345
  nofor word do's 145-3-234
  nofor word do’s 145-3-234
  nofor word go's 1245-3-234
  nofor word go’s 1245-3-234
  nofor word have's 125-3-234
  nofor word have’s 125-3-234
  nofor word it'd 1346-3-145
  nofor word it’d 1346-3-145
  nofor word it'll 1346-3-123-123
  nofor word it’ll 1346-3-123-123
  nofor word it's 1346-3-234
  nofor word it’s 1346-3-234
  nofor word knowledge's 13-3-234
  nofor word knowledge’s 13-3-234
  nofor word like's 123-3-234
  nofor word like’s 123-3-234
  nofor word more's 134-3-234
  nofor word more’s 134-3-234
  nofor word people's 1234-3-234
  nofor word people’s 1234-3-234
  nofor word so's 234-3-234
  nofor word so’s 234-3-234
  nofor word that'd 2345-3-145
  nofor word that’d 2345-3-145
  nofor word that'll 2345-3-123-123
  nofor word that’ll 2345-3-123-123
  nofor word that're 2345-3-1235-15
  nofor word that’re 2345-3-1235-15
  nofor word that's 2345-3-234
  nofor word that’s 2345-3-234
  nofor word that've 2345-3-1236-15
  nofor word that’ve 2345-3-1236-15
  nofor word will's 2456-3-234
  nofor word will’s 2456-3-234
  nofor word will've 2456-3-1236-15
  nofor word will’ve 2456-3-1236-15
  nofor word you'd 13456-3-145
  nofor word you’d 13456-3-145
  nofor word you'll 13456-3-123-123
  nofor word you’ll 13456-3-123-123
  nofor word you're 13456-3-1235-15
  nofor word you’re 13456-3-1235-15
  nofor word you's 13456-3-234
  nofor word you’s 13456-3-234
  nofor word you've 13456-3-1236-15
  nofor word you’ve 13456-3-1236-15
  contraction b
  contraction c
  contraction d
  contraction e
  contraction f
  contraction g
  contraction h
  contraction j
  contraction k
  contraction l
  contraction m
  contraction n
  contraction p
  contraction q
  contraction r
  contraction s
  contraction t
  contraction u
  contraction v
  contraction w
  contraction x
  contraction y
  contraction z
  contraction B
  contraction C
  contraction D
  contraction E
  contraction F
  contraction G
  contraction H
  contraction J
  contraction K
  contraction L
  contraction M
  contraction N
  contraction P
  contraction Q
  contraction R
  contraction S
  contraction T
  contraction U
  contraction V
  contraction W
  contraction X
  contraction Y
  contraction Z
  
  # ~~~~ The following rule apparently causes the word rules to fail when followed by dots 5-345 (parenthesis)
  always ar 345
  
  match (%[^_~]%<*) ar (['’]([DSTdst]|ll|[rv]e|LL|[RV]E))?(%>*%[^_~]) =
  always gh 126
  match (%[^_~]%<*) gh ('([DSTdst]|ll|[rv]e|LL|[RV]E))?(%>*%[^_~]) =
  match %[^_]|%[^_~]%<*[([{] be %[^_]|[)}\\]]%>*%[^_~] 23
  empmatchafter match %[^_~]%<* be [Gg]![GSgs] 23                     beg*
  empmatchafter match %[^_~]%<* be [BFHJMOPQWXZbfhjmopqwxz] 23
  
  # ~~~~ The following rule causes word rules to fail when followed by dots 23 (semicolon)
  nofor sufword be 23
  
  # End of test table

flags: {testmode: bothDirections}
tests:
  - ['go', ⠛]
  - ['go,', ⠛⠂]
  - ['go;', ⠛⠆, xfail: "back-translation fails"]
  - ['go:', ⠛⠒]
  - ['go.', ⠛⠲]
  - ['go!', ⠛⠖]
  - ['go?', ⠛⠦]
  - ['("go,")', ⠐⠣⠦⠛⠂⠴⠐⠜, xfail: "back-translation fails"]
  - ['not', ⠝]
  - ['not,', ⠝⠂]
  - ['not;', ⠝⠆, xfail: "back-translation fails"]
  - ['not:', ⠝⠒]
  - ['not.', ⠝⠲]
  - ['not!', ⠝⠖]
  - ['not?', ⠝⠦]
  - ['("not,")', ⠐⠣⠦⠝⠂⠴⠐⠜, xfail: "back-translation fails"]
  - ['people', ⠏]
  - ['people,', ⠏⠂]
  - ['people;', ⠏⠆, xfail: "back-translation fails"] # Original test in #892
  - ['people:', ⠏⠒]
  - ['people.', ⠏⠲]
  - ['people!', ⠏⠖]
  - ['people?', ⠏⠦]
  - ['("people,")', ⠐⠣⠦⠏⠂⠴⠐⠜, xfail: "back-translation fails"]
  - ['be', ⠆]
  - ['(be', ⠐⠣⠆]
  - ['be)', ⠆⠐⠜]
  - ['beg', ⠃⠑⠛]
  - ['began', ⠆⠛⠁⠝]
  - ['C.S.', ⠠⠉⠲⠠⠎⠲, xfail: "back-translation fails"]
  - ['e.g.', ⠑⠲⠛⠲, xfail: "back-translation fails"] # original test in #892
  - ['i.e.', ⠊⠲⠑⠲, xfail: "back-translation fails"]
  - ['n.b.', ⠝⠲⠃⠲, xfail: "back-translation fails"]

bertfrees avatar Jun 11 '21 14:06 bertfrees

Here is another, more stripped down YAML file from James that aims to identify the bug(s) in Liblouis' back-translation code (whereas the previous YAML file is a more complete overview of back-translation issues in the UEB table):
table: |
  # The following is a minimal set of rules extracted from
  # en-ueb-g2.ctb, to demonstrate the problems of issue #892
  # Note the rules marked ~~~~ which seem to be the cause
  include unicode.dis
  include spaces.uti
  include latinLetterDef8Dots.uti
  
  punctuation . 256
  punctuation ) 5-345
  punctuation ; 23
  
  # from en-ueb-g1.ctb:
  capsletter 6
  begcapsword 6-6
  endcapsword 6-3
  lencapsphrase 3
  begcapsphrase 6-6-6
  endcapsphrase after 6-3

  # from en-ueb-g2.ctb:
  word every 15
  word go 1245
  word people  1234
  
  contraction e
  contraction g
  contraction p
  contraction E
  contraction G
  contraction P
  
  # ~~~~ The following rule apparently causes the word rules to fail when followed by dots 5-345 (parenthesis)
  always ar 345
  
  # ~~~~ The following rule causes word rules to fail when followed by dots 23 (semicolon)
  sufword be 23
  
  # End of test table

flags: {testmode: bothDirections}
tests:
  - ['people', ⠏]
  - ['people;', ⠏⠆, xfail: "back-translation fails"] # Original test in #892
  - ['people.', ⠏⠲]
  - ['people)', ⠏⠐⠜, xfail: "back-translation fails"]
  - ['e.g.', ⠑⠲⠛⠲, xfail: "back-translation fails"] # original test in #892

bertfrees avatar Jun 22 '21 16:06 bertfrees

Many thanks @bertfrees for the stripped down tests. I believe my PR #1509 fixes the first problem. The fix was to define the semicolon correctly, replacing the strange rules marked "this is unnecessarily necessary", whose purpose no one seemed to know. Hopefully this PR fixes back translation of both wordsigns and shortforms followed by a semicolon:

  - ['people;', ⠏⠆] # should now pass
  - ['paid;', ⠏⠙⠆] # should also now pass

However, the other issues point to problems in the engine. Please let me know if any further changes are needed in the table. The AR contraction is a correct definition in the table and should not be affecting back translation in these tests.
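
For reference, the results expected from the stripped down tests can be described with a toy model in Python. This is not how liblouis works internally and it is not a proposed fix. It assumes a simplified reading of "standing alone" (a single letter cell followed only by closing punctuation), and the dictionaries and function names are invented for illustration.

```python
# Toy model only: a handful of cells taken from the stripped down test table.
WORDSIGNS = {"⠑": "every", "⠛": "go", "⠏": "people"}
LETTERS = {"⠑": "e", "⠛": "g", "⠏": "p"}
# Closing punctuation that may follow a wordsign; ')' is the two-cell sign 5-345.
CLOSING = {"⠐⠜": ")", "⠆": ";", "⠲": "."}


def split_trailing_punctuation(cells):
    """Split closing punctuation off the end; return (core cells, punctuation text)."""
    tail = ""
    while True:
        # Try longer signs first so the two-cell ')' is matched as a whole.
        sign = next(
            (s for s in sorted(CLOSING, key=len, reverse=True) if cells.endswith(s)),
            None,
        )
        if sign is None:
            return cells, tail
        tail = CLOSING[sign] + tail
        cells = cells[: -len(sign)]


def back_translate_word(cells):
    """Back-translate one space-delimited braille word under the toy model."""
    core, tail = split_trailing_punctuation(cells)
    if core in WORDSIGNS:
        # A single letter cell with only closing punctuation after it stands alone.
        return WORDSIGNS[core] + tail
    # Otherwise read cell by cell; punctuation inside the word stays punctuation.
    return "".join(LETTERS.get(c) or CLOSING.get(c, "?") for c in core) + tail


for word in ["⠏", "⠏⠆", "⠏⠲", "⠏⠐⠜", "⠑⠲⠛⠲"]:
    print(word, "->", back_translate_word(word))
```

Under this model the five stripped down tests come out as "people", "people;", "people.", "people)" and "e.g.", which is what the back translation is expected to produce.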

jrbowden avatar Feb 19 '24 15:02 jrbowden