liblouis New "exception" opcode for implementing exceptions to contractions without making overly long rules

After having declared the basic contraction rules, many grade 2 tables end up the following way:

always exception1 pattern
always exception2 pattern2
always exceptionToException1 longPattern
always exceptionToTheExceptionToException1 longPattern
always veryLongChunk veryLongPattern
always evenLongerChunk evenLongerPattern
...

This approach has the following two disadvantages:

1. Overusing the "always" opcode and then moderating it with further "always" rules can break back-translation, because rules are triggered by dot patterns whether or not they are appropriate.
2. Cursor movement works less than optimally when long chunks are used. Rules are atomic, and the cursor can only move from the beginning of one applied rule to the beginning of the next one. When the rules consist of long chunks, the cursor movement mirrors the internal structure of the table rather than the cursor position in the text.
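
The cursor problem can be illustrated with a toy model. This is a minimal Python sketch and not liblouis code: `translate` is a hypothetical greedy longest-match translator, dot patterns are written as plain strings, and every input character covered by a rule is mapped to the first output cell of that rule, which is the atomic behaviour described above.

```python
def translate(text, rules):
    """Toy greedy longest-match translator (an assumption, not liblouis).

    Returns (cells, in_to_out), where in_to_out[i] is the output index
    that a cursor on input character i is mapped to.
    """
    cells = []
    in_to_out = []
    i = 0
    while i < len(text):
        # Longest rule matching at position i; fall back to the single character.
        match = max((k for k in rules if text.startswith(k, i)),
                    key=len, default=text[i])
        # All characters consumed by one rule share the rule's first output cell.
        in_to_out.extend([len(cells)] * len(match))
        cells.extend(rules.get(match, [match]))
        i += len(match)
    return cells, in_to_out


# A long uncontracted chunk that exists only to block the "of" contraction.
chunk_rules = {"of": ["12356"], "profes": list("profes")}
# The basic rule alone, with no blocking chunk.
basic_rules = {"of": ["12356"]}

_, chunk_map = translate("professor", chunk_rules)
_, basic_map = translate("professor", basic_rules)
```

With the long chunk, the first six letters of "professor" all map to output cell 0, so the cursor cannot be placed inside "profes". With only the short rule, the mapping is coarse only at the two letters of "of".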

Ideally, I think the word/begword/partword/always/etc. opcodes should be used only to define the rules as laid down by the Braille authority. Exceptions should be maintained by some other means.

One option is to use hyphenation. In the Danish tables, I have now successfully used hyphenation to eliminate all nocross chunks except for the basic rules (PR coming up). It makes for a very clean implementation of the table, with all the exceptions spelled out in the hyphenation file. You don't have to worry about the logic of the exceptions, which is all taken care of by patgen.

If one does not feel comfortable using the hyphenation approach, but would rather use a more classical exception list approach, we could make a new exception opcode or add an extra feature to the correct opcode to insert an invisible hyphen at pass 0.

Then we could have a table like the following:

partword of 12356
noback correct "profes" "pro_fes"
# or simply:
exception pro_fes

It would not affect the cursor position after translation, and if "always" is not overused, back-translation has a better chance of succeeding.

Jun 21 '18 18:06 BueVest

This would of course be a very big improvement.

But first I think this is going to require very good tests if we or somebody else who doesn't know the braille code wants to do such a refactoring.

And second, who is willing to do the work? If we can't find anyone to do it, does it make sense to keep this issue open? Should we start actively encouraging table authors to rewrite their tables? I do like clean tables, and I make sure that new contributions are done well. But I don't care enough about it to do this kind of work myself, or to track down the authors of all the contracted tables and write to them.

Regarding the idea of the exception opcode: @BueVest, could you elaborate a bit more on it? If it's something that can already be done using correct, I would prefer not to create a new opcode. If it can't be done yet, we should open a new issue.

Aug 14 '19 18:08 bertfrees

Yes, I see your point. It could probably be done by using the "correct" opcode to insert a dummy Unicode character and then a "pass" to remove the resulting Braille pattern (see example above). However, this approach would probably not seem very intuitive to non-programmers. That is why I suggested an "exception" opcode. It could even be a macro that uses the technique mentioned above. Perhaps we could make a recommendation in the documentation and leave it at that for this issue.

Sep 07 '19 12:09 BueVest

Yes, documentation is always a good start. This is also related to the documentation for lou_maketable which is also still missing: https://github.com/liblouis/liblouis/issues/404

Sep 10 '19 17:09 bertfrees

Perhaps then, we should wait until that has been fixed. Then, it is easier to refer to it when explaining better ways to make complicated grade 2 tables.

No, I don’t imagine anyone going through all the grade 2 tables that others wrote and rewriting them with a new technique. It would be impossible for anyone who isn’t fluent in the language and Braille code, unless the test material is very extensive.

Sep 11 '19 20:09 BueVest

Should we close this issue and then perhaps open another issue about creating an "exception" opcode as an alternative to the hyphenation approach for those who want to write their tables in a more classic way?

Dec 09 '23 17:12 BueVest

Please go ahead and change the title of this issue. I think it makes sense to have such an opcode. We just need to think some more about the exact syntax and the name.

Dec 09 '23 22:12 bertfrees

Here are my thoughts on such an opcode:

exception letters <separator> letters

Tells Liblouis not to contract across the separator sign.

e.g.

always of 1-2-3-5-6
exception pro_fes

I would suggest that the separator be either _ or |.

The rest of the exception string should be only letters, not punctuation or digits etc.

It could be applied during pass 0 as a correct opcode that inserts an invisible character in the string to be translated without creating actual new word boundaries, as this might introduce incorrectly used word contractions.
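
A minimal Python sketch of this idea follows. It is a toy model under stated assumptions, not a proposal for the actual implementation: `SEP`, `apply_exceptions` and `contract` are hypothetical names, the separator character is an arbitrary choice, and dot patterns are written as plain strings. The separator is inserted before contraction, blocks any contraction that would span it, and produces no output cell of its own.

```python
SEP = "\uffff"  # hypothetical invisible separator; any otherwise unused character would do


def apply_exceptions(text, exceptions):
    """Pass 0 (toy): insert SEP wherever an exception pattern marks '_'."""
    for exc in exceptions:
        plain = exc.replace("_", "")
        marked = exc.replace("_", SEP)
        text = text.replace(plain, marked)
    return text


def contract(text, contractions):
    """Toy contraction pass: SEP never matches a rule and emits nothing."""
    cells = []
    i = 0
    while i < len(text):
        if text[i] == SEP:
            i += 1  # drop the separator without creating a word boundary
            continue
        for letters, dots in contractions.items():
            if text.startswith(letters, i):
                cells.append(dots)
                i += len(letters)
                break
        else:
            cells.append(text[i])  # no contraction applies here
            i += 1
    return cells


rules = {"of": "12356"}
exceptions = ["pro_fes"]
blocked = contract(apply_exceptions("professor", exceptions), rules)
allowed = contract(apply_exceptions("office", exceptions), rules)
```

In this model "professor" comes out letter by letter, while "office" still gets the "of" contraction, because no exception pattern matches it.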

It might be advantageous to be able to specify that the exception can only occur at the beginning or at the end of a word, or maybe only as a whole word. So perhaps there should be opcodes like wordex, partwordex, alwaysex, etc. However, I am not sure how important this last bit might be in different languages.

Hope it makes sense.

Dec 09 '23 23:12 BueVest

I gave the exception opcode a thought last night. Two things:

1. Are we sure that this can replace hyphenation patterns? Hyphenation patterns are combined in such a way that they can also "cancel" each other out. What if we need exceptions to exception rules?
2. I can also see the possibility of using plain old translation rules to implement exceptions, without the need for any new features, and without the need to inhibit rules with temporary characters. My approach applied to your example:
```
partword of 12356
noback context __pr[of]es *
```

Dec 10 '23 18:12 bertfrees

Yes, for this mechanism to work, the exceptions would have to be applied according to length (longest first). That should catch the exceptions to the exceptions to the exc…

In that way, we should be able to do the same as with hyphenation, even though I think the hyphenation approach is more flexible and orderly.
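
The longest-first idea can be sketched in Python as follows. This is a toy model under assumptions: `insert_separators` is a hypothetical helper, `|` stands in for the invisible separator, and `profesx` is a made-up pattern used only to show a longer exception overriding a shorter one. A pattern without `_` re-allows contraction inside the longer string.

```python
def insert_separators(word, exceptions, sep="|"):
    """Toy longest-first exception matcher (not liblouis behaviour).

    At each position the longest matching exception wins and consumes
    its characters, so a shorter exception cannot fire inside it.
    """
    table = {e.replace("_", ""): e.replace("_", sep) for e in exceptions}
    out = []
    i = 0
    while i < len(word):
        hits = [k for k in table if word.startswith(k, i)]
        if hits:
            best = max(hits, key=len)  # longest exception first
            out.append(table[best])
            i += len(best)
        else:
            out.append(word[i])
            i += 1
    return "".join(out)


exceptions = ["pro_fes", "profesx"]  # "profesx" is a hypothetical exception to the exception
short_hit = insert_separators("professor", exceptions)
long_hit = insert_separators("profesxy", exceptions)
```

One limitation of this sketch is that it only compares exceptions starting at the same position; a longer exception that starts later in the word does not override a shorter one that starts earlier.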

Yes, it could all be done with the existing partword rules, like it is done at present in most g2 tables, but that was exactly my point. I.e. to separate the actual rules from all manners of exceptions based on compound words etc. Also, rules are atomic where cursor placement is concerned, so long rules make for very inaccurate placement of the cursor on Braille displays or note-takers.

Perhaps it is not worth spending a lot of resources on this, unless someone intends to implement a new g2 table using this technique or clean up some old ones. For my part, I am more than happy using the hyphenation approach like I did from the start.

Dec 10 '23 19:12 BueVest

> Yes, it could all be done with the existing partword rules, like it is done at present in most g2 tables, but that was exactly my point. I.e. to separate the actual rules from all manners of exceptions based on compound words etc. Also, rules are atomic where cursor placement is concerned, so long rules make for very inaccurate placement of the cursor on Braille displays or note-takers.

Not sure if you understood my suggested approach. It would not make for long rules. It's not something that tables use today to my knowledge.

Dec 10 '23 23:12 bertfrees

Yes, the current tables don’t use the context rule for this, but what would be the advantage compared with the more traditional rules? How would the context approach make for shorter and more precise rules? AFAIK, the context rules are applied together with all other rules in the same pass, which increases the risk of rule collisions. Hyphenation points, on the other hand, are inserted before the actual contraction. If we implemented an exception opcode, I would suggest that it also be applied before the actual contraction rules.

Using the current opcodes, maybe one could use the “correct” opcode something like this:

letter \xffff f # separator defined as letter, so no word boundary
always of 12356
noback correct pr[of]es o\xfffff

In pass2, @f would be removed.

This example is really simple, but with a full set of contraction rules and rules about compound words and syllables, it becomes rather complicated. Many of those g2 tables currently contain several thousand lines.

Dec 11 '23 15:12 BueVest

> How would the context approach make for shorter and more precise rules?

Because with context rules, contrary to other pass 1 translation rules, you have the ability to look forward and back. For every rule you can have a number of corresponding exception rules with more context before and after the characters to replace. There's no magic, no need for multiple passes, or hacks that insert temporary characters.
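
As an analogy only, the look-back and look-forward idea can be sketched with Python regular expressions. This does not reproduce liblouis context-rule semantics: `apply_rules` and the `<12356>` marker are hypothetical, and the lookbehind on `pr` and lookahead on `es` are simply one reading of the `pr[of]es` example above. The exception alternative is tried first and leaves "of" unchanged; the general rule contracts it everywhere else.

```python
import re

# Exception first: "of" preceded by "pr" and followed by "es" is kept as is.
# General rule second: any other "of" is contracted.
RULE = re.compile(r"(?P<keep>(?<=pr)of(?=es))|(?P<contract>of)")


def apply_rules(text):
    """Toy single-pass rewrite; '<12356>' stands for the contracted cell."""
    def replace(match):
        # lastgroup names the alternative that actually matched.
        return match.group() if match.lastgroup == "keep" else "<12356>"
    return RULE.sub(replace, text)


kept = apply_rules("professor")
contracted = apply_rules("office")
```

Only the two letters of "of" are ever consumed, so the surrounding context does not lengthen the replaced span.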

Dec 11 '23 16:12 bertfrees