
EscapeParser issue with utf8

Open migueldeicaza opened this issue 5 years ago • 2 comments

The trouble comes from the 8-bit sequences that can be sent, which conflict with UTF-8, in particular the DCS initiator (0x90). Previously the code was consuming this byte when in 8-bit mode:

if currentState == .Ground && code > 0x1f {
    // Gather print data

So this was consuming the 0x90 as print data. So the variation became:

if currentState == .Ground && (code > 0x1f && code < 0x80 || (code > 0xc2 && code < 0xf3)) {

But this ends up eating UTF8 sequences (see mc running in this mode)
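
To make the conflict concrete, here is a minimal Swift sketch (not SwiftTerm's actual code) that applies the byte test from the revised condition to the UTF-8 encoding of U+0410. The names `treatedAsPrint` and `isC1Byte` are made up for illustration, and the sketch assumes that any byte failing the test falls through to the escape state machine, where 0x90 is read as DCS in 8-bit mode.

```swift
// Hypothetical helper mirroring the byte test in the revised condition.
func treatedAsPrint(_ code: UInt8) -> Bool {
    return (code > 0x1f && code < 0x80) || (code > 0xc2 && code < 0xf3)
}

// Bytes in 0x80...0x9f are C1 controls when the stream is read as 8-bit.
func isC1Byte(_ code: UInt8) -> Bool {
    return code >= 0x80 && code <= 0x9f
}

// U+0410 (CYRILLIC CAPITAL LETTER A) encodes in UTF-8 as 0xD0 0x90.
let sample = Array("\u{0410}".utf8)

for byte in sample {
    let hex = String(byte, radix: 16)
    print("0x\(hex): print=\(treatedAsPrint(byte)) c1=\(isC1Byte(byte))")
}
```

The lead byte 0xD0 passes the test and is gathered as print data, but the continuation byte 0x90 does not, and it is also the 8-bit DCS initiator. A byte-level check cannot tell the two apart without knowing whether it is in the middle of a UTF-8 sequence.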

migueldeicaza avatar Apr 04 '20 03:04 migueldeicaza

Well you have basically two (and a half) options here:

  • Implement full ISO-2022 compliance. This is how terminals were meant to treat 7/8-bit data streams. Doing this gives you a great level of compatibility with older systems, but the C0/C1 and G0/G1/G2/G3 state handling is error-prone. For full support you basically have to be able to change the parser transition table on the fly (like unmapping the C1 area, or redeclaring it as printable for certain character sets). TL;DR: not the way to go these days, ISO-2022 is basically dead.
  • Go as a Unicode/UTF-8-only emulator. Any data arriving at the parser is meant to map correctly to a Unicode codepoint. The parser itself only needs to account for codepoints up to \xA0, thus UTF-16 and UTF-32 work out of the box. For UTF-8 you have to decode on the fly or beforehand, otherwise the parser will confuse C1 codes with the bytes of multibyte characters. That's the preferred way to start with these days.
  • Take the middle ground between ISO-2022 and Unicode (that's what most emulators do). Some character sets of ISO-2022 are still used (like the graphic supplement from DEC), so implement those in the terminal. UTF-8 is actually not further specced in ISO-2022 regarding the compliance level. Here every modern emulator would switch into "full UTF-8": the idea is to treat it as a stream encoding (thus a "naked" C1 byte must not occur anymore, instead it has to be encoded properly as a 2-byte character). Switching to another charset in ISO-2022 realms means changing back into 7/8-bit mode (depending on the sequence initiating it), again allowing single-byte C1 codes. This can be achieved without parser changes by applying the charset replacements either beforehand on the full stream, or later on the interesting data portions in the terminal functions (the latter is technically not 100% ISO-2022 compatible anymore, but should work due to the stronger "whole stream should be UTF-8" rule). That's where the confusion starts, and most producers get it wrong with C1 codes. Thus a rule of thumb: never use the 8-bit variant of C1 in an environment that's otherwise mainly Unicode/UTF-8.
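
A rough Swift sketch of the "decode first, then parse codepoints" idea from the second and third options. Everything here is an assumption for illustration (the `Token` enum and `classify` function are not SwiftTerm API), and it ignores the real-world problem of a multibyte sequence being split across read chunks. It relies on the standard library's `String(decoding:as:)`, which replaces invalid UTF-8 with U+FFFD.

```swift
// Hypothetical token type; a real parser would drive a state machine instead.
enum Token: Equatable {
    case c0(UInt8)                 // C0 controls (DEL is grouped here for simplicity)
    case c1(UInt8)                 // C1 controls, recognized by codepoint, not by raw byte
    case printable(Unicode.Scalar)
}

// Decode the bytes as UTF-8 first, then classify each resulting codepoint.
func classify(_ bytes: [UInt8]) -> [Token] {
    let text = String(decoding: bytes, as: UTF8.self)
    var tokens: [Token] = []
    for scalar in text.unicodeScalars {
        switch scalar.value {
        case 0x00...0x1f, 0x7f:
            tokens.append(.c0(UInt8(scalar.value)))
        case 0x80...0x9f:
            tokens.append(.c1(UInt8(scalar.value)))
        default:
            tokens.append(.printable(scalar))
        }
    }
    return tokens
}

print(classify([0xC2, 0x90]))  // DCS encoded properly as a 2-byte character
print(classify([0xD0, 0x90]))  // U+0410, whose second byte is 0x90
print(classify([0x90]))        // a "naked" C1 byte, invalid as UTF-8
```

With this ordering, 0xC2 0x90 is seen as the C1 code U+0090, 0xD0 0x90 stays a single printable character, and a lone 0x90 becomes U+FFFD rather than starting a DCS sequence.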

Imho fully implementing ISO-2022 is a waste of time these days; most programs have adapted to the stream encoding rule of Unicode. If a certain program refuses to work, use luit as a transcoder in between.

jerch avatar Apr 04 '20 15:04 jerch

Thank you for the detailed description! I am going to take another stab at this.

migueldeicaza avatar Apr 12 '20 04:04 migueldeicaza