TEDIT: characters, bytes, and NS byte encoding
I've started my foray into the innards of Tedit, with the objectives of finding and fixing the bugs we have observed, rationalizing its interaction with different character encodings, and providing a consistent programmatic interface for applications (like LFG) that use the basic character primitives to treat Tedit streams like streams on any other file.
Tedit was constructed when the NS character encoding was used as the standard not only for associating glyphs with 16-bit codes but also for representing 16-bit codes as byte sequences on files. The code we have now reflects that evolution, in that there isn't a clean separation between the implementation levels that deal in bytes and byte sequences and the levels that deal in characters. The code is also complex because it embodies an enormous number of careful optimizations that may have little value in our current configuration.
This confusion is mostly invisible (modulo bugs) at the editing user-interface level, but it becomes apparent when plain-text files are to be read or written. It also shows up in the programmatic interface: it is not entirely clear how the position arguments and values of functions like GETFILEPTR, SETFILEPTR, and GETEOFPTR are interpreted, whether they are counting bytes in an underlying plain-text file in some cases, or characters in Tedit files in other cases. And sometimes BIN will return an NS character-shift byte (255), and READCCODE will also sometimes do this, if the stream is positioned at the wrong place.
I think this needs to be cleaned up and made consistent, hopefully with the result that the code becomes simpler and more modular. My initial proposal (comments please) is that the "bytes" of Tedit streams are defined to be 16-bit character codes (with an occasional imageobj). Essentially, BIN and \INCCODE always return the same things, and all of the positional functions count in characters: (SETFILEPTR xxx 25) means that the next BIN or \INCCODE will return the 25th 16-bit code in the stream's enumeration sequence. That might correspond to a byte at some arbitrarily later position in an underlying file stream, but that is a purely internal fact that will not be exposed to a caller.
Similarly, GETEOFPTR of a Tedit stream will return the number of characters in the stream, not the number of bytes that those characters might occupy in any particular byte-sequence representation.
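To make the proposed contract concrete, here is a minimal sketch (TXTSTREAM is a hypothetical Tedit text stream, and this shows the proposed behavior, not necessarily what the current code does):

    (SETFILEPTR TXTSTREAM 25)    (* position counted in characters, not bytes)
    (BIN TXTSTREAM)              (* => the 25th 16-bit code; \INCCODE at the same place would return the same value)
    (GETEOFPTR TXTSTREAM)        (* => total number of characters in the stream, not bytes)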
This may introduce a performance delay when Tedit is opened on a plain-text file. In principle, the whole file has to be scanned to figure out how many characters there are and where they are located in the byte sequence, perhaps copying the whole file into an in-core cache. Opening a very large file might take some time and use a bit of space. But that expansion can also be done incrementally: only cache the characters that are visible initially or through scrolling, or that must be decoded to get to some explicit character position (SETFILEPTR to a high value) or to find out the total length (GETEOFPTR). Those characters would be recoded into byte sequences according to the original file's external format when the file is saved.
In sum: BIN and \INCCODE would return the same 16-bit values for text streams, and position functions would count in characters, not bytes. There will be no direct way of figuring out from a character position what its corresponding byte position might be in a backing file, other than by counting out from the beginning.
Tedit streams will use the domestic mapping of codes to glyphs (XCCS, until we get better Unicode fonts), and Tedit binary files will store those character codes as it does now. Plaintext files will map those domestic codes according to their external formats.
This sounds good in principle. I worry some about how this will impact the internal TEDIT operations for recording formatting information. I suspect that there's not a clean separation between character and byte counting operations on the underlying file stream and a TEdit (text) stream.
I'm not a fan of having to copy (or process) the entire plain-text file before working on it. I think much of the work could be done incrementally, as required, and the results cached in, say, a skip list to map between character and byte positions (in reasonable chunk sizes) so that changing the read-position in characters would not require re-reading the file from the beginning. As you point out, though, GETEOFPTR or SETFILEPTR (beyond the current highest read position) would need to process the file up to the requested position.
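Something along these lines, as a rough sketch of the bookkeeping only (names and representation are made up for illustration; a plain sorted list stands in for the skip list): record a (character-position . byte-position) pair each time a chunk is decoded, and on a SETFILEPTR find the nearest recorded pair at or below the requested character position, then decode forward from there.

    (DEFINEQ (FIND.NEAREST.CHUNK
      (LAMBDA (CHARPOS CHUNKMAP)
        (* CHUNKMAP is a list of (CHARPOS . BYTEPOS) pairs in increasing
           character-position order, one pair per chunk already decoded.
           Returns the last pair at or below CHARPOS, i.e. the place from
           which incremental decoding can resume without starting over.)
        (PROG ((BEST (CONS 0 0)))
          LP  (COND ((NULL CHUNKMAP) (RETURN BEST))
                    ((IGREATERP (CAAR CHUNKMAP) CHARPOS) (RETURN BEST)))
              (SETQ BEST (CAR CHUNKMAP))
              (SETQ CHUNKMAP (CDR CHUNKMAP))
              (GO LP)))))

A linear scan is shown for simplicity; a skip list or other balanced structure would make the lookup sublinear, but the bookkeeping is the same.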
As a side question, I wonder what the expectations are around Unicode which has required more than 16 bits for code points for many years. I assume XCCS was and is only 16-bit code points.
That's a separate, later question, once the rest of this has been cleaned up and we've made more progress towards Unicode as the domestic character encoding (esp. fonts for Unicode codes). There are architectural issues wrt small integers vs. 32-bit FIXPs--Tedit can sit on top of however those issues are resolved.
I think the Find and Substitute commands also may require a whole plain-text file to be "characterized".
@orcmid -- the XCCS character codes and the NS character representation as a sequence of bytes are related but different. XCCS character codes are all 16 bits. The NS representation as a sequence of bytes has various different encodings -- you can put out shift codes that take you into a character set other than set 0, with subsequent 1-byte codes within that set, or shift into 2-byte mode, where each character has a character set and a character, or (and I don't remember whether this was written down explicitly in the versions of the standard I worked with) you could repeat the shift code and go into 3-byte or 4-byte representations per "character" -- it would be wasteful, since there were no code points outside the 2-byte space, but...
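For concreteness, here is a toy decoder for just the single-byte, run-coded case described above (made-up function name; the 2-byte mode and the other corners of the standard are deliberately not handled): a 255 byte means the following byte names the new character set, and every other byte is a character within the current set, so the 16-bit code is 256*charset + byte.

    (DEFINEQ (DECODE.NS.RUNCODED
      (LAMBDA (BYTES)
        (* BYTES is a list of 8-bit values.  A 255 shifts the current
           character set to the value of the following byte; any other
           byte B stands for the 16-bit code 256*CHARSET + B.)
        (PROG ((CHARSET 0) CODES BYTE)
          LP  (COND ((NULL BYTES) (RETURN (DREVERSE CODES))))
              (SETQ BYTE (CAR BYTES))
              (SETQ BYTES (CDR BYTES))
              (COND ((EQ BYTE 255)
                     (SETQ CHARSET (CAR BYTES))
                     (SETQ BYTES (CDR BYTES)))
                    (T (SETQ CODES (CONS (IPLUS (ITIMES 256 CHARSET) BYTE) CODES))))
              (GO LP)))))

So (DECODE.NS.RUNCODED '(255 38 98 255 0 99)) gives (9826 99), which is the βc example that comes up again below.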
@rmkaplan -- I agree that if Find (next) doesn't find something it's going to process to the end of the file, likewise, Substitute (all) is going to need to process the whole file.
@nbriggs "This sounds good in principle. I worry some about how this will impact the internal TEDIT operations for recording formatting information. I suspect that there's not a clean separation between character and byte counting operations on the underlying file stream and a TEdit (text) stream."
The representation in the Tedit binary format is currently confused (which I think is causing some of the glitches). It puts out a sequence of pieces, each of which is either thin (8-bit character codes) or fat (16-bit codes). Right now, I think it also codes up the thin/fat distinction with the NS character-shifting bytes. But then each character has a particular place in the sequence, and the formatting information should therefore be in alignment without inordinate additional complexity.
Plain-text files don't have formatting information, and they stay that way if you don't start fiddling with fonts, paragraphs, etc. If you do, then the piece machinery should take over and keep things straight--putting the file will put out the proper piece sequence.
TEDIT currently has an option (in the Put menu) and code for putting out binary Tedit files in an "old format", a format that was discontinued around 1985. We have debugged the code for reading in old-format files. But is there any reason to maintain the ability (and the menu option) to put out files in the old format? It would be just more stuff to keep in sync.
I sincerely doubt we'll need to share newly created TEdit files with someone whose system can only read old-format TEdit files. I presume there are no tools (regular, or in the internal directories) that only process old-format files without relying on the TEDIT code.
I'm concerned about the architectural impact of changing BIN to not read a single 8-bit byte, and of GETFILEPTR and SETFILEPTR no longer trafficking in byte offsets. TEdit files are not text; they should be treated as binary. I think having another layer that interprets sequence-of-byte as sequence-of-character is fine, but a bad idea to mix them. MAIKO has BIN and BOUT opcodes, for example.
I also think it is fine to assert that anyone who does a SETFILEPTR into a file with spans of bytes knows what they are doing, and that we should treat those cases where the state isn't the stream default for that external format as an error.
I expect the considerations about performance in Medley have changed a lot, but I think it's important to preserve their simplicity.
As for Unicode above 2^16, I liked @johnwcowan's proposal in #350 to handle emoji.
I mean if you are SETFILEPTR into a file, where did the pointer come from? If it's something you got doing arithmetic, you better do byte arithmetic. If there are places in TEdit that confuse the two, let's find and fix them, they're just bugs. No need to change the architecture of files and streams.
Two alternatives for two proposals. For Unicode > 2^16: turn characters into image objects? Only for TEdit, though. For SETFILEPTR et al.: define another attribute, GETFILE-EXTERNALFORMAT-STATE / SETFILE-etc; default is NIL. The value depends on the EXTERNALFORMAT: XCCS could remember the charset if it was run-coded, or the fact that it was double-byte-encoded, if anyone cares. Maybe the ^F^A state could be part of that. The FILEMAP should be constructed so that its pointers are bytes with NIL as the EXTERNALFORMAT-STATE.
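A hypothetical sketch of what that alternative might look like in use (none of these accessors exist today, and the shape of the state value is invented purely for illustration):

    (GETFILEPTR STREAM)                        (* stays a byte offset, as now)
    (GETFILE-EXTERNALFORMAT-STATE STREAM)      (* => NIL by default)
    (SETFILE-EXTERNALFORMAT-STATE STREAM (LIST 'XCCS 'RUNCODED 38))   (* e.g. run-coded, current charset 38)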
The problem is that the byte level is essentially random for text streams. With existing code, if you have a file that starts with βc and you do BINs, you get 255 38 98 255 0 99 (the raw bytes). But if you (in Tedit) put an a on the beginning (aβc), set the pointer back to zero, and start BINning, you get 97 9826.
The actual sequence of bytes makes sense at the level of the internal Tedit implementation, but it is not a useful interface. The character level is more consistent, and consistent with the external-format behavior of other streams. GETFILEPTR and SETFILEPTR are meaningful in the sense that if you read characters up to a certain point and get the "file" pointer X at that point, then move to another position to do something else, setting it back to X will enable you to see as the next character what you would have seen if you hadn't moved and returned.
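That is, the only invariant being promised is the round trip (TXTSTREAM again hypothetical, positions counted in characters under this proposal):

    (SETQ X (GETFILEPTR TXTSTREAM))    (* remember the current character position)
    (SETFILEPTR TXTSTREAM 0)           (* go read something near the front)
    (BIN TXTSTREAM)
    (SETFILEPTR TXTSTREAM X)           (* come back)
    (BIN TXTSTREAM)                    (* => the character you would have seen had you never moved)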
I'm concerned about the architectural impact of changing BIN to not read a single 8-bit byte, and of GETFILEPTR and SETFILEPTR no longer trafficking in byte offsets. TEdit files are not text; they should be treated as binary. I think having another layer that interprets sequence-of-byte as sequence-of-character is fine, but a bad idea to mix them. MAIKO has BIN and BOUT opcodes, for example.
That amplifies what was bothering me when I said
I suspect that there's not a clean separation between character and byte counting operations on the underlying file stream and a TEdit (text) stream.
I don't know what it really means to have (traditional 1-byte) BIN operations on the textstream. If one implemented READCCODE to get the sequence of character codes, but BIN gets you a sequence of bytes that are just the bytes of the character codes, then you're losing the character-code boundary information and can't reconstruct the original by BOUTing the BIN results. One could invent various other results for BIN, but... are they useful?
The problem is, a TEDIT binary file is opened and presented as a Textstream. Conceptually after that there is in fact no longer an “underlying file”. It is only a performance optimization that the whole file hasn’t been slurped up into an internal string (fat or thin, depending, with looks attached). The only consistent model is of a sequence of (looked) characters.
Below the line, at the level of Tedit implementation, it grabs characters from the backing file when it needs them to fill in parts of the conceptual string that haven’t yet been characterized. This is to defer slurping things until for whatever reason they break through to the conceptual presentation.
(For the benefit of the user it also keeps the original name of the binary file, to be used as a prompt for the Put command.)
Character operations make sense on a text stream, 8-bit byte operations don’t make sense (particularly when some of the parts of the text don’t appear in the backing file at all).
We could say that text streams are not binnable (there is a stream flag for that), and BIN presumably would cause an error. That would provide early protection against something that Larry may be worried about, wherein information might be invisibly lost if the result of BIN is stored in an 8-bit field somewhere.
Or we could coerce BIN to (essentially) NTHCHARCODE on the conceptual string, which is what John tried to do (but with the stream presentation the proper function would be READCCODE).
It might make a little more sense to think of byte-level operations when a plain-text file is opened as a Textstream. But if you really wanted to treat it as bytes, you should open it up as an ordinary stream. Then you can see and manipulate the NS character shifts and the UTF-8 bytes with high-order bits, etc. Tedit wants to present those bytes as characters.
The problem is, a TEDIT binary file is opened and presented as a Textstream. Conceptually after that there is in fact no longer an “underlying file”. It is only a performance optimization that the whole file hasn’t been slurped up into an internal string (fat or thin, depending, with looks attached). The only consistent model is of a sequence of (looked) characters.
Below the line, at the level of Tedit implementation, it grabs characters from the backing file when it needs them to fill in parts of the conceptual string that haven’t yet been characterized. This is to defer slurping things until for whatever reason they break through to the conceptual presentation.
I agree, but in the TEdit implementation is it really grabbing characters from the backing file? That is, is it (would it be?) relying on the external format decoders to get those characters (I guess those are doing BINs on the file).
I think a textstream presented to consumers outside TEdit should not be BINnable, which would help catch inadvertent errors, as you note. Though, @masinter -- to my surprise, the Maiko BIN opcode can return a 16-bit number if you use it to read a "byte" from a textstream. I haven't investigated what path through the code does that.
Does TEDIT.GET.LOOKS let you interrogate the looks of the current position of the stream as you READCCODE through it?
I agree, but in the TEdit implementation is it really grabbing characters from the backing file? That is, is it (would it be?) relying on the external format decoders to get those characters (I guess those are doing BINs on the file).
Well, it's grabbing the characters that it itself put there in the (binary) Tedit file, however it decided to encode them. It is still opaque to me; I'm still just nibbling at the edges. But it does have a fat/thin distinction in its file format, and my assumption was that this would be correlated with the pieces, each piece being a collection of looks, character encodings, and an indication of how its characters are represented (e.g. a bit that says whether the piece is fat or thin). My impression right now is that there may be fat/thin segments within each piece, and he defaulted to using the NS charset-shifting scheme to signal the changing formats. That may be the source of some of the problems: it may get confused if the access is not strictly sequential.
But as I said, I’m still nibbling.
In terms of reading plain-text files, I think that existing XCCS files can be dealt with relatively efficiently, without requiring a characterizing copy. Right now the code (incorrectly) makes a piece for the whole file. But if it is a file with NS codes and NS character-set shifts, a proper piece table can be constructed just by scanning the file for segments that are bounded by the 255 shifting sequences, and those pieces can just point into file byte positions.
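A rough sketch of that scan (made-up names; a real version would build whatever structure the piece machinery actually wants): walk the bytes once, and each time a 255 shift sequence is seen, close off the current piece and start a new one recording the character set and the byte span it covers in the file.

    (DEFINEQ (SCAN.NS.PIECES
      (LAMBDA (BYTES)
        (* BYTES is the file's byte sequence as a list, for illustration.
           Returns a list of pieces of the form (CHARSET START . END), where
           START and END bound a run of single-byte characters all in CHARSET,
           END exclusive.  The 255 shift sequences themselves belong to no piece.)
        (PROG ((CHARSET 0) (POS 0) (START 0) PIECES BYTE)
          LP  (COND ((NULL BYTES)
                     (COND ((IGREATERP POS START)
                            (SETQ PIECES (CONS (CONS CHARSET (CONS START POS)) PIECES))))
                     (RETURN (DREVERSE PIECES))))
              (SETQ BYTE (CAR BYTES))
              (COND ((EQ BYTE 255)
                     (COND ((IGREATERP POS START)
                            (SETQ PIECES (CONS (CONS CHARSET (CONS START POS)) PIECES))))
                     (SETQ CHARSET (CADR BYTES))
                     (SETQ BYTES (CDDR BYTES))
                     (SETQ POS (IPLUS POS 2))
                     (SETQ START POS))
                    (T (SETQ BYTES (CDR BYTES))
                       (SETQ POS (ADD1 POS))))
              (GO LP)))))

For the βc example from earlier, (SCAN.NS.PIECES '(255 38 98 255 0 99)) gives ((38 2 . 3) (0 5 . 6)): one piece in character set 38 covering the β byte, and one in set 0 covering the c.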
UTF8 will require more work.
See #906 for further development.