StringEncodings.jl

Introduce Encoding parametric singleton type

Open nalimilan opened this issue 9 years ago • 14 comments

First step towards efficient encoders for common encodings, as well as towards providing information about encodings.

This also allows adding convenience methods to base I/O functions taking an additional encoding parameter without risking ambiguities.

See the new tests for an illustration of the API.

@ScottPJones What do you think of this PR? I've tried implementing most of the features from https://github.com/quinnj/Strings.jl/pull/3/, but with a parametric singleton type Encoding. This allows supporting arbitrary encodings, and generating methods on-the-fly without polluting the methods table with support for all possible encodings.
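For readers unfamiliar with the pattern, here is a minimal sketch of what a parametric singleton encoding type can look like (the names below are illustrative, not necessarily the final StringEncodings.jl API):

```julia
# A singleton type parameterized on the encoding name: each encoding gets
# its own concrete type, so methods can dispatch on specific encodings
# without enumerating all possible encodings in the method table.
struct Encoding{name} end

# Construct an instance from a string at run time.
Encoding(s::AbstractString) = Encoding{Symbol(s)}()

# A method defined generically over the parameter...
encname(::Encoding{enc}) where {enc} = String(enc)

# ...and a method specialized for one particular encoding.
codeunit_size(::Encoding) = 1                          # default: byte-oriented
codeunit_size(::Encoding{Symbol("UTF-16LE")}) = 2      # specialized override
```

Because each encoding is its own type, the compiler generates specialized code per encoding on demand, which is what makes the "methods on-the-fly" approach possible.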

But I must say I don't know why you need these functions (like codeunit or native_endian), so I cannot tell whether this will work for you.

TODO:

  • classify the encodings currently in encodings_other. Can all non-UTF/UCS encodings be considered 8-bit?
  • handle aliases like UTF16LE
  • test the AbstractString convenience methods

nalimilan avatar Feb 13 '16 17:02 nalimilan

I'll start reviewing this this weekend. Great to see more being done to handle strings in a good way!

Did you look at the discussions about making the encodings use traits? Will this be able to handle having some sort of hierarchy of encodings (i.e. UTF-16LE / UTF-16BE both being UTF-16, the only difference being the endianness)? That is why I wanted native_endian, so that the code could be made more generic, using a simple call to a function that swaps the bytes if the encoding is not native-endian. The same goes for codeunit, which would be UInt8 for all byte-oriented encodings, but UInt16 for all of the UTF-16* variants, and UInt32 for the UTF-32* ones.
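As a sketch, those trait functions could look something like the following with a parametric singleton type (function and encoding names here follow the discussion but are assumptions, not the merged API; `codeunit_type` stands in for the discussed `codeunit` to avoid clashing with `Base.codeunit`):

```julia
struct Encoding{name} end
Encoding(s::AbstractString) = Encoding{Symbol(s)}()

# Code unit type per encoding family.
codeunit_type(::Encoding) = UInt8                           # byte-oriented default
codeunit_type(::Encoding{Symbol("UTF-16LE")}) = UInt16
codeunit_type(::Encoding{Symbol("UTF-16BE")}) = UInt16
codeunit_type(::Encoding{Symbol("UTF-32LE")}) = UInt32
codeunit_type(::Encoding{Symbol("UTF-32BE")}) = UInt32

# Whether the encoding's byte order matches the host's
# (Base.ENDIAN_BOM is 0x04030201 on little-endian hosts).
native_endian(::Encoding{Symbol("UTF-16LE")}) = ENDIAN_BOM == 0x04030201
native_endian(::Encoding{Symbol("UTF-16BE")}) = ENDIAN_BOM == 0x01020304

# Generic code: swap bytes only when the encoding is not native-endian.
fixendian(enc::Encoding, x) = native_endian(enc) ? x : bswap(x)
```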

I think the encodings can be classified along a few dimensions:

  • the code unit size, and whether the encoding is native- or opposite-endian (for code units of 2 or 4 bytes);
  • whether each code point takes 1, 2, or more code units;
  • whether the code points are Unicode (UTF-8, UTF-16, UTF-32 and variants), a subset of Unicode (such as ASCII, ANSI Latin 1, UCS-2), ASCII-compatible (such as CP1252, where the first 128 code points are ASCII), or not even ASCII-compatible (such as EBCDIC, and a few others).

The distinction between 16-bit UCS-2 (which can be directly indexed) and UTF-16 (which could be called a DWCS, and can't be directly indexed) can be very important for performance. 8-bit character sets are much easier to handle efficiently, and can be done with simple tables, whereas for multibyte encodings (except UTF-8) you usually need special code plus large tables for both directions. I've had to deal with Shift-JIS, EUC, GB, and Big5 a lot in the past. Note that EUC-JP is not a DBCS, it is an MBCS (characters added by the later standard take 3 bytes).

ScottPJones avatar Feb 13 '16 20:02 ScottPJones

This definitely looks like a good start! I hope you don't mind all the comments!

ScottPJones avatar Feb 13 '16 20:02 ScottPJones

I was also just thinking: a lot of the classification I'd like to see can be done programmatically, for example checking whether an 8-bit character set is ASCII-compatible, and whether it is single-, double-, or multi-byte, by running through all of the characters in the character set and checking the results. For example, CP864 (an Arabic char set) looks compatible, but it is not (the % character is replaced by \u066a).
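Such a check could be sketched like this, assuming some single-byte decode table mapping each byte to a Unicode code point (the tables below are toy stand-ins built for illustration; a real generator would query iconv or ICU, and CP864 is only mimicked here by its % remapping):

```julia
# Hypothetical single-byte decode tables for illustration.
const LATIN1_TABLE = Dict{UInt8,Char}(b => Char(b) for b in 0x00:0xff)
const CP864_LIKE   = merge(copy(LATIN1_TABLE), Dict(0x25 => '\u066a'))  # '%' remapped

decode_byte(table, b::UInt8) = table[b]

# An 8-bit charset is ASCII-compatible if every byte 0x00-0x7f decodes
# to the same code point as in ASCII.
is_ascii_compatible(table) =
    all(b -> decode_byte(table, b) == Char(b), 0x00:0x7f)
```

The same loop structure, run over the full byte range against a real converter, would also reveal unmapped bytes and hence whether the set is truly single-byte.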

ScottPJones avatar Feb 13 '16 21:02 ScottPJones

> I was just thinking also, a lot of the classification that I'd like to see can be done programmatically, for example to check if an 8-bit character set is ASCII compatible or not, and whether it is single, double, or multi byte, by running through all of the characters in the Unicode character set and checking the results. For example, CP864 (an Arabic char set) looks compatible, but it is not (the % character is replaced by \u066a).

Actually, I've just bumped into this: http://demo.icu-project.org/icu-bin/convexp?conv=hp-roman8 It seems that ICU provides information about all encodings, and in particular whether it's ASCII-compatible.

nalimilan avatar Feb 14 '16 15:02 nalimilan

Ah, that's great; I see it also has the information needed to decide whether an encoding is single, double, or multi code unit.

ScottPJones avatar Feb 14 '16 15:02 ScottPJones

@ScottPJones Please have a look at the stub EncodingInfo type and to the partial list of encodings. Do you think this provides all the information we need?
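For readers without the diff at hand, the kind of information being discussed could be collected in a struct along these lines (field names are guesses for illustration, not the actual stub in the PR):

```julia
struct EncodingInfo
    name::String
    codeunit::DataType        # UInt8, UInt16 or UInt32
    native_endian::Bool       # meaningful for 2- and 4-byte code units
    min_codeunits::Int        # code units per code point (lower bound)
    max_codeunits::Int        # upper bound (e.g. 1 for UCS-2, 2 for UTF-16)
    ascii_compatible::Bool    # first 128 code points identical to ASCII
    unicode::Bool             # code points are (a subset of) Unicode
end

# Two example entries.
utf16le = EncodingInfo("UTF-16LE", UInt16, true, 1, 2, false, true)
latin1  = EncodingInfo("ISO-8859-1", UInt8, true, 1, 1, true, true)
```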

nalimilan avatar Feb 14 '16 15:02 nalimilan

The new encodinginfo stuff looks much better, yes. It will be nice if we can come up with a way to automatically generate the tables, from either iconv or ICU. Another thing might be to somehow have the encoding stuff be able to have the information for the tables for encodings that we directly support, while allowing automatically falling back to iconv for encodings that we simply don't care about that much (like UTF7 and most of the obsolete EUC, GB, Big5, Mac, etc ones).

ScottPJones avatar Feb 14 '16 15:02 ScottPJones

> The new encodinginfo stuff looks much better, yes. It will be nice if we can come up with a way to automatically generate the tables, from either iconv or ICU.

I think it would be easier to take the code that does the same thing in iconv-lite: https://github.com/ashtuchkin/iconv-lite/blob/master/generation/gen-sbcs.js (generated file: https://github.com/ashtuchkin/iconv-lite/blob/master/encodings/sbcs-data-generated.js)

> Another thing might be to somehow have the encoding stuff be able to have the information for the tables for encodings that we directly support, while allowing automatically falling back to iconv for encodings that we simply don't care about that much (like UTF7 and most of the obsolete EUC, GB, Big5, Mac, etc ones).

Yes, that was the idea. Using the Tim Holy traits trick based on the encodings info, it should be easy to override the current StringEncoder and StringDecoder where we have a specialized version.
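A minimal sketch of that dispatch pattern (the type and function names here are hypothetical, chosen just to show the Holy-traits structure):

```julia
struct Encoding{name} end
Encoding(s::AbstractString) = Encoding{Symbol(s)}()

# Trait values: which implementation should handle a given encoding.
struct NativeCodec end   # encodings with a specialized Julia implementation
struct IconvCodec end    # everything else falls back to iconv

codecstyle(::Encoding) = IconvCodec()                        # default
codecstyle(::Encoding{Symbol("UTF-16LE")}) = NativeCodec()   # specialized

# The public entry point dispatches on the trait, not on the encoding itself,
# so adding a specialized coder only requires a new codecstyle method.
decoder_kind(enc::Encoding) = decoder_kind(codecstyle(enc), enc)
decoder_kind(::IconvCodec, enc)  = "iconv fallback"
decoder_kind(::NativeCodec, enc) = "specialized decoder"
```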

nalimilan avatar Feb 14 '16 17:02 nalimilan

Those generators from iconv-lite look nice (even if they are in JS instead of Julia! ;-) ). I see he does what I'd been talking about, and checks whether the first half of the table is the same as ASCII. For a lot of the newer 8-bit ISO character sets, it's also useful to check if the range 0x0:0x9f is identical to ANSI Latin 1/Unicode.

Another property that is very useful to keep track of for efficient converters, and that the generator can figure out, is whether all of the characters map to the BMP, because then smaller tables can be used (in fact, all characters frequently map to one particular section of the BMP).

What I wonder, and maybe you have some ideas here, is what would be the best way to set things up so that we can directly use iconv or ICU or whatever if we don't support an encoding, while for the encodings we directly support, having different methods for different classes of encodings: cases where no tables are needed, or where the only differences are the tables, possibly loaded from a binary file at run-time. Python 3 seems to have a framework that allows all of that.

It would really be nice for Julia to have best-in-class support for character sets, encodings & strings, even compared to Python 3 and Swift 2.0!

ScottPJones avatar Feb 14 '16 19:02 ScottPJones

> What I wonder, maybe you have some ideas, is what would be the best way to set things up so that we can either directly use iconv or ICU or whatever, if we don't support an encoding, and for the ones we directly support, have different methods for different classes of encodings, for cases where no tables are needed, or where the only differences would be tables, possibly loaded from a binary file at run-time. Python 3 seems to have a framework that allows all of that.

I think traits allow for exactly this kind of thing. You just need to add methods for StringEncoder and StringDecoder based on the information we have about encodings.

nalimilan avatar Feb 14 '16 21:02 nalimilan

bump (even though it's your own PR ;-) ). This has fallen behind the main branch, but it still seems a very nice improvement, if you plan to move forward with StringEncodings.jl.

ScottPJones avatar Jan 18 '17 03:01 ScottPJones

That's not at the top of my priorities right now, though I'd be happy to review a PR if you want to update it. Do you need a particular feature?

nalimilan avatar Jan 18 '17 09:01 nalimilan

OK, I'm not sure how I'd make a PR on this PR though.

ScottPJones avatar Jan 18 '17 13:01 ScottPJones

Just open a new PR. Anyway, only the second commit is useful here, IIRC.

nalimilan avatar Jan 18 '17 14:01 nalimilan