Specify source code encoding?

Open eernstg opened this issue 4 years ago • 9 comments

Thanks to @osa1 for pointing out that the specification may be too vague in this respect!

The language specification and other specification documents seem to be essentially silent on the question about source code encoding.

In particular, the step that obtains a 'library' (which is a semantic concept that allows the language processing steps like static analysis and code generation to obtain semantic entities like 'class declarations' and the like) from a URI is simply the following:

Let Li be the library obtained from the source code denoted by si.

In this sentence, si is the string which is obtained from a <configurableURI> after evaluation of the if constructs, and the interpretation of a URI is essentially implementation defined:

This specification does not discuss the interpretation of URIs, with the following exceptions. ... All further interpretation of URIs is implementation dependent.

The current implementations seem to accept UTF-8 encoded source code, that is, pure 7 bit ASCII for everything other than string literals, and UTF-8 in string literals.

It is mentioned that strings are represented (at run time, presumably) as UTF-16 code units, which could be relevant in the context of JavaScript interop code, but it is hardly a constraint on the encoding which is to be used in the external representation of libraries.

It may be fine to leave the external representation of Dart source code as an implementation specific property, but it is tempting to at least specify that UTF-8 must be among the encodings which are supported.

@munificent, @natebosch, @jakemac53, @lrhn, @stereotype441, @leafpetersen, WDYT?

eernstg avatar Apr 04 '22 15:04 eernstg

The spec does say:

Dart source text is represented as a sequence of Unicode code points.

I can dimly recall some ancient discussions with the old language team about this and I think they felt at the time that the spec should not sully itself with the details of encodings and leave that up to implementations.

I think this caused some frustration on the part of the various implementation teams because it wasn't clear what encodings they were obliged to support or how to detect it. One could, in principle, write a Dart compiler that required UTF-32 or even ASCII and nominally be within the bounds of the spec even though it would cause lots of user pain.

I think mandating that they have to support UTF-8 makes sense. I would even be fine saying that Dart source files are always UTF-8. I don't know if any of our tools even support any other encoding. If they do, I have no idea if anyone relies on that fact.

munificent avatar Apr 04 '22 19:04 munificent

The spec does say:

Dart source text is represented as a sequence of Unicode code points.

Right, and that's actually a subtle statement because it restricts the source code to be some encoding of Unicode (such as UTF-8 or UTF-16), but it doesn't say which one. Oh, the purity! ;-)
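To make that point concrete, here is a small sketch (in Python, chosen only for illustration; the snippet and variable names are made up for this example) showing that the same sequence of code points can be stored as quite different byte sequences, depending on the encoding:

```python
# A small Dart snippet, held here as a Python string purely for illustration.
source = 'var s = "æøå";'

# Three different byte-level encodings of the same text.
encoded = {
    name: source.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}

# The byte sequences differ, even in length...
byte_lengths = {name: len(data) for name, data in encoded.items()}

# ...but decoding each one yields the same sequence of code points.
code_points = {
    name: [ord(ch) for ch in data.decode(name)]
    for name, data in encoded.items()
}
all_same = len({tuple(cps) for cps in code_points.values()}) == 1

print(byte_lengths)
print(all_same)
```

So "a sequence of Unicode code points" pins down what the text is, but not which bytes a tool has to read from disk.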

eernstg avatar Apr 04 '22 22:04 eernstg

The current implementations seem to accept UTF-8 encoded source code, that is, pure 7 bit ASCII for everything other than string literals, and UTF-8 in string literals.

Minor nit: the implementations also accept UTF-8 encoded non-ASCII text in comments.

Anyway, FWIW, I'm in favor of making the spec more explicit about what we accept. I don't think there's any benefit in leaving this detail "up to the implementation" when in practice, we are the implementors, and we know exactly what we did 😄

stereotype441 avatar Apr 05 '22 15:04 stereotype441

I think whether the spec should specify source code encoding or not depends on what purpose the spec serves:

  1. As a documentation of the implementation

    • In this case we should say that we expect UTF-8, as that's what we do, and there is no motivation to support any other encoding.
  2. As a specification for the Dart language users to aid understanding of Dart programs, features, etc.

    • In this case the spec should specify concrete syntax, but no need to specify encoding. As long as the user can somehow view some Dart code the spec will help understanding it.
  3. As a specification for Dart implementors to follow, to allow alternative implementations, compilers, interpreters, static analyzers, debuggers, etc.

    • In this case we may still choose not to specify source code encoding, but that will lead to less compliance between these tools: in principle a tool author may choose to support other encodings, and GitHub repos and libraries may appear with files in different encodings that only work in some implementations and not others.
    • In general, specifying more should lead to more compliance, unspecified details will cause implementations to diverge, and programs/libraries that work with some implementations and not others may start to appear.

In section 1 (Scope), we say that we specify syntax and semantics, and in section 2 (Conformance) we say "a conforming implementation of the Dart programming language must provide and support all the APIs". I'm not sure what "API" here means exactly. If we're talking about the library methods, properties, classes etc. then I would guess that it's possible for an implementation to provide the same API, but maybe some features work differently. But in any case, it seems like we want the spec to serve as (3). So the question now is how much compliance we want between alternative implementations.

osa1 avatar Apr 07 '22 09:04 osa1

I'd emphasize purposes 2 and 3: The language specification (plus any temporary feature specifications) should specify the language for developers, such that they can construct Dart programs with whatever behavior they need. At the same time, the specification should specify the language for tool implementers, such that they can enable analysis and compilation/execution of said programs, and such that Dart developers and tool implementers agree on which behavior to expect when the programs are analyzed, compiled, and executed.

You could say that purpose 1 is achieved as a consequence of the other two.

In practice, developers may well rely on lots of other things, e.g., community folklore and personal advice, more informal tutorials, StackOverflow questions, etc., but during the construction of all of those sources of information there would ideally be a dependency chain that ends up in the language specification.

About the APIs: That would be declarations in standard libraries, e.g., the methods of Iterable and such. I think section '2 Conformance' was written many years ago in order to satisfy formal requirements on an ECMA standards document. Presumably, the main purpose was to ensure that a conforming implementation would conform to not just the language specification itself, but it would also have compatible standard libraries.

eernstg avatar Apr 07 '22 09:04 eernstg

As a historical curiosity, the Dart spec used to say:

Dart source text is represented as a sequence of Unicode code points normalized to Unicode Normalization Form C.

from version 0.05 to 0.20. The normalization requirement was removed in version 0.30, which brought us to the current phrasing.

I think that's sufficient for the language specification. We can then have a separate tool specification which requires those Unicode code points to be represented by a valid UTF-8 encoding. We can, at any time, choose to allow other encodings without needing to change the language specification. (Well, if we allow literal unpaired surrogates in strings, which I think we currently do, we can't require it to be valid UTF-8. We should probably change "code points" to "scalar values" and require you to escape your literal unpaired surrogates).

So, the Dart source code provided to our tools must be valid UTF-8 encodings of Unicode scalar values. That disallows UTF-8 encodings of surrogate values (Unicode scalar values are Unicode code points except for surrogate values). It also disallows invalid UTF-8 encodings, like overlong encodings or 5- and 6-byte sequences.
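As a rough illustration of what such a validity requirement accepts and rejects, here is a sketch in Python (not what the Dart tools actually run; `is_valid_utf8` is a hypothetical helper). It relies on Python's strict UTF-8 decoder, which rejects encoded surrogates, overlong encodings, and 5- and 6-byte sequences:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 that encodes only scalar values."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample inputs, one per case mentioned above.
samples = {
    "plain ASCII": b"var x;",
    "two-byte sequence (U+00E6)": b"\xc3\xa6",
    "encoded surrogate (U+D800)": b"\xed\xa0\x80",
    "overlong encoding of '/'": b"\xc0\xaf",
    "5-byte sequence": b"\xf8\x88\x80\x80\x80",
}

for label, data in samples.items():
    print(f"{label}: {is_valid_utf8(data)}")
```

Only the first two samples pass; the other three are exactly the kinds of input that a "valid UTF-8 encodings of Unicode scalar values" rule would exclude.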

(If we're going to get nitpicky, DartPad actually creates programs as UTF-16 code units. I guess we convert to UTF-8 before parsing?)

lrhn avatar Apr 07 '22 09:04 lrhn

@lrhn, as far as I can see you're the only one who does not support specifying that Dart source code must be UTF-8 encoded. Is this an overinterpretation? If we decide that it should be specified by tools, not the language specification, do we have a location where we can specify requirements that pertain to all tools?

Anyway, no matter where it is specified, it would presumably enable all tools to reject invalid UTF-8 as source code (and unless each tool supports some other encodings as well, they probably should reject it). This would not eliminate the need for any further checks at run time (if we do actually check this) about the correctness of the UTF-16 representation. For instance, String.fromCharCodes(...) can introduce a wrong UTF-16 encoding, unless it checks at run time.

In other words, this is not going to make String operations faster, and it is not going to improve the guarantees that every string at run time is correct UTF-16.

The only change would be that every developer/organization out there will know for sure that it is a safe bet to commit to UTF-8, when it comes to the encoding of Dart source code.
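For reference, the kind of run-time check mentioned above (whether a sequence of UTF-16 code units is well formed) could look roughly like the following. This is a sketch in Python with a hypothetical helper name, not a description of what any Dart runtime does:

```python
from typing import Sequence


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """Return True if every surrogate code unit is part of a proper pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that is not preceded by a high surrogate.
            return False
        else:
            i += 1
    return True


print(is_well_formed_utf16([0x61, 0xD83D, 0xDE00]))  # 'a' followed by a surrogate pair
print(is_well_formed_utf16([0xD800]))                # lone high surrogate
```

A check like this concerns strings built at run time from arbitrary code units; requiring UTF-8 source files neither performs it nor makes it unnecessary.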

eernstg avatar Apr 08 '22 14:04 eernstg

I'm not opposed to specifying that we want valid UTF-8 as input to our compilers, I'm just not convinced it needs to be part of the language specification. I would have no problem if someone chose to implement the Dart language and took their source code as UTF-16 instead or, preferably, as well. It would still be Dart.

The Dart SDK contains a toolchain which requires UTF-8. That's fine too. Nobody can claim that it doesn't implement the Dart language either. And no-one can claim that the string literal "var x;" doesn't contain valid Dart source because it's not UTF-8 encoded (it's UTF-16, because that's what Dart strings are).

So, where do we document it? I honestly have very little idea where we put Dart documentation, since I rarely need to read it. The language specification is definitely not something people read either. There are a lot of things our tools do that are outside of the language proper, so where do we document that? Maybe we should write it where we document the command line arguments (https://dart.dev/tools/dart-tool), and say that a <DART_FILE> must be UTF-8 encoded.

lrhn avatar Apr 08 '22 19:04 lrhn

I don't know if this is very relevant, but I want to point out that Java does not require source code to be in one particular Unicode encoding.

I can encode a source file using UTF-16, and javac will fail unless I specify javac -encoding utf-16.

Example with a UTF-16 hello-world file:

▶ javac src/Hello.java 
src/Hello.java:1: error: unmappable character (0xFE) for encoding UTF-8
��class Hello {
^
src/Hello.java:1: error: unmappable character (0xFF) for encoding UTF-8
��class Hello {

Using -encoding utf-16:

▶ javac -encoding utf-16 src/Hello.java

<no error>

This is useful for some edge cases, like avoiding the conversion of very long strings from UTF-8 to UTF-16 (Java's internal String representation) when compiling... but I'm not sure how much difference that would make, as I've never measured it.
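The failure mode in the javac output above can be reproduced outside of Java. The following Python sketch (illustrative only; the file content is made up) builds a UTF-16 byte sequence with a big-endian byte order mark, shows that a strict UTF-8 decoder fails on the very first byte, and shows that decoding with the right encoding recovers the text:

```python
import codecs

text = "class Hello {}\n"

# UTF-16 big-endian with a byte order mark: the bytes 0xFE 0xFF come first.
utf16_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Decoding as UTF-8 fails immediately, because 0xFE is never valid in UTF-8.
try:
    utf16_bytes.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as err:
    utf8_error = err

print(utf8_error)

# Decoding as UTF-16 consumes the byte order mark and yields the original text.
round_trip_ok = utf16_bytes.decode("utf-16") == text
print(round_trip_ok)
```

This is why a tool that accepts more than one encoding needs some way to be told which one to use, such as a command line option.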

renatoathaydes avatar Aug 03 '22 17:08 renatoathaydes

I also think that the language specification is not a suitable place for fixing the character encoding of source code. I think it is OK for the language specification to specify the default encoding of source code, though. Only the tools which don't follow the default encoding would then need to specify the encodings they accept, in a suitable place of their own.

Cat-sushi avatar Oct 27 '22 05:10 Cat-sushi

Status as of Jan 2024: We do not have widespread support for specifying the source code encoding as part of the language, not even requiring that UTF-8 must be among the supported encodings.

However, there is no substantial pushback against the idea that tools would document which encodings they accept, and how to specify the choice of encoding if needed (cf. the -encoding utf-16 option that javac supports).

Closing: This means that the question of this issue pertains to tools, not to the Dart language as such.

eernstg avatar Jan 08 '24 11:01 eernstg

Thanks for clarifying @eernstg! That sounds like a good resolution to me :)

I've opened https://github.com/dart-lang/site-www/issues/5453 to track the work of documenting this for some dart tools at least on the website. If anyone has ideas of where or how, please add to the discussion there. Thanks!

parlough avatar Jan 08 '24 11:01 parlough

Sounds good, thanks!

eernstg avatar Jan 08 '24 12:01 eernstg