dmd Implement C23 identifiers via UAX31 (minus normalization)

Okay the situation currently is significantly worse than even I had known.

We are only doing is alpha checks and no differentiation between start/continue. We were NOT c99 compliant.

Also the legacy start ranges turn out to be quite big (439).

This work was already discussed with @WalterBright, and this PR does not make us C23 compliant as per Walter's agreement on approach. I have added deprecation comments for the logic that will go away eventually as per Walter's agreement for the approach (I set it to 2.110 and 2.120 but those are pretty random in choice, so if there are better ones please say).

The approach that was agreed upon is to remove the non-starter characters. In doing so you enforce normalization form C. Alternatively we could implement the quick check algorithm (or even normalization) and acquire C23 compliance. However both deprecation/removal and implementation of normalization can follow in another PR. This one does not break any code, only adds ranges.

Changelog/spec PR will come after feedback from Walter.

EDIT: This PR's approach has since changed to match the Unicode Annex 31 standard as specified by the C23 standard. It is a subset thereof which only includes the tables. Normalization is not handled here. That needs follow up PR's.

TLDR: This PR changes D's identifiers to be foundationally more stable and match other compilers by using the Unicode Annex 31 standard. It also makes ImportC more compliant with C11 standard while offering more configurability if needed on picking the Identifier tables.

Jun 11 '23 04:06 rikkimax

Thanks for your pull request and interest in making D better, @rikkimax! We are looking forward to reviewing it, and you should be hearing from a maintainer soon. Please verify that your PR follows this checklist:

My PR is fully covered with tests (you can see the coverage diff by visiting the details link of the codecov check)
My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
I have provided a detailed rationale explaining my changes
New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.

If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "master + dmd#15307"

Jun 11 '23 04:06 dlang-bot

Only ocean failing to CI, same error in another PR so not something I did.

One thing we may want to consider is supporting C11 and C99 ranges specifically as opt-in options for ImportC.

Jun 11 '23 06:06 rikkimax

I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

Until I've done this, this PR will break existing code. However, the rest of the code still needs a review @WalterBright.

EDIT: all of the doc.d stuff is wrong as well. However we may want a combined start/continue set of tables for this.

Jun 13 '23 06:06 rikkimax

I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

isValidMangling is not only there for linkers. Assemblers have special/reserved characters that should never go in a symbol name - for example ! $ - ; ? # | (all those characters are not valid C symbols either).

Jun 13 '23 07:06 ibuclaw

I've had some more time to think about this. isValidMangling needs to be removed. Linkers should not care about what characters go into a symbol name. But they do care about encoding (which we have problems with that need to be fixed).

isValidMangling is not only there for linkers. Assemblers have special/reserved characters that should never go in a symbol name - for example ! $ - ; ? # | (all those characters are not valid C symbols either).

Okay, we'll need a special table for this. Unless I'm told what the character ranges should be (ideally in the form of a UAX31 profile), I'll have to revert to the legacy table (which is a bad thing long term).

Jun 13 '23 08:06 rikkimax

Without it, you'd also be permitted to have \0 and other non-printable characters in a symbol name - yes, someone put pragma(mangle, "foo\0bar") in the testsuite and I complained about it breaking assemblers. :-)

Jun 13 '23 08:06 ibuclaw

Without it, you'd also be permitted to have \0 and other non-printable characters in a symbol name - yes, someone put pragma(mangle, "foo\0bar") in the testsuite and I complained about it breaking assemblers. :-)

Fair enough, I'm convinced. I just need to know what the profile is to implement (as that isn't related to D identifiers). See: https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers

Jun 13 '23 08:06 rikkimax

I had a quick peek at the universal character set tables in gcc, they only store the last part of the unicode character - I assume they must have some other routine that validates the value of the first part.

It looks like not all values seem to match what you've auto-generated here?

https://github.com/gcc-mirror/gcc/blob/a07dadba85f1b15e270c227dfa70e2fdf331494f/libcpp/ucnid.h#L55

Jun 13 '23 14:06 ibuclaw

NXX23 will be C23.

Second to last column will be ccc (0 is starter).

Last is character only.

The functions in it are for the quick check algorithm.

Only way for my tables to be fundamentally wrong is the not/intercept methods on ValueRanges. A lot of it is provided by Unicode.

On Wed, Jun 14, 2023, 02:59 Iain Buclaw @.***> wrote:

I had a quick peek at the universal character set tables in gcc, they only store the last part of the unicode character - I assume they must have some other routine that validates the value of the first part.

It looks like not all values seem to match what you've auto-generated here?

https://github.com/gcc-mirror/gcc/blob/a07dadba85f1b15e270c227dfa70e2fdf331494f/libcpp/ucnid.h#L55

— Reply to this email directly, view it on GitHub https://github.com/dlang/dmd/pull/15307#issuecomment-1589484149, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHSL47JFEWFEX36HKHMZGDXLB53JANCNFSM6AAAAAAZCC3F54 . You are receiving this because you authored the thread.Message ID: @.***>

Jun 13 '23 20:06 rikkimax

I've been thinking a bit about this PR for a couple of things.

Mangling checks should cover the ranges of: whitespace, punctuation, and control. Will need to allow previously legacy characters too.
I think I want to swap to using inversion lists + Fibonacci search. This should be faster, but I'll do it in another PR when I start doing some reading on search algorithms.

Jun 23 '23 17:06 rikkimax

@deadalnix has been on the ball for this and done a very good optimized table lookup based upon this PR for SDC.

https://github.com/snazzy-d/sdc/commit/199b36391f98fee3e30447a3ec397ff9a40c89f5

Jun 27 '23 19:06 rikkimax

@rikkimax this is good work. I can't help but think, though, what is the problem being solved? It's 3000 lines, do any of our users need it?

Jul 02 '23 05:07 WalterBright

Most of the code is in the tables and is 100% generated. They would be a whole lot smaller if I didn't have to split them up into starters/non-starters + legacy (which would be deprecated regardless). There is also alpha tables that'll be removed once I do the mangling/ddoc identifier implementation (yes they are different!).

My main concern is not just what we need now but in 10-20 years time (which this would easily cover), or the ranges for ImportC C99/C11 to be pickable (not in PR, but I have a plan if we want that). But there are some serious problems with our existing implementation. Like how the mangling checks are a whitelist that is the same as the D identifiers, so you can't even use a C23 identifier right now if it isn't in our legacy ranges that may not even be compliant C99!

Jul 02 '23 05:07 rikkimax

The existing implementation is pretty questionable from what I've seen, which is quite worrying. For example, it's possible to use a non-starter as a first character of an identifier. That may not even render depending on the text renderer, or if it does it'll be a box or even worse, merge into punctuation or whitespace.

Ignoring the normalization problems and not tracking the current standard; between the white list mangling, and non-starter start character for identifier, it needs work regardless of this PR.

Jul 02 '23 05:07 rikkimax

I'm a bit confused. I don't see how identifier characters affect name mangling or the linker at all. The mangled identifier names are prefixed with a count, the contents do not matter at all. Identifiers are in the object file as zero-terminated strings - that shouldn't affect the linker at all, either.

Jul 02 '23 19:07 WalterBright

I wish you were right.

https://github.com/dlang/dmd/blob/6e35f8552ef04ef089c074b9ef4bd2f2e9c15b7f/compiler/src/dmd/dmangle.d#L80

I would prefer to remove these checks entirely. But Iain wants a set of checks to remain cos of assemblers. So control + whitespace + punctuation would be black listed is my current plan.

Jul 02 '23 19:07 rikkimax

Check out where it's used. The first is in pragma(mangle, "string") doing a sanity check on the string, the second is in doGNUABITagSemantic() which I think is to check C++ name mangling. Neither of those need to limit the character set of mangled names or names put in the object file.

gdc and ldc may have limits on identifier character sets, but dmd does not.

Jul 03 '23 02:07 WalterBright

Dmd has a whitelist currently as you say. I wanted to remove them.

@ibuclaw brought the support receipts so he wants certain things like null terminators to be denied.

This one is between you two. My proposal is a blacklist with control characters + whitespace + punctuation, this will be a good solution long term.

You two should talk and come to a decision about what I do about it.

Jul 03 '23 03:07 rikkimax

Ok, given the recent policy adjustment for deprecations, I need to know if we will be continuing this approach or if we are switching to the quick check algorithm like GCC does? So that only problematic identifiers warn/error, instead of perfectly fine ones.

nothing should be deprecated unless there's a very compelling reason ("this feature was a mistake and people shouldn't be using it" is not a compelling reason)

@WalterBright

Jul 04 '23 19:07 rikkimax

I've been thinking about the cost of actually doing the quick check algorithm and I know how to do it with only the cost of one extra table and if the check occurs, it could be incredibly cheap!

First, Unicode characters only use 29 bits of a dchar, which leaves 3 we can use. The bottom-most bit can be used to represent the yes/no/maybe value. At runtime this should only require a single left shift at the start of the search function.

Next we need the CCC value for a given character, since we only store what is in our identifier ranges this should hopefully allow a lot of savings in ROM space with a multi-layer trie.

There would need to be two sets of logic, the first happens in the lookup function for start/continue. A simple comparison for the ccc value and the last one, and the yes/no/maybe value being set with an or.

In the caller, there would be two additional variables (passed by ref). Is not normalized, and the last ccc value. From there the cli arg to pick silent accept/warn/error behavior could occur.

What this means is we don't have to get rid of the non-starters and risk breaking code. It's entirely possible to keep them without slowing things down too much! For example, a single character ASCII identifier should only need to check the variable for if not normalized. If multi-characters, it only does the logic if it succeeds.

For reference the Java algorithm, minus the UTF-16 specific stuff:

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

public static final int NO = 0, YES = 1, MAYBE = -1;

Jul 18 '23 16:07 rikkimax

Total non-issue, that's handled elsewhere!

https://github.com/dlang/dmd/blob/6e35f8552ef04ef089c074b9ef4bd2f2e9c15b7f/compiler/src/dmd/lexer.d#L564

https://github.com/dlang/dmd/blob/6e35f8552ef04ef089c074b9ef4bd2f2e9c15b7f/compiler/src/dmd/lexer.d#L3120

Jan 16 '24 20:01 rikkimax

I'll likely be creating a new branch locally and starting again on some of the changes here.

Limiting to just starters had different concerns than I have now got to deal with.

Jan 16 '24 20:01 rikkimax

@WalterBright Ok, time for some decisions & review.

I made isidchar in lexer public which I am not happy about. That table could be split out, or Identifier.isValidIdentifier rewritten.
There is no way to select the tables in the lexer. It's all UAX31 only. Everywhere else I went very relaxed and did the least restrictive set that includes both UAX31 and C99 tables. I'm thinking that we want to pick which table we want. That way ImportC can support multiple versions of C. There is a picker struct ready to go, I just didn't wire it into lexer.

Normalization not done, quick check algorithm not done, tables are still pairs with binary search.

We should also consider editions. Perhaps keep c99 as default for edition 0 (current), then in 1 turn on UAX31 for D. Switch ImportC to c11 by default now.

Jan 17 '24 17:01 rikkimax

@WalterBright is your email working again? I need decisions to be made!

Oh and I had to add a merge function (because of course I was silly and used intersection), so the generator now takes a while to run. But CI was green prior to it.

Jan 19 '24 12:01 rikkimax

@rikkimax

There is no way to select the tables in the lexer.

Not necessary. Just prefix them with a test for ascii and if true skip the unicode call. ascii must be the fast path.

I made isidchar in lexer public which I am not happy about. That table could be split out, or Identifier.isValidIdentifier rewritten.

Same thing about prefixing with ascii test.

Jan 19 '24 18:01 WalterBright

@rikkimax

There is no way to select the tables in the lexer.

Not necessary. Just prefix them with a test for ascii and if true skip the unicode call. ascii must be the fast path.

I made isidchar in lexer public which I am not happy about. That table could be split out, or Identifier.isValidIdentifier rewritten.

Same thing about prefixing with ascii test.

That is done.

What I am asking about is supporting C99, C11, UAX31, least restrictive (we have that for things like ddoc) tables all at once, and picking which one you need.

This also helps with ImportC so it could be fully compliant with a specific C version.

It also means that if UAX31 ends up breaking peoples code they can pick the C99 tables to keep things compiling.

Jan 19 '24 19:01 rikkimax

Basically, I need a decision like: UAX31 only, single C version for ImportC, or here is how we'll pick which table to use.

UAX31 doesn't have all C99 characters, so it's breakage.

The choice to potentially break, should come from you.

Jan 19 '24 19:01 rikkimax

I suggest it be a runtime switch, controlled by a bool in CompileEnv. In order to not break existing code, a -preview=uax31 should do the trick.

Jan 21 '24 17:01 WalterBright

I suggest it be a runtime switch, controlled by a bool in CompileEnv. In order to not break existing code, a -preview=uax31 should do the trick.

I ended up implementing it as: https://github.com/dlang/dmd/pull/15307/files#diff-7889d62d20faf1815f3d4eb0d9835759311f81e719ea3eeaf13e1165b7bdfe2d

Reversible and we'll be able to switch over permanently in like 10 releases. Although we can up it, but I don't think that is required.

Jan 21 '24 17:01 rikkimax

why penalize people for not hitting the shift key?

Please do not do this. The lower case ones will be fine. All the other switches are case-sensitive.

Jan 21 '24 18:01 WalterBright

dmd dmd copied to clipboard

Implement C23 identifiers via UAX31 (minus normalization)

Bugzilla references

Testing this PR locally

dmd
dmd copied to clipboard