tree-sitter-php icon indicating copy to clipboard operation
tree-sitter-php copied to clipboard

Support any chars PHP accept for names and labels

Open cfroystad opened this issue 3 years ago • 2 comments
trafficstars

Currently, we only support (extended?) ASCII for names and labels in accordance with the PHP documentation. However, this is not consistent with how PHP actually works.

This entails two tings:

  1. Investigate what's the true definition of a legal name/label in PHP (a PR to update the official documentation will probably be appreciated)
  2. Modify the method is_valid_name_char in src/scanner.cc to account for the identified rules
  3. Modify the name rule in grammar.js to account for the same rules

cfroystad avatar Jun 09 '22 09:06 cfroystad

Should it be?

[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]

I just encountered with a something like

$\xf9\xdf=strlen("aaa");

which will be parsed as (function_call_expression function: (variable_name)).

rever1and avatar Sep 28 '22 11:09 rever1and

To workaround tree-sitter's utf-8/utf-16 encoding, I just tweak the tree-sitter's lexer to read raw bytes. not sure if it is the right way..

rever1and avatar Oct 12 '22 11:10 rever1and

@cfroystad, @amaanq,

I've been digging into this and here are my findings, the BLUF is that I don't think we need to support every char that PHP accepts as a valid identifier.

Valid PHP Labels

The PHP scanner declares that a valid LABEL is [a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]* which is consistent with what is currently listed in the docs:

  • https://github.com/php/php-src/blob/b06fedb41d560f756dfb5d964b0d36a224e44dc7/Zend/zend_language_scanner.l#L1365
  • https://www.php.net/manual/en/language.variables.basics.php
  • https://github.com/php/php-src/blob/b06fedb41d560f756dfb5d964b0d36a224e44dc7/Zend/zend_language_scanner.l#L2345-L2347

Discrepancies

From what I've been reading, PHP accepting variable names like the following (taken from #171):

<?php
$漢字;

as valid is more of a bug rather than a feature. From my (limited) understanding, this is occurring because PHP isn't handling UTF-8 properly and is matching the individual UTF-8 bytes against the range \x80-\xff.

  • https://stackoverflow.com/questions/5358824/unicode-identifiers-function-names-for-non-localization-purposes-advisable
  • https://stackoverflow.com/questions/3417180/exotic-names-for-methods-constants-variables-and-fields-bug-or-feature

Path Forward

It would be easy enough to update the regex to allow any character in the UTF-8 range:

-      name: _ => /[_a-zA-Z\u00A1-\u00ff][_a-zA-Z\u00A1-\u00ff\d]*/,
+      name: _ => /[_a-zA-Z\u0080-\uffff][_a-zA-Z\u0080-\uffff\d]*/,

However, doing so greatly increases the size of the parser and the large range is not efficient:

image

Given that: a) any character > \xff is technically illegal in PHP, and b) the parser large growth by trying to be all-inclusive, I'd say that we should stick to the legal definition.

We currently use a different regex, so if anything we should just update it to be consistent:

-      name: _ => /[_a-zA-Z\u00A1-\u00ff][_a-zA-Z\u00A1-\u00ff\d]*/,
+      name: _ => /[_a-zA-Z\u0080-\u00ff][_a-zA-Z\u0080-\u00ff\d]*/,

calebdw avatar Jan 26 '24 22:01 calebdw

Hey that's an interesting side effect/bug potentially. I tried random values from 0x80 to 0xa0, it only reduces after 0xa0 (a1) which is interesting. I'll dig more in core to see why that is and if it is potentially a bug

amaanq avatar Jan 26 '24 22:01 amaanq

Found it, using 0xA0 is tricky because it's also used as whitespace - which is valid..anywhere due to it being in the extras.

Sidestepping this like so keeps the state count low: name: _ => /[_a-zA-Z\u009f\u00A1-\u00ff][_a-zA-Z\u009f\u00A1-\u00ff\d]*/,

@calebdw 0x85 is also considered whitespace, but using it doesn't blow up the state count, because it wasn't explicitly added in the extras rule. We had a talk about this a while ago about whether or not to add more whitespace character here: https://github.com/tree-sitter/tree-sitter/pull/2428, and in the end we opted to add VT and FF only

Removing the explicit mention of \u00A0 fixes this blow up. So, what we can do is one of two things.

Sidestep A0 entirely in the name rule and use ranges below and above that or Remove A0 from the extras

Up to you

amaanq avatar Jan 26 '24 23:01 amaanq

Ah, very interesting. But I guess this raises several questions:

  • does \u00a0 need to be in the extras and what could be some side-effects of removing it?
  • do you think we should support the full utf-8 range \uffff or just stop at \u00ff?

calebdw avatar Jan 26 '24 23:01 calebdw

it's technically considered a whitespace character, I'll leave keeping it in extras up to you, I know almost 0 PHP

as for the range stuff, I think it'd be best to stick to what's official - if there's a discrepancy in the docs and reality because of php not handling utf-8 properly then I would go with \uffff for a better UX for downstream in these rare cases. I don't think performance will be hurt all that much, since ranges are just a simple cmp operation when compiled down (if it's not half-open like it could be for the A0 case)

Note that 200B, 2060, and FEFF are used for whitespace as well, so I'd sidestep around these too

amaanq avatar Jan 27 '24 00:01 amaanq