tree-sitter-php
tree-sitter-php copied to clipboard
Support any chars PHP accept for names and labels
Currently, we only support (extended?) ASCII for names and labels in accordance with the PHP documentation. However, this is not consistent with how PHP actually works.
This entails two tings:
- Investigate what's the true definition of a legal name/label in PHP (a PR to update the official documentation will probably be appreciated)
- Modify the method
is_valid_name_charin src/scanner.cc to account for the identified rules - Modify the
namerule in grammar.js to account for the same rules
Should it be?
[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]
I just encountered with a something like
$\xf9\xdf=strlen("aaa");
which will be parsed as (function_call_expression function: (variable_name)).
To workaround tree-sitter's utf-8/utf-16 encoding, I just tweak the tree-sitter's lexer to read raw bytes. not sure if it is the right way..
@cfroystad, @amaanq,
I've been digging into this and here are my findings, the BLUF is that I don't think we need to support every char that PHP accepts as a valid identifier.
Valid PHP Labels
The PHP scanner declares that a valid LABEL is [a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]* which is consistent with what is currently listed in the docs:
- https://github.com/php/php-src/blob/b06fedb41d560f756dfb5d964b0d36a224e44dc7/Zend/zend_language_scanner.l#L1365
- https://www.php.net/manual/en/language.variables.basics.php
- https://github.com/php/php-src/blob/b06fedb41d560f756dfb5d964b0d36a224e44dc7/Zend/zend_language_scanner.l#L2345-L2347
Discrepancies
From what I've been reading, PHP accepting variable names like the following (taken from #171):
<?php
$漢字;
as valid is more of a bug rather than a feature. From my (limited) understanding, this is occurring because PHP isn't handling UTF-8 properly and is matching the individual UTF-8 bytes against the range \x80-\xff.
- https://stackoverflow.com/questions/5358824/unicode-identifiers-function-names-for-non-localization-purposes-advisable
- https://stackoverflow.com/questions/3417180/exotic-names-for-methods-constants-variables-and-fields-bug-or-feature
Path Forward
It would be easy enough to update the regex to allow any character in the UTF-8 range:
- name: _ => /[_a-zA-Z\u00A1-\u00ff][_a-zA-Z\u00A1-\u00ff\d]*/,
+ name: _ => /[_a-zA-Z\u0080-\uffff][_a-zA-Z\u0080-\uffff\d]*/,
However, doing so greatly increases the size of the parser and the large range is not efficient:
Given that: a) any character > \xff is technically illegal in PHP, and b) the parser large growth by trying to be all-inclusive, I'd say that we should stick to the legal definition.
We currently use a different regex, so if anything we should just update it to be consistent:
- name: _ => /[_a-zA-Z\u00A1-\u00ff][_a-zA-Z\u00A1-\u00ff\d]*/,
+ name: _ => /[_a-zA-Z\u0080-\u00ff][_a-zA-Z\u0080-\u00ff\d]*/,
Hey that's an interesting side effect/bug potentially. I tried random values from 0x80 to 0xa0, it only reduces after 0xa0 (a1) which is interesting. I'll dig more in core to see why that is and if it is potentially a bug
Found it, using 0xA0 is tricky because it's also used as whitespace - which is valid..anywhere due to it being in the extras.
Sidestepping this like so keeps the state count low:
name: _ => /[_a-zA-Z\u009f\u00A1-\u00ff][_a-zA-Z\u009f\u00A1-\u00ff\d]*/,
@calebdw 0x85 is also considered whitespace, but using it doesn't blow up the state count, because it wasn't explicitly added in the extras rule. We had a talk about this a while ago about whether or not to add more whitespace character here: https://github.com/tree-sitter/tree-sitter/pull/2428, and in the end we opted to add VT and FF only
Removing the explicit mention of \u00A0 fixes this blow up. So, what we can do is one of two things.
Sidestep A0 entirely in the name rule and use ranges below and above that or Remove A0 from the extras
Up to you
Ah, very interesting. But I guess this raises several questions:
- does
\u00a0need to be in the extras and what could be some side-effects of removing it? - do you think we should support the full utf-8 range
\uffffor just stop at\u00ff?
it's technically considered a whitespace character, I'll leave keeping it in extras up to you, I know almost 0 PHP
as for the range stuff, I think it'd be best to stick to what's official - if there's a discrepancy in the docs and reality because of php not handling utf-8 properly then I would go with \uffff for a better UX for downstream in these rare cases. I don't think performance will be hurt all that much, since ranges are just a simple cmp operation when compiled down (if it's not half-open like it could be for the A0 case)
Note that 200B, 2060, and FEFF are used for whitespace as well, so I'd sidestep around these too