PHP-Parser

Hex chars greater than \x7f silently abort parsing

Open nicolasrod opened this issue 6 years ago • 8 comments

Hi. I ran into an issue with hex escapes in a double-quoted string. If I have a piece of code like the following:

<?php 
$a = "\x6f";

I get as a result the following:

[{"nodeType":"Stmt_Expression","expr":{"nodeType":"Expr_Assign","var":{"nodeType":"Expr_Variable","name":"a","attributes":{"startLine":2,"endLine":2}},"expr":{"nodeType":"Scalar_String","value":"o","attributes":{"startLine":2,"endLine":2,"kind":2}},"attributes":{"startLine":2,"endLine":2}},"attributes":{"startLine":2,"endLine":2}}]

But if the variable holds a value greater than \x7f, I get an empty array as a result and no error. Any ideas? Thank you!

nicolasrod avatar Nov 24 '18 03:11 nicolasrod

The problem here is probably in the JSON encoding. JSON only allows valid UTF-8 in strings, and a lone byte greater than \x7f (such as \x80) is not a valid UTF-8 sequence.
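A minimal sketch of that failure mode, assuming the AST is serialized with PHP's built-in `json_encode()`. The variable names are illustrative only. By default the call returns `false` without raising anything, which would explain the missing error; `JSON_THROW_ON_ERROR` (PHP 7.3+) makes the failure visible.

```php
<?php
$valid   = "\x6f";   // "o", plain ASCII
$invalid = "\x80";   // a lone byte above \x7f, not valid UTF-8

$okJson  = json_encode($valid);     // a JSON string containing "o"
$badJson = json_encode($invalid);   // false; no warning or exception by default

// The reason is only available if the caller asks for it.
$errorCode = json_last_error();     // JSON_ERROR_UTF8
$errorText = json_last_error_msg(); // human-readable description

// PHP 7.3+: request an exception instead of a silent false.
$threw = false;
try {
    json_encode($invalid, JSON_THROW_ON_ERROR);
} catch (JsonException $e) {
    $threw = true;
}
```

Checking the return value against `false`, or passing `JSON_THROW_ON_ERROR`, would at least turn the silent failure into a reported one.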

nikic avatar Nov 24 '18 10:11 nikic

@nikic I don't understand your answer here. A string in PHP is an array of bytes, so any valid byte values are allowed. The problem is that you're representing it as a string in JSON, instead of as an array of numbers.

performantdata avatar Apr 25 '19 22:04 performantdata

Any update on this issue?

tiyeuse avatar May 24 '19 12:05 tiyeuse

Nope. Any suggestions on what to do about this?

nikic avatar May 24 '19 12:05 nikic

Before converting the AST to JSON, iterate through all nodes and encode any value containing an invalid UTF-8 string using base64_encode.

zhaoyanliang2 avatar May 24 '19 12:05 zhaoyanliang2

> Any suggestions on what to do about this?

> The problem is that you're representing it as a string in JSON, instead of as an array of numbers.

So represent it as that. A PHP string is not an array of Unicode characters, it's just an array of bytes.

> This nature of the string type explains why there is no separate “byte” type in PHP – strings take this role.

So stop trying to convert an arbitrary sequence of bytes into UTF-8.
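As a rough sketch of the array-of-numbers representation, assuming plain PHP with no PHP-Parser involvement: `bytesToInts()` and `intsToBytes()` are hypothetical helper names, not part of any library.

```php
<?php
// Hypothetical helpers, not part of PHP-Parser.

function bytesToInts(string $bytes): array
{
    // unpack('C*') returns a 1-indexed array of unsigned byte values;
    // array_values() reindexes from 0 so json_encode() emits a JSON array.
    return $bytes === '' ? [] : array_values(unpack('C*', $bytes));
}

function intsToBytes(array $ints): string
{
    // pack('C*') turns each integer back into a single byte.
    return $ints === [] ? '' : pack('C*', ...$ints);
}

$original = "\x6f\x80\xff";                        // includes bytes above \x7f
$json     = json_encode(bytesToInts($original));   // [111,128,255]
$restored = intsToBytes(json_decode($json, true)); // same bytes as $original
```

The round trip is lossless for any byte sequence, at the cost of a more verbose JSON output than a plain string.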

performantdata avatar May 24 '19 18:05 performantdata

> Before converting the AST to JSON, iterate through all nodes and encode any value containing an invalid UTF-8 string using base64_encode.

That sounds reasonable. We can add two extra visitors for encoding/decoding all strings in base64. It's unfortunate that this is necessary, but I don't really see a way around it.
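A sketch of what such an encode/decode pair could look like. To stay self-contained it walks a plain nested array standing in for the AST rather than using PHP-Parser's visitor API. It also encodes only strings that are not valid UTF-8, where the suggestion above encodes all of them. The function names and the `['base64' => ...]` wrapper are assumptions made for illustration; a real implementation would need a marker that cannot collide with genuine node data.

```php
<?php
// Hypothetical helpers working on a plain array, not on PHP-Parser nodes.

function isValidUtf8(string $s): bool
{
    // With the /u modifier, PCRE rejects subjects that are not valid UTF-8.
    return preg_match('//u', $s) === 1;
}

function encodeBinaryStrings(array $tree): array
{
    foreach ($tree as $key => $value) {
        if (is_array($value)) {
            $tree[$key] = encodeBinaryStrings($value);
        } elseif (is_string($value) && !isValidUtf8($value)) {
            // Wrap the value so the decoder can tell it was encoded.
            $tree[$key] = ['base64' => base64_encode($value)];
        }
    }
    return $tree;
}

function decodeBinaryStrings(array $tree): array
{
    foreach ($tree as $key => $value) {
        if (!is_array($value)) {
            continue;
        }
        if (array_keys($value) === ['base64'] && is_string($value['base64'])) {
            $tree[$key] = base64_decode($value['base64']);
        } else {
            $tree[$key] = decodeBinaryStrings($value);
        }
    }
    return $tree;
}

$ast       = [['nodeType' => 'Scalar_String', 'value' => "\x80\xff"]];
$json      = json_encode(encodeBinaryStrings($ast));          // now succeeds
$roundTrip = decodeBinaryStrings(json_decode($json, true));   // equals $ast
```

Encoding only the invalid strings keeps ordinary values readable in the JSON output, while the wrapper lets the decoding pass restore the original bytes.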

nikic avatar May 24 '19 21:05 nikic

Bump on this error :smiley: Will a fix be deployed?

tiyeuse avatar Jul 23 '19 13:07 tiyeuse