Escapes in unified string literals

Open • arigo opened this issue 5 years ago • 1 comment

C_Parser.p_unified_string_literal() shows a potential issue: in C, the two string literals "\123" and "\1" "23" are different. The first is a single character, S (octal 123); the second is three characters, the first of which has ord == 1. But pycparser reports the same value for both cases, namely the 6-character string "\123" (quotes included). This happens because p_unified_string_literal() concatenates the unprocessed strings and removes the two " characters in the middle, so a program using pycparser can no longer distinguish the two cases.
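
A minimal sketch of the effect, without importing pycparser: the helper `naive_unify` below is a hypothetical stand-in that only mimics the concatenation described above (drop the closing quote of the left literal and the opening quote of the right one). It is not pycparser's actual code.

```python
def naive_unify(literals):
    """Join adjacent C string literal tokens by dropping the inner quotes.

    Hypothetical stand-in for the behaviour described in this issue;
    each item is the raw token text, quotes included.
    """
    result = literals[0]
    for lit in literals[1:]:
        # Remove the closing quote on the left and the opening quote on the right.
        result = result[:-1] + lit[1:]
    return result


single = naive_unify([r'"\123"'])        # C source: "\123"
split = naive_unify([r'"\1"', r'"23"'])  # C source: "\1" "23"

# Both come out as the same unprocessed text, so the split point is lost.
print(single, split, single == split)
```

Once the inner quotes are gone, nothing in the resulting text says where `\1` ended and `23` began.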

I'm not sure how this problem should be fixed. Maybe one way is for pycparser to actually process the strings (e.g. in a new attribute of the Constant object), which can then meaningfully be concatenated in all cases. Another way would be to add a list of strings on the Constant, to remember the original divisions.
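
The first idea (process each literal, then concatenate) could look roughly like the sketch below. `decode_piece` and `decode_pieces` are hypothetical helpers, not part of pycparser, and the decoder is deliberately simplified: it assumes unprefixed literals and handles only octal, hex, and single-character escapes, ignoring universal character names and any question of target encoding.

```python
import re

# Octal (1-3 digits), hex, or any single escaped character.
_ESCAPE = re.compile(r'\\(?:([0-7]{1,3})|x([0-9a-fA-F]+)|(.))')
_SIMPLE = {'a': '\a', 'b': '\b', 'f': '\f', 'n': '\n',
           'r': '\r', 't': '\t', 'v': '\v'}


def decode_piece(literal):
    """Decode the escapes of one quoted literal (simplified, see lead-in)."""
    def repl(match):
        octal, hexa, simple = match.groups()
        if octal is not None:
            return chr(int(octal, 8))
        if hexa is not None:
            return chr(int(hexa, 16))
        # Escapes such as \\ or \" map to the character itself.
        return _SIMPLE.get(simple, simple)

    return _ESCAPE.sub(repl, literal[1:-1])  # strip the surrounding quotes


def decode_pieces(literals):
    """Decode each piece separately, then concatenate the results."""
    return ''.join(decode_piece(lit) for lit in literals)


one_char = decode_pieces([r'"\123"'])           # the single character 'S'
three_chars = decode_pieces([r'"\1"', r'"23"'])  # chr(1) followed by '2', '3'
```

Because every escape is resolved before the pieces are joined, an escape can never absorb digits from the next literal.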

Sep 30 '20 12:09 arigo

Interesting report, thank you.

I agree this is an issue; I found it mentioned in section 6.4.5 of the C11 spec.

Processing the strings seems undesirable to me, since pycparser tries not to do that in general (at the very least, it avoids dealing with potential string encodings). Keeping a list of strings on the Constant is probably a safer option. Both options are "breaking changes" in a way, unfortunately.
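
As a rough sketch of the list-of-strings option: the `StringConstant` class and its `pieces` attribute below are hypothetical names used only for illustration. They are not pycparser's `Constant` node and not an existing attribute.

```python
from dataclasses import dataclass, field


@dataclass
class StringConstant:
    """Hypothetical stand-in for a string Constant node."""
    value: str                                   # unified text, as reported today
    pieces: list = field(default_factory=list)   # original literals, unprocessed


def unify(literals):
    """Build the unified value while remembering the original divisions."""
    value = literals[0]
    for lit in literals[1:]:
        value = value[:-1] + lit[1:]  # same quote-dropping concatenation
    return StringConstant(value=value, pieces=list(literals))


a = unify([r'"\123"'])
b = unify([r'"\1"', r'"23"'])

# The unified text is still identical, but the pieces tell the cases apart.
print(a.value == b.value, a.pieces, b.pieces)
```

Nothing is decoded here, so no encoding decisions are made; a consumer that cares about escapes can process each entry of `pieces` on its own.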

Sep 30 '20 13:09 eliben