pycparser Constant string concatenation

The following valid C99 code

char test()
{
  char* tmp = "\07""7";
  return tmp[0];
}

is parsed incorrectly: pycparser returns a c_ast.Constant object with value '\077', which is wrong. The same goes for hexadecimal escapes.

The easy solution is to modify CParser.p_unified_string_literal by replacing

p[1].value = p[1].value[:-1] + p[2][1:]

with

p[1].value = p[1].value + p[2]

since simply removing the double quotes is not a good idea. The modification would return a value of '\07""7', which is better but still needs to be parsed to get each character.
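A rough Python sketch of the two behaviours, and of what a consumer would then have to do with the quoted form. The split_literals helper and its regular expression are made up for illustration and are not part of pycparser; the sketch assumes plain (non-wide) literals only.

```python
import re

# Hypothetical helper, not part of pycparser: one literal is an opening quote,
# then any run of escaped characters or non-quote characters, then a closing quote.
_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')

def split_literals(value):
    """Split a value that kept its quotes back into the adjacent literals."""
    return _LITERAL.findall(value)

# Raw token text of the two adjacent literals, quotes included.
first, second = r'"\07"', r'"7"'

spliced = first[:-1] + second[1:]   # current behaviour: inner quotes removed
kept = first + second               # proposed behaviour: inner quotes kept

print(spliced)               # "\077"
print(kept)                  # "\07""7"
print(split_literals(kept))  # ['"\\07"', '"7"']
```

With the quotes kept, the boundary between the two literals survives, so the escape \07 can no longer absorb the following 7.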

Another solution would be to store a list of strings as the value, but that would have far more impact on other parts of the code (such as the generator).

Jul 02 '24 16:07 Llewyllen

I don't understand the issue. The following C program prints abcxyz, according to the standard:

#include <stdio.h>

int main() {
  char* str = "abc""xyz";
  printf("%s\n", str);
  return 0;
}

Can you clarify what pycparser is doing wrong, in your opinion?

Jul 03 '24 03:07 eliben

For octal, "\07""7" is a 3-byte string composed of 0x07 (octal value 7), 0x37 (character '7') and 0x00 (string end), whereas "\077" is a 2-byte string composed of 0x3F (octal value 77) and 0x00.

For hexadecimal, "\x7""7" is a 3-byte string composed of 0x07, 0x37 and 0x00, whereas "\x77" is a 2-byte string composed of 0x77 and 0x00.

So if you simply remove consecutive double quotes (which is what pycparser does), you get the wrong value.
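The byte values above can be checked with a small Python sketch that decodes each literal on its own before joining the results. decode_literal is a hypothetical helper, not a pycparser function; it assumes plain char literals, handles only simple, octal and hexadecimal escapes, and masks out-of-range values to one byte as a simplification.

```python
import re

# Simple escapes from C99 6.4.4.4, mapped to their byte values.
_SIMPLE = {'a': 7, 'b': 8, 'f': 12, 'n': 10, 'r': 13, 't': 9, 'v': 11,
           '\\': 92, "'": 39, '"': 34, '?': 63}

# Octal: 1 to 3 digits. Hex: as many hex digits as follow. Else: one character.
_ESCAPE = re.compile(r'\\(?:([0-7]{1,3})|x([0-9a-fA-F]+)|(.))', re.DOTALL)

def decode_literal(token):
    """Decode one quoted C string literal to bytes (no trailing NUL)."""
    body = token[1:-1]                     # drop the surrounding quotes
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(body):
        out += body[pos:m.start()].encode('latin-1')
        octal, hexa, simple = m.groups()
        if octal is not None:
            out.append(int(octal, 8) & 0xFF)   # simplification: mask to a byte
        elif hexa is not None:
            out.append(int(hexa, 16) & 0xFF)   # simplification: mask to a byte
        else:
            out.append(_SIMPLE[simple])
        pos = m.end()
    out += body[pos:].encode('latin-1')
    return bytes(out)

# Decode first, then concatenate: the escape cannot swallow the next literal.
print((decode_literal(r'"\07"') + decode_literal(r'"7"')).hex())   # 0737
print(decode_literal(r'"\077"').hex())                             # 3f
print((decode_literal(r'"\x7"') + decode_literal(r'"7"')).hex())   # 0737
print(decode_literal(r'"\x77"').hex())                             # 77
```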

char test1()
{
  char* tmp = "\07""7";
  return tmp[0];
}

char test2()
{
  char* tmp = "\077";
  return tmp[0];
}

These two functions do not return the same value. The first returns 0x07, the second returns 0x3F.

Jul 03 '24 08:07 Llewyllen

Ah, so it's specific to octal and hex, then... PR to fix welcome, though it has to handle all cases of string literal concatenation properly

Jul 03 '24 12:07 eliben

As I said, there are not that many solutions:

keep a list of strings, but that has an impact on other parts of pycparser and might affect people using pycparser
keep the double quotes, but that might also affect people using pycparser

so I won't do a PR, as there is no ideal solution.

Well, I did create a PR, not sure it will pass the tests (but it works for my needs)

From what I saw, it will not pass the test_unified_string_literals test, but then, this test is rather wrong because string concatenation is not as simple as removing consecutive double quotes.

I could add the test

d7 = self.get_decl_init(r'char* s = "\07" "7";')
self.assertNotEqual(d7, ['Constant', 'string', r'"\077"'])

and the current version would fail

I just saw that p_unified_wstring_literal has the same problem, but I won't put my hand in the widechar trap

Jul 03 '24 12:07 Llewyllen

This seems like it would also apply to all escape sequences.

According to §5.1.1.2 — Translation phases:

Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

Adjacent string literal tokens are concatenated

Also §6.4.5 — String literals:

7 EXAMPLE This pair of adjacent character string literals "\x12" "3" produces a single character string literal containing the two characters whose values are '\x12' and '3', because escape sequences are converted into single members of the execution character set just prior to adjacent string literal concatenation.

The standard avoids exactly this issue by ensuring that escape sequences in literals are converted before adjacent string literals are concatenated. So, for the sake of correctness, the parser should stop concatenating string literals unless it also converts the escape sequences first.

Nov 14 '24 05:11 graypinkfurball