
Hex, octal and binary constants should not be unsigned by default

Open chqrlie opened this issue 6 months ago • 9 comments

The current semantics for integer constants are a small departure from C: binary, octal and hexadecimal constants are implicitly unsigned. This is not true in C, except when the constant lies between 0x80000000 and 0xffffffff, or in the equivalent 64-bit range.

With the current C3 semantics, -0x1 is positive but would be negative in C. If C3 is meant to adhere to the C semantics for integer constants, this should be fixed.

Test case:

// These pass OK
$assert(-0x80000000 > 0);
$assert(-0x80000000 == 0x80000000);
$assert(-0x8000000000000000 > 0);
$assert(-0x8000000000000000 == 0x8000000000000000);
$assert($sizeof(0x7fffffff) == 4);
$assert($sizeof(0x80000000) == 4);
$assert($sizeof(2147483648) == 8);
// These don't:
$assert(-0x1 < 0);
$assert(-01 < 0);
$assert(-0b1 < 0);

These C semantics are quite confusing. I suggest that negating a uint or a ulong should generate an error, or at least a warning.
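For comparison, the C side of these assertions can be checked with C11 `_Generic`. This is only a sketch: it assumes a 32-bit `int`, and the `TYPE_TAG` macro and function names are made up for illustration.

```c
#include <assert.h>

/* This sketch assumes a 32-bit int, as on mainstream desktop targets. */
static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");

/* Hypothetical helper: map an expression to a one-letter tag naming its type. */
#define TYPE_TAG(e) _Generic((e),            \
    int: 'i', unsigned int: 'u',             \
    long: 'l', unsigned long: 'L',           \
    long long: 'q', unsigned long long: 'Q', \
    default: '?')

/* 0x7fffffff fits in int, so it stays signed. */
char tag_hex_low(void)  { return TYPE_TAG(0x7fffffff); }

/* 0x80000000 does not fit in int but fits in unsigned int. */
char tag_hex_high(void) { return TYPE_TAG(0x80000000); }

/* 0x1 is an int in C, so its negation is negative. */
int neg_hex_one_is_negative(void) { return -0x1 < 0; }

/* Negating the unsigned 0x80000000 wraps around to the same value. */
int neg_hex_high_is_unchanged(void) { return -0x80000000 == 0x80000000; }
```

In C the first function reports `int` and the second `unsigned int`, which is the value-dependent rule the issue refers to.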

chqrlie avatar Jun 01 '25 21:06 chqrlie

My proposed change would do the following:

  1. Bake the - so that the check is on the actual value.
  2. Negative numbers are always signed.
  3. Hex/oct/binary will use the number of characters to determine the minimum type. For example, a 16-character hex will be at least long/ulong.
  4. Negating an explicitly unsigned literal is an error, e.g. -1U
  5. Hex/oct/binary is unsigned by default, but signed if - or a i suffix.
  6. Dec is signed by default
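Item 3 could be modelled roughly as follows. This is a hypothetical sketch in C, not c3c's implementation: the function name, the 8-digit threshold for the default 32-bit type, and the 128-bit fallback are all assumptions, and suffixes and digit separators are ignored.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of proposal item 3; NOT taken from the c3c sources.
 * Precondition: lit starts with "0x" or "0X" and contains only hex digits
 * after that (no suffix, no separators). */
int min_bits_for_hex_literal(const char *lit) {
    size_t digits = strlen(lit) - 2;   /* count the characters after "0x" */
    if (digits <= 8)  return 32;       /* assumption: default int/uint width */
    if (digits <= 16) return 64;       /* long/ulong, as in the proposal's example */
    return 128;                        /* assumption: anything wider needs 128 bits */
}
```

Under this model the width depends on how the literal is written rather than on its value, so leading zeros count.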

lerno avatar Jun 04 '25 18:06 lerno

Any opinion on this @chqrlie

lerno avatar Jun 06 '25 13:06 lerno

Ping

lerno avatar Jun 08 '25 22:06 lerno

I'll close it then.

lerno avatar Jun 12 '25 00:06 lerno

My proposed change would do the following:

Sorry about the lag, I did not get notified for this proposal

  1. Bake the - so that the check is on the actual value.

I am not sure what you mean by Bake the -... If you mean that -1 becomes an integer literal instead of an expression, I don't like it and I think it will create problems in macros and templates.

  2. Negative numbers are always signed.

There are no negative numbers, there are signed types that have a negative value. Expressions have a type that is either signed or unsigned. Expressions involving literals should behave the same as the same expressions with named constants or variables.

The subtle questions are:

  • what is the type of the subtraction of 2 unsigned types?
  • what is the type of the negation of an unsigned type?

  3. Hex/oct/binary will use the number of characters to determine the minimum type. For example, a 16-character hex will be at least long/ulong.

This is questionable: would 0x000000000 be a ulong then?

  4. Negating an explicitly unsigned literal is an error, e.g. -1U

I tend to agree on this one. More generally, negating an unsigned expression should at least generate a warning, possibly an error.

  5. Hex/oct/binary is unsigned by default, but signed if - or a i suffix.

This rule is too subtle for most programmers and does not fix the problem: would i + 0xFF still become unsigned? This is even less intuitive than the C rule (hex/oct/binary constants are signed by default unless their value falls in one of the ranges [INT_MAX+1 .. UINT_MAX], [LONG_MAX+1 .. ULONG_MAX], [LLONG_MAX+1 .. ULLONG_MAX]). My take is that i + 0xFF should have the same type as +i, i.e. the type of i after integer promotion. With hex constants unsigned by default, this would not be true if i is an int, as 0xFF would be a uint.

  6. Dec is signed by default

Agreed. That's the C rule and most compilers issue a warning for 18446744073709551615 as it becomes unsigned due to lack of a large enough signed type.
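The C behaviour referred to for `i + 0xFF` can be illustrated with C11 `_Generic`. A minimal sketch; the `TYPE_TAG` macro and the function names are hypothetical and only exist for this example.

```c
#include <assert.h>

/* Hypothetical helper: map an expression to a one-letter tag naming its type. */
#define TYPE_TAG(e) _Generic((e),            \
    int: 'i', unsigned int: 'u',             \
    long: 'l', unsigned long: 'L',           \
    long long: 'q', unsigned long long: 'Q', \
    default: '?')

/* +i has the type of i after integer promotion: int. */
char tag_plus_i(int i)              { return TYPE_TAG(+i); }

/* In C, 0xFF is an int, so the sum keeps the type of +i. */
char tag_i_plus_hex(int i)          { return TYPE_TAG(i + 0xFF); }

/* With an explicitly unsigned constant, C's usual arithmetic conversions
 * make the whole sum unsigned int. */
char tag_i_plus_unsigned_hex(int i) { return TYPE_TAG(i + 0xFFu); }
```

The last function shows what C does once the constant is unsigned, which is the case the comment above is worried about.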

chqrlie avatar Jun 23 '25 08:06 chqrlie

The -1 parsing solves the problem of being able to write INT_MIN without the type being promoted to long.

Consider sizeof(-2147483648) in C. This one returns 8, while sizeof(-2147483647) returns 4. Including - in parsing means that C3 can give the type of -2147483648 to be int and not long.
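The C sizes mentioned here can be reproduced with a short check. This sketch assumes C99 or later and a 32-bit `int`; the function names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* This sketch assumes a 32-bit int. */
static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");

/* 2147483648 does not fit in int, so the constant gets a 64-bit signed type
 * before the unary minus is applied. */
size_t size_of_negated_2147483648(void) { return sizeof(-2147483648); }

/* 2147483647 fits in int, so its negation is an int too. */
size_t size_of_negated_2147483647(void) { return sizeof(-2147483647); }

/* The usual C workaround: build INT_MIN from two int operands. */
size_t size_of_int_min_idiom(void) { return sizeof(-2147483647 - 1); }
int idiom_equals_int_min(void)     { return (-2147483647 - 1) == INT_MIN; }
```

This is why `<limits.h>` implementations typically spell `INT_MIN` as `(-INT_MAX - 1)` rather than as a single negated constant.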

what is the type of the subtraction of 2 unsigned types? what is the type of the negation of an unsigned type?

The same unsigned type for both.

This is questionable: would 0x000000000 be a ulong then?

Yes.

would i + 0xFF still become unsigned

No, in C3 signed dominates over unsigned. If i is an int, then we get i + (uint)0xFF after promotion. The maximal type is then selected, which is int, after which both sides are implicitly converted to int, leaving the end result as i + (int)0xFF.

lerno avatar Jun 23 '25 14:06 lerno

Since we continue the discussion, let me reopen this.

lerno avatar Jun 23 '25 14:06 lerno

The -1 parsing solves the problem of being able to write INT_MIN without the type being promoted to long.

Consider sizeof(-2147483648) in C. This one returns 8, while sizeof(-2147483647) returns 4. Including - in parsing means that C3 can give the type of -2147483648 to be int and not long.

Indeed sizeof(-2147483648) is 8 whereas sizeof(-2147483647-1) is 4 in C, and this is not very intuitive, yet I would much prefer a warning on 2147483648 suggesting the use of the L suffix.

Baking the unary - into the integer literal token opens Pandora's box. Here are a few examples:

  • what is the type of -0x1 ?
  • do we have sizeof(-2147483648) != sizeof(- 2147483648) ?
  • if you ever thought of adding the exponentiation operator **, this would make -1**2 equal 1 instead of -1.
  • in a template, how would you parse -n where n is a template argument ?

what is the type of the subtraction of 2 unsigned types? what is the type of the negation of an unsigned type? The same unsigned type for both.

OK, what about mixed types ? int + uint -> int or uint ?

This is questionable: would 0x000000000 be a ulong then? Yes.

It looks like a hack... the L suffix is a much more readable way to specify the type: 0x0L or 0L. Btw would 0x0L be a ulong ?

would i + 0xFF still become unsigned

No, in C3 signed dominates over unsigned. If i is an int, then we get i + (uint)0xFF after promotion. The maximal type is then selected, which is int, after which both sides are implicitly converted to int, leaving the end result as i + (int)0xFF.

This is a major departure from the C semantics where int + uint -> uint. Nasty side effect:

uint x = 0xffffffff;
if (x - 1 > 0) {
    printf("OK");
} else {
    printf("not OK");  // this branch gets executed if adding a signed and an unsigned evaluates to a signed.
}

Worse even: x + 0 becomes signed too :(
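The two outcomes can be compared side by side in C. This is a sketch under stated assumptions: it assumes a 32-bit `int`, and the signed-dominant rule is modelled by hand with a hypothetical helper (`as_signed32`), not produced by a C3 compiler. The arithmetic is done in 64 bits to avoid signed overflow in the model.

```c
#include <assert.h>
#include <stdint.h>

/* This sketch assumes a 32-bit int, so uint32_t is unsigned int. */
static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");

/* Hypothetical helper: the two's-complement signed reading of a 32-bit
 * pattern, computed without implementation-defined conversions. */
static int64_t as_signed32(uint32_t u) {
    return u <= INT32_MAX ? (int64_t)u : (int64_t)u - 4294967296LL;
}

/* C rule: 1 is converted to unsigned, so x - 1 is unsigned. */
int c_rule_x_minus_1_positive(uint32_t x) { return x - 1 > 0; }

/* Signed-dominant model: x is read as signed before subtracting. */
int signed_rule_x_minus_1_positive(uint32_t x) { return as_signed32(x) - 1 > 0; }
```

For x = 0xffffffff the C rule takes the "OK" branch and the signed model takes the "not OK" branch. The two agree for small values such as 5, and differ again at x = 0, where the C rule wraps around to a large positive value.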

There is no magical solution to this semantic nightmare, but departing from subtle rules documented and learned by millions of programmers seems a bad idea. Simplicity should dictate this:

  • the same semantics should apply to expressions involving literals and variables
  • there should be a clear rule to determine the type of a literal
  • C expression semantics should not be changed unless absolutely necessary.
  • ambiguous and confusing expressions should be marked as requiring parentheses or other explicit markers (suffix, casts...)

chqrlie avatar Jun 27 '25 14:06 chqrlie

what is the type of -0x1 ?

int

do we have sizeof(-2147483648) != sizeof(- 2147483648) ?

No.

if you ever thought of adding the exponentiation operator **, this would make -1**2 equal 1 instead of -1.

I considered it very early on. It's definitely not in.

in a template, how would you parse -n where n is a template argument ?

The normal way.

OK, what about mixed types ? int + uint -> int or uint ?

int

Btw would 0x0L be a ulong ?

It would be a long

This is a major departure from the C semantics where int + uint -> uint. Nasty side effect:

Sign changing issues with uint only occur if they exceed INT_MAX. Compare this to the int + uint -> uint of C, which has issues for any negative value of int. It's just so much worse. Use U on constants when you have uint, that's the simple rule.
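The C side of this argument can be illustrated with a small sketch. It assumes a 32-bit `int`; the function names are hypothetical, and the second function only models the mathematical result, which matches a signed result as long as the uint operand does not exceed INT_MAX.

```c
#include <assert.h>

/* This sketch assumes a 32-bit int. */
static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");

/* C rule: i is converted to unsigned int before the addition, so a negative
 * i wraps around and the unsigned sum is widened unchanged. */
long long c_mixed_sum(int i, unsigned int u) { return i + u; }

/* Model of the mathematical sum: both operands widened to a signed type
 * that can hold either of them. */
long long signed_mixed_sum(int i, unsigned int u) {
    return (long long)i + (long long)u;
}
```

With i = -2 and u = 1, the C rule yields 4294967295 while the mathematical sum is -1. With a non-negative i the two agree.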

the same semantics should apply to expressions involving literals and variables

I used to agree, but I found a different approach: make casts signal unsafe areas. That means trying to keep casts minimal. This means also accepting unsigned <-> signed conversions, even though they are unsafe for unsigned > INT_MAX. Because unsigned values are bad as general purpose types. People who use them to say "this value can't be less than zero" are misguided, at least under C semantics. They're good for optimizing storage and bit ops, but that's about it.

lerno avatar Jun 27 '25 15:06 lerno

Unless there is something to add to this I'll close it?

lerno avatar Oct 08 '25 21:10 lerno