[Feature Request] Keep the integer literal radices of C and C++ in generated Rust

Open miikkas opened this issue 6 months ago • 8 comments

It would be great if the Rust code generated by bindgen kept the number bases of the integer literals in the input C and C++ code. After all, the writer of the C or C++ code probably had a good reason for using a specific radix.

C23[^1], C++14[^2], and Rust[^3] all support binary, octal, and hexadecimal integer literals, with C23 and C++14 sharing the same syntax. A one-to-one mapping to Rust seems to be possible.

For example, given the following code working in both C23 and C++14:

#define BIN_LIT 0b10
#define OCT_LIT 010
#define HEX_LIT 0x10

the generated Rust bindings would be:

pub const BIN_LIT: u32 = 0b10;
pub const OCT_LIT: u32 = 0o10;
pub const HEX_LIT: u32 = 0x10;

In addition to the #defined constants, the number bases of e.g. consts and enums would be kept as well.
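The mapping described above can be sketched roughly as follows. This is only an illustration, not bindgen's implementation: the names `Radix`, `split_c_literal`, and `c_literal_to_rust` are hypothetical, and the sketch assumes a single token with at most the `u`/`l` family of integer suffixes (no digit separators, no `wb` suffix).

```rust
/// Radix of an integer literal as written in the C or C++ source.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Radix {
    Binary,
    Octal,
    Decimal,
    Hexadecimal,
}

impl Radix {
    /// Numeric base, used to validate the digits.
    fn base(self) -> u32 {
        match self {
            Radix::Binary => 2,
            Radix::Octal => 8,
            Radix::Decimal => 10,
            Radix::Hexadecimal => 16,
        }
    }

    /// Prefix that Rust uses for this radix.
    fn rust_prefix(self) -> &'static str {
        match self {
            Radix::Binary => "0b",
            Radix::Octal => "0o",
            Radix::Decimal => "",
            Radix::Hexadecimal => "0x",
        }
    }
}

/// Splits a C/C++ integer literal into its radix and its digits.
/// Returns `None` if the token is not a plain integer literal.
pub fn split_c_literal(token: &str) -> Option<(Radix, &str)> {
    // Assumption: only `u`/`U`/`l`/`L` suffix characters need to be dropped.
    let body = token.trim_end_matches(|c: char| matches!(c, 'u' | 'U' | 'l' | 'L'));
    let (radix, digits) = if let Some(d) = body.strip_prefix("0x").or_else(|| body.strip_prefix("0X")) {
        (Radix::Hexadecimal, d)
    } else if let Some(d) = body.strip_prefix("0b").or_else(|| body.strip_prefix("0B")) {
        (Radix::Binary, d)
    } else if body.len() > 1 && body.starts_with('0') {
        // A leading zero marks an octal literal in C and C++.
        (Radix::Octal, &body[1..])
    } else {
        (Radix::Decimal, body)
    };
    let valid = !digits.is_empty() && digits.chars().all(|c| c.is_digit(radix.base()));
    valid.then_some((radix, digits))
}

/// Re-emits the literal with Rust's prefix for the same radix.
pub fn c_literal_to_rust(token: &str) -> Option<String> {
    let (radix, digits) = split_c_literal(token)?;
    Some(format!("{}{}", radix.rust_prefix(), digits))
}
```

With these definitions, `c_literal_to_rust("010")` evaluates to `Some("0o10".to_string())`, while a malformed octal token such as `08` yields `None`.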

A bit of not too serious motivation

Just for reference, in case someone finds this interesting: I ran the following command in a fresh git clone of the Linux kernel repository to find lines, in .h files only, that contain a #define followed by a hexadecimal literal later on the same line:

rg "\#define.+\b0x" --type h | wc -l

Currently, that's 4 519 652 such lines.

@rustbot label enhancement

[^1]: ISO/IEC 9899:2024 (en), draft N3220, chapter 6.4.4.2 Integer constants
[^2]: ISO/IEC 14882:2014, draft N3797, chapter 2.14.2 Integer literals
[^3]: Rust Reference: 2.6 Tokens, Number literals; 8.2.1 Literal expressions, Integer literal expressions

miikkas avatar Jun 18 '25 17:06 miikkas

I have actually made an attempt at a prototype for this, which I'll submit as a PR shortly.

edit: PR with a ready solution, (hopefully!) better than prototype quality, is submitted.

miikkas avatar Jun 18 '25 17:06 miikkas

This feature would presumably be somewhat invasive: many projects updating bindgen might need to update lots of generated bindings whose literal values change to use the original radix. This, in turn, might lead to a lot of work in those projects to review and test the changes, or cause them to postpone updating bindgen.

If that is a concern, the feature could be phased in to reduce churn and give projects using bindgen some breathing room: add an option to Builder that defaults to disabled in the first version where it is introduced, switch the default to enabled in the next version, deprecate the option in the one after that, and so on. What do you think?
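To make the opt-in idea concrete, here is a minimal sketch of the first phase. `SketchBuilder`, `keep_literal_radix`, and `output_base` are made-up names for illustration; they are not part of bindgen's real `Builder` API.

```rust
/// Hypothetical stand-in for a builder; NOT bindgen's real `Builder`.
#[derive(Debug, Clone, Default)]
pub struct SketchBuilder {
    /// Phase one: the feature is off unless explicitly requested,
    /// so `Default` leaves this at `false`.
    keep_literal_radix: bool,
}

impl SketchBuilder {
    /// Opt in to (or out of) keeping the radix of the original literal.
    pub fn keep_literal_radix(mut self, doit: bool) -> Self {
        self.keep_literal_radix = doit;
        self
    }

    /// Base to use for the generated literal, given the base that
    /// was used in the C or C++ source.
    pub fn output_base(&self, original_base: u32) -> u32 {
        if self.keep_literal_radix {
            original_base
        } else {
            10 // today's behaviour: always decimal
        }
    }
}
```

In a later phase, only the default value of the field would need to change, so existing callers that set the option explicitly would see no difference.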

miikkas avatar Jul 12 '25 06:07 miikkas

Even if it's not the default, I would like this to be an option. A lot of times, if it's not a base 10 value, the radix is used to show some pattern in the literals.

Also the ability to specify the base/radix would be nice.

Keith-Cancel avatar Jul 21 '25 05:07 Keith-Cancel

Even if it's not the default, I would like this to be an option.

If you'd like to try this out, feel free to check my implementation in PR https://github.com/rust-lang/rust-bindgen/pull/3237. :) I found that it seems to work well on some code used at $dayjob.

miikkas avatar Jul 21 '25 12:07 miikkas

On a more general level, I think this feature is connected with the question of how much of the underlying code should be evaluated, and how much should be left as is. The philosophy of retaining the radix of integer literals leans a bit towards keeping some of the original input untouched.

That said, what is the correct radix of an integer literal in the generated Rust bindings?

Overall, I think the scope of the problem should be restricted to retaining a radix used originally, or failing that, falling back to using decimal by default.

Below, I have non-exhaustively collected some cases that attempt to answer the question from a human point of view. The conclusions may differ from the other important point of view of how difficult it would be for software to arrive at the preferred human decision.

Straightforward

In these cases, we should just keep the radix of the original literal(s) appearing in the definition.

A constant is defined directly as some literal:

#define MY_LITERAL 0xF00D

A constant is defined as a literal wrapped in parentheses or prefixed with a unary operator:

#define NEG_LITERAL (-0606)
#define NOT_LITERAL (~0b11101)
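As a rough sketch of how such definitions could be reduced to a bare literal, the hypothetical helper below peels off outer parentheses and one leading unary operator. It assumes the definition consists of a single literal; anything else is expected to be rejected later, when the remaining text fails to parse as a literal.

```rust
/// Peels outer parentheses and one leading unary operator off a macro
/// definition, returning the operator (if any) and the remaining text.
/// (Hypothetical helper, for illustration only.)
pub fn peel_definition(def: &str) -> (Option<char>, &str) {
    let mut rest = def.trim();
    // Remove outer parentheses, e.g. "((0x10))" -> "0x10".
    // Assumption: the input is a single literal; something like
    // "(a)+(b)" leaves text that is not a literal and must be
    // rejected by the caller.
    while let Some(inner) = rest.strip_prefix('(').and_then(|r| r.strip_suffix(')')) {
        rest = inner.trim();
    }
    let mut chars = rest.chars();
    match chars.next() {
        // `chars.as_str()` is what remains after the operator.
        Some(op @ ('-' | '+' | '~')) => (Some(op), chars.as_str().trim_start()),
        _ => (None, rest),
    }
}
```

For `(-0606)` this yields the operator `-` and the literal text `0606`. Note that C's `~` has no direct spelling in Rust, where bitwise NOT is written `!`, so a generator would have to translate the operator or fold it into the value.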

Straightforward, But Harder for Computers

In these cases, we should keep the radix of the original literal(s) appearing in the definition as well.

(Based on @ojeda's comment below.) A literal with a cast:

#define MY_INT (int)42

An evaluated constant is based on a calculation involving several different integer literals of the same radix:

#define MY_HEX (0x10 + 0x20)

An evaluated constant is the same as some other constant:

#define MY_OLD_OCT 0123
#define MY_NEW_OCT MY_OLD_OCT

An evaluated constant is the result of a calculation made with other constants, all of which have the same radix:

#define LIT_A 0x10
#define LIT_B 0x20
#define LIT_C (LIT_A + LIT_B)

A function-like macro with a single literal operand is used:

#define SQUARE(a) ((a)*(a))
#define MY_SQUARE SQUARE(040)

A function-like macro with a single operand defined as a constant is used:

#define SQUARE(a) ((a)*(a))
#define MY_SIDE 040
#define MY_SQUARE SQUARE(MY_SIDE)

A function-like macro is used, where all the literal operands have the same radix:

#define MAX(a,b) (((a)>(b))?(a):(b))
#define MY_GREATER MAX(0xAB10, 0xABA0)
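The rule shared by the cases in this section (keep the radix when every literal involved agrees, otherwise fall back to decimal) could be sketched like this. `common_base` is a hypothetical helper, and collecting the bases of all literals that contribute to a definition is assumed to happen elsewhere.

```rust
/// Picks the base for an evaluated constant from the bases of all the
/// literals that contributed to it. (Hypothetical helper.)
pub fn common_base(bases: &[u32]) -> u32 {
    match bases.split_first() {
        // Every contributing literal uses the same base: keep it.
        Some((first, rest)) if rest.iter().all(|b| b == first) => *first,
        // Mixed bases, or no literal found at all: fall back to decimal.
        _ => 10,
    }
}
```

Under this rule `(0x10 + 0x20)` keeps base 16, and an alias such as `MY_NEW_OCT` inherits base 8 from the single literal behind it.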

More Difficult

In these cases, it's not clear-cut what the radix of the literal in the bindings should be.

A specific kind of calculation with mixed radices:

#define SHIFTED (0b11 << 1)
  • In the above case, outputting the result as 0b110 in binary seems sensible to me, since 0b11 is the value being operated on, while the 1 only says how far to shift it.
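One possible heuristic for the shift case, sketched under the assumption that the left operand's radix should win: evaluate the shift and print the result in that radix using Rust's standard `#b`, `#o`, and `#x` format flags. `render_shift` is a hypothetical name.

```rust
/// Evaluates `lhs << shift` and renders the result in the radix of the
/// left-hand operand. (Hypothetical helper; the radix of the shift
/// amount is deliberately ignored.)
pub fn render_shift(lhs_base: u32, lhs: u64, shift: u32) -> Option<String> {
    // `checked_shl` returns `None` when the shift amount is 64 or more.
    let value = lhs.checked_shl(shift)?;
    Some(match lhs_base {
        2 => format!("{value:#b}"),
        8 => format!("{value:#o}"),
        16 => format!("{value:#x}"),
        _ => value.to_string(), // decimal fallback
    })
}
```

For the `SHIFTED` example, `render_shift(2, 0b11, 1)` gives `Some("0b110".to_string())`.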

A calculation with mixed radices:

#define RESULT (0b11 * 0xEE)

A function-like macro is used, where the literal operands have mixed radices:

#define MAX(a,b) (((a)>(b))?(a):(b))
#define MY_GREATER MAX(0xAB10, 012345)

miikkas avatar Aug 21 '25 07:08 miikkas

I think keeping the prefix in the trivial cases is fine as a best-effort feature if it is easy to support, but any other case likely gets too complex to be worth it, i.e. it requires parsing C expressions (at that point, one could even start to consider transpiling some expressions too, to keep the "original input untouched" to some degree).

In addition, bindgen currently does not provide the value for non-trivial macros anyway (like your MY_GREATER example or e.g. (int)42). There is --clang-macro-fallback as a workaround to do so, but it uses the compiler to compute them, which is good.

ojeda avatar Aug 21 '25 09:08 ojeda

I think keeping the prefix in the trivial cases is fine as a best-effort feature if it is easy to support, but any other case likely gets too complex to be worth it, i.e. it requires parsing C expressions (at that point, one could even start to consider transpiling some expressions too, to keep the "original input untouched" to some degree).

Yeah, I definitely agree. My current version of the feature in the PR is exactly a best-effort implementation, which is already sufficient to support retaining non-decimal radices in 13 of the existing test header files.

It's a bit annoying that some of the cases that are non-trivial for computers would be quite cumbersome to implement, as it may give an unpolished impression. However, I think this is mitigated by starting with the opt-in approach of requiring a specific Builder option/CLI flag. I guess a full-blown C pre-processor + compiler library in Rust would help...

miikkas avatar Aug 21 '25 18:08 miikkas

I guess a full-blown C pre-processor + compiler library in Rust would help...

Yeah, adding one has been discussed in the past (mainly to get values from non-trivial macros, rather than readability, as far as I remember).

For readability it may be OK since there is "no risk" (apart from maintenance burden, compile-time, etc. I guess), but for computing actual values it can be quite risky, because one needs to ensure one behaves exactly like the compiler would, including compiler flags etc.

ojeda avatar Aug 21 '25 18:08 ojeda