
Definition of "isa" has an OS-specific vocabulary and various other corner cases

Open smcv opened this issue 1 year ago • 6 comments

The isa field is currently defined to be a possible output of uname -m, which isn't necessarily a great fit for build systems for several reasons:

  • The existence of uname -m is a Unixism: as far as I'm aware, Windows doesn't have it at all. Is there a meaningful definition of what the isa should be on Windows, to distinguish between i386, x86_64 and others?

  • Different OSs represent the same ISA in uname -m differently. For example, Darwin's arm64 is the same as Linux's aarch64 according to GNU config.guess, the conventional Windows name for what Linux calls x86_64 is x64, and PowerPC is variously powerpc{,64} or ppc{,64}.

  • Sometimes the same ISA has multiple representations even on the same OS. For example, on Linux, i386 up to i686 are all the same ISA really, and semi-arbitrary strings like armv5tel are the same ISA as arm. The current CPS spec seems to consider i586 and i686 to be distinct ISAs, and similarly arm and armv5tel: it seems bad if a CPS-based build system is encouraged to crash out with an error like "you are compiling for i686, but the version of libfoo we found was for i586".

  • Some CPUs like PowerPC and ARM can be run in two modes, little-endian (LSB first) or big-endian (MSB first); some vocabularies of CPU families represent this as part of the architecture name, and some do not. For example, Linux uname -m on 64-bit PowerPC can output either ppc64 or ppc64le, but Meson considers both of those to be members of the ppc64 CPU family. At the moment CPS seems to consider ppc64 and ppc64le to be distinct, but it isn't clear whether this is really intentional.

(See GNU's /usr/share/misc/config.guess and /usr/share/misc/config.sub on a Linux system for many more examples of the output of uname -m needing normalization or postprocessing.)
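As a rough illustration of the normalization the points above call for, here is a minimal Python sketch. The alias table is illustrative only, not a normative vocabulary: the canonical names loosely follow Meson's CPU families, and `normalize_isa` and `_ALIASES` are hypothetical names that are not part of CPS or Meson.

```python
import re

# Illustrative alias table (an assumption, not a normative vocabulary).
_ALIASES = {
    "arm64": "aarch64",    # Darwin's name for what Linux calls aarch64
    "x64": "x86_64",       # conventional Windows name
    "amd64": "x86_64",
    "powerpc": "ppc",
    "powerpc64": "ppc64",
    "ppc64le": "ppc64",    # endianness folded into the family, as Meson does
}


def normalize_isa(raw: str) -> str:
    """Map a uname -m style string to a single canonical family name."""
    name = raw.strip().lower()
    if re.fullmatch(r"i[3-6]86", name):
        return "x86"               # i386 up to i686 are one family
    if name in _ALIASES:
        return _ALIASES[name]
    if name.startswith("arm"):
        return "arm"               # e.g. armv5tel, armv7l
    return name                    # unknown names pass through unchanged
```

With a table like this, `normalize_isa("i586")` and `normalize_isa("i686")` both yield `"x86"`, so a build system comparing the normalized values would not reject one for the other.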

If the ISA is important information to appear in these files, I'd suggest having a normative vocabulary of architecture names, like Meson does: https://mesonbuild.com/Reference-tables.html#cpu-families (the table ends with "Any cpu family not listed in the above list is not guaranteed to remain stable in future releases").

Defining the OS as being uname -s has many of the same issues.

smcv avatar Sep 27 '24 18:09 smcv

If the ISA is important information to appear in these files

...I think so? If I'm building for ia64 (why? :wink:) and I find a package built for ppc64, I'm not going to be able to link that, am I?

That said, platform compatibility is an area that's known to need a complete overhaul, so please don't hold your breath expecting rapid progress. However, I think the idea of having an explicit registry has merit.

mwoehlke avatar Oct 21 '24 20:10 mwoehlke

I really, really, really want to have this. The number of issues we've fielded in Meson that turned out to be "my pkg-config picked up a .pc file for my build machine on a host machine target" is enough to make me pull my hair out.

I'm obviously biased, but the tables approach has worked fairly well for Meson so far.

dcbaker avatar Oct 22 '24 04:10 dcbaker

I've been working on a solution for these issues as part of the EcoIS, which is twofold. The first is bringing back P1864 so that CPS could at least have an idea of "common" names to use for ISAs.

The second is having a superset of CPS configurations folded into a well known directory layout much akin to Apple's .xcframework.

Unfortunately sudden health issues and work requirements have resulted in almost no time allowed for working on these :(

I would argue just listing an ISA is not enough. A full target tuple is necessary to know what a package does (e.g., "does this C package use the Windows calling convention") so that a build system can select the correct option. This also opens CPS up to platforms that would be considered old and dead (e.g., the SNES), which would be a boon for retrocomputing, both as a hobby via homebrew and as a field of study for older compiler toolchains.

bruxisma avatar Oct 23 '24 17:10 bruxisma

I would argue just listing an ISA is not enough

Heartily seconded. You don't want to find a Windows package when building for Linux... and "windows"/"linux" are probably not adequate, either, for their axis. "Platform" is a many-dimensional concept for which most of the axes matter.

The trick is figuring out a) what the axes are, and b) what the set of possible values is for each. Personally, I'm not convinced a tuple is the right data structure. CPS, as it stands, is using the equivalent of a dictionary.

mwoehlke avatar Oct 23 '24 20:10 mwoehlke

☝🤓 well akshually (forgive me for that, but also don't)

A tuple is just a dictionary whose keys are the indices; if those indices are tied to names, it's a named tuple, and a named tuple is just a dictionary.
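A small Python sketch of that equivalence; the `Target` type and its field names are made up for illustration and are not a proposed CPS schema:

```python
from collections import namedtuple

# Hypothetical axes, chosen only for the example.
Target = namedtuple("Target", ["isa", "os", "abi"])

t = Target(isa="x86_64", os="linux", abi="lp64")

as_plain_tuple = tuple(t)          # positional view
as_index_dict = dict(enumerate(t)) # the same data keyed by index
as_name_dict = t._asdict()         # the same data keyed by name
```

All three views carry the same values; only the keys differ.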

bruxisma avatar Oct 24 '24 00:10 bruxisma

I find it to be really unclear what the intended purpose of the "isa" field is here. I think that as currently defined, the field should not be used as anything besides a human-readable informational note.

  • Instruction sets being different doesn't mean that code can't be linked. For example, i586 and i686 code can be linked, but unless the application performs run-time cpu detection to choose whether to use the i686 code, the result will not run on an i586 cpu.
  • Instruction sets being the same doesn't mean that code can be linked. For example, a system that uname -m reports as "riscv64", and which has the isa string "rv64imafd" as defined by RISC-V specs, could use either the "LP64" or "LP64D" abis - but trying to mix code with those two abis will result in an error at link time.

If you want to report the "general cpu architecture", then you need a specified vocabulary which combines compatible architectures into a single value. For example, use "x86" for all 32-bit intel i386 compatible cpus.

If you want to report "the instruction set used by the compiled code", then you need a much more complicated system which can specify not only the specific isa, but also any isa extensions in use. For example, "i686+mmx+sse+sse2", "x86-64-v3", "rv64imafdcv_zifencei_zihintpause_zkt_zvkt".
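To give a sense of what that would involve, here is a minimal Python sketch that handles only the "+"-separated form such as "i686+mmx+sse+sse2". The function names `parse_isa` and `can_run` are hypothetical, and the sketch deliberately ignores the other notations (x86-64 levels, RISC-V strings) and any ordering between base ISAs such as i586 and i686.

```python
from typing import FrozenSet, Tuple


def parse_isa(spec: str) -> Tuple[str, FrozenSet[str]]:
    """Split 'base+ext1+ext2' into the base ISA and its set of extensions."""
    base, *extensions = spec.lower().split("+")
    return base, frozenset(extensions)


def can_run(code_spec: str, cpu_spec: str) -> bool:
    """True if the CPU has the same base ISA and every extension the code uses.

    Simplification (an assumption of this sketch): base ISAs must match
    exactly, so it does not know that an i686 CPU can run i586 code.
    """
    code_base, code_ext = parse_isa(code_spec)
    cpu_base, cpu_ext = parse_isa(cpu_spec)
    return code_base == cpu_base and code_ext <= cpu_ext
```

Even this toy version needs a set comparison rather than string equality, which is the extra complexity a real vocabulary would have to specify.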

If you want to report whether code is compatible for linking, then you need to return values that represent the linking ABI, which specifies things like stack layout, sizes of basic types, and which registers are used in function calls for parameters and return values. For ELF platforms, these are defined by the "psABI" documents for a specific processor architecture. For example, on x86-64 it can be either "LP64" or "ILP32" (sometimes called "x32"), and RISC-V defines "LP64", "LP64F", "LP64D", "LP64Q" for 64-bit code (different floating point parameter passing conventions), plus a set for 32-bit code, and some additional ones in draft stage.
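A minimal Python sketch of that distinction, assuming a hypothetical `Target` record in which the ISA string is informational and only the CPU family and ABI decide linkability. The field names and the "ilp32" label used for 32-bit x86 are placeholders for illustration, not CPS definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Target:
    cpu_family: str                               # e.g. "riscv64", "x86"
    abi: str                                      # e.g. "lp64", "lp64d"
    isa: str = field(default="", compare=False)   # informational only


def link_compatible(a: Target, b: Target) -> bool:
    """Objects are linkable when family and ABI match; the ISA is ignored."""
    return a == b


# Same ISA string, different ABI: not linkable.
soft_float = Target("riscv64", "lp64", isa="rv64imafd")
hard_float = Target("riscv64", "lp64d", isa="rv64imafd")

# Different ISA strings, same ABI: linkable (though possibly not runnable).
pentium = Target("x86", "ilp32", isa="i586")
ppro = Target("x86", "ilp32", isa="i686")
```

Here `link_compatible(soft_float, hard_float)` is false while `link_compatible(pentium, ppro)` is true, matching the two bullet points earlier in this comment.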

(Note that in some cases you can use manually specified compiler flags to generate code which is not compatible with a standard ABI, for example by using -msseregparm to pass floating point values in SSE registers instead of x87 registers on 32-bit x86 - I'm not sure whether and how it would make sense to support arbitrary things like that.)

kepstin avatar Feb 21 '25 01:02 kepstin