zig Improvements for UEFI

Just run: zig build -Dtarget=x86_64-uefi -Dsingle-threaded --zig-lib-dir lib

This PR introduces several improvements for UEFI, so far:

std.process.ArgvIterator support
std.process.totalSystemMemory support
UEFI Shell Protocol
Merged @truemedian's uefi-rework branch

Planned:

std.process.EnvMap support

Mar 30 '24 00:03 RossComputerGuy

Correct me if I'm wrong, but my assumption would be that UEFI should be using WTF-16 <-> WTF-8 conversion instead of UTF-16 <-> UTF-8 (see https://github.com/ziglang/zig/pull/19005 for context).

(links to any resources about these sorts of UEFI details would be appreciated if anyone has them handy)

Either way, the doc comments of either the InvalidUtf8 or InvalidWtf8 error in ChangeCurDirError will need to be updated accordingly.

Mar 30 '24 11:03 squeek502

@squeek502 The UEFI spec mentions UTF-16 and UTF-8 but not WTF-8 or WTF-16.

Mar 30 '24 13:03 RossComputerGuy

https://uefi.org/specs/UEFI/2.10/02_Overview.html?highlight=char16#data-types

CHAR16

2-byte Character.

Unless otherwise specified all characters and strings are stored in the UCS-2 encoding format as defined by Unicode 2.1 and ISO/IEC 10646 standards.

UCS-2 is a fixed width encoding that uses two bytes for each character; meaning, it can represent up to a total of 2^16 characters, or slightly over 65 thousand. On the other hand, UTF-16 is a variable width encoding scheme that uses a minimum of 2 bytes and a maximum of 4 bytes for each character. This lets UTF-16 represent any character in Unicode while using minimal space for the most commonly used characters. For the majority of the 65,000+ characters, UCS-2 and UTF-16 have identical code points; so they are largely equivalent. This lets UTF-16 capable applications correctly interpret UCS-2 codes. But the other way around would not work due to the many enhancements in UTF-16.
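To make the fixed-width versus variable-width point concrete, here is a small Python sketch (not Zig, and not part of this PR). The helper name `utf16_code_units` is made up for illustration. It shows that a character inside the 16-bit range takes one UTF-16 code unit, while a character outside it takes a surrogate pair, which a UCS-2 consumer would see as two separate "characters".

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of `text` as integers (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    # Each code unit is 2 bytes, little-endian.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# U+0041 fits in a single 16-bit unit, so UCS-2 and UTF-16 agree.
print([hex(u) for u in utf16_code_units("A")])           # ['0x41']

# U+1F600 is outside the 16-bit range: UTF-16 needs a surrogate pair (4 bytes).
print([hex(u) for u in utf16_code_units("\U0001F600")])  # ['0xd83d', '0xde00']
```

Both halves of the pair fall in 0xD800 through 0xDFFF, which is the range the rest of this thread is concerned with.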

Mar 30 '24 14:03 llogick

https://www.ibm.com/docs/en/i/7.4?topic=unicode-ucs-2-its-relationship-utf-16

The UCS-2 standard, an early version of Unicode, is limited to 65 535 characters. However, the data processing industry needs over 94 000 characters; the UCS-2 standard has been superseded by the Unicode UTF-16 standard.

So should we implement UCS-2 specific functions in the Unicode module or stick with the UTF-16 methods?

Mar 30 '24 14:03 RossComputerGuy

The situation with UEFI sounds like it is the same as the situation with Windows (arbitrary sequences of u16), so I think the right way to go is WTF-16 <-> WTF-8.

(WTF-16 is an informal name for potentially ill-formed UTF-16, or UTF-16 with potentially unpaired surrogates).

See the motivation section of the WTF-8 spec: https://simonsapin.github.io/wtf-8/#motivation and the description of https://github.com/ziglang/zig/pull/19005

Mar 30 '24 19:03 squeek502

> The situation with UEFI sounds like it is the same as the situation with Windows

Yeah, UEFI borrows a lot from Windows, so no surprise there.

> so I think the right way to go is WTF-16 <-> WTF-8.

Has anyone tried WTF-8/16 encodings on UEFI? I've only seen UTF-8/16.

Mar 30 '24 22:03 RossComputerGuy

From the linked specs, UEFI uses UCS-2, which is essentially just sequences of u16. The only difference from UTF-16 is that well-formed UTF-16 disallows unpaired surrogate codepoints, meaning the code units 0xD800 through 0xDFFF. Because UCS-2 allows any u16 (including 0xD800 through 0xDFFF), this means that not all valid UCS-2 can be represented as UTF-16 (and also that not all valid UCS-2 can be represented as UTF-8).

WTF-16 is functionally equivalent to UCS-2 for our purposes, and WTF-8 is a superset of UTF-8 that allows the codepoints U+D800 to U+DFFF (surrogate codepoints) to be encoded using the normal UTF-8 encoding algorithm. Since U+D800 to U+DFFF are the only WTF-16 code units that are normally unrepresentable in UTF-8, this alone is sufficient to losslessly round-trip between WTF-16 and WTF-8.
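As a rough illustration of the round trip described above, here is a Python sketch (not the Zig std.unicode API). It assumes Python's `surrogatepass` error handler is an acceptable stand-in for WTF-8/WTF-16 conversion. The function names `wtf16_to_wtf8` and `wtf8_to_wtf16` are hypothetical. One caveat: `surrogatepass` is more permissive on decode than the WTF-8 spec, which forbids a surrogate pair encoded as two 3-byte sequences.

```python
def wtf16_to_wtf8(units: list[int]) -> bytes:
    """Convert possibly ill-formed UTF-16 code units to WTF-8-style bytes."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # surrogatepass keeps unpaired surrogates instead of raising an error.
    text = raw.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_wtf16(data: bytes) -> list[int]:
    """Convert WTF-8-style bytes back to 16-bit code units."""
    text = data.decode("utf-8", errors="surrogatepass")
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# A name containing an unpaired high surrogate (0xD800) between 'f' and 'o'.
name = [0x66, 0xD800, 0x6F]

encoded = wtf16_to_wtf8(name)
print(encoded)                         # b'f\xed\xa0\x80o'
print(wtf8_to_wtf16(encoded) == name)  # True

# Strict UTF-8 has no encoding for the same code point.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 rejects the lone surrogate")
```

The surrogate is encoded with the ordinary 3-byte UTF-8 pattern (ED A0 80), which is exactly the extra case WTF-8 permits.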

If you'd like to test whether or not WTF-16 is needed, the way to go would be to create a file with any value from 0xD800 to 0xDFFF as one of the u16s in the filename. If the filesystem supports creating a file with such a name (which I'm assuming it does), then UTF-16 <-> UTF-8 is not going to cut it.

Mar 30 '24 22:03 squeek502

> If you'd like to test this, the way to go would be to create a file with any value from 0xD800 to 0xDFFF as one of the u16s. If the filesystem supports creating a file with such a name (which I'm assuming it does), then UTF-16 <-> UTF-8 is not going to cut it.

Well, I can give it a try. I'm currently porting over @truemedian's UEFI stuff from their own fork so this PR will include file system operations.

Mar 30 '24 22:03 RossComputerGuy

I've merged @truemedian's Zig branch for the uefi-rework into this one and added a couple of improvements along the way.

Mar 31 '24 01:03 RossComputerGuy

Ok, so I got the compiler to build for UEFI. This PR still needs a lot of work though.

Mar 31 '24 06:03 RossComputerGuy

Some progress has been made. zig env works (libc hangs) but crashes at the end; zig targets prints but crashes with an OOB error.

Apr 19 '24 19:04 RossComputerGuy

draft status for > 30 days

Jun 08 '24 20:06 andrewrk