Improvements for UEFI

Open RossComputerGuy opened this issue 1 year ago • 11 comments

Just run: zig build -Dtarget=x86_64-uefi -Dsingle-threaded --zig-lib-dir lib

This PR introduces several improvements for UEFI, so far:

  • std.process.ArgvIterator support
  • std.process.totalSystemMemory support
  • UEFI Shell Protocol
  • Merged @truemedian's uefi-rework branch

Planned:

  • std.process.EnvMap support

RossComputerGuy avatar Mar 30 '24 00:03 RossComputerGuy

Correct me if I'm wrong, but my assumption would be that UEFI should be using WTF-16 <-> WTF-8 conversion instead of UTF-16 <-> UTF-8 (see https://github.com/ziglang/zig/pull/19005 for context).

(links to any resources about these sorts of UEFI details would be appreciated if anyone has them handy)

Either way, the doc comments of either the InvalidUtf8 or InvalidWtf8 error in ChangeCurDirError will need to be updated accordingly.

squeek502 avatar Mar 30 '24 11:03 squeek502

@squeek502 The UEFI spec mentions UTF-16 and UTF-8 but not WTF-8 or WTF-16.

RossComputerGuy avatar Mar 30 '24 13:03 RossComputerGuy

https://uefi.org/specs/UEFI/2.10/02_Overview.html?highlight=char16#data-types

CHAR16

2-byte Character.

Unless otherwise specified all characters and strings are stored in the UCS-2 encoding format as defined by Unicode 2.1 and ISO/IEC 10646 standards.
UCS-2 is a fixed-width encoding that uses two bytes for each character, meaning it can represent up to a total of 2^16 characters, or slightly over 65 thousand. On the other hand, UTF-16 is a variable-width encoding scheme that uses a minimum of 2 bytes and a maximum of 4 bytes for each character. This lets UTF-16 represent any character in Unicode while using minimal space for the most commonly used characters. For the majority of the 65,000+ characters, UCS-2 and UTF-16 have identical code points, so they are largely equivalent. This lets UTF-16-capable applications correctly interpret UCS-2 codes. But the other way around would not work due to the many enhancements in UTF-16.
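To make the fixed-width versus variable-width point concrete, here is a minimal Python sketch (not Zig, and not part of this PR) that counts how many 16-bit code units UTF-16 needs for a character. The helper name `utf16_code_units` is made up for illustration.

```python
def utf16_code_units(ch):
    """Return how many 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-le")) // 2

bmp = "\u00e9"         # U+00E9, fits in the 16-bit range UCS-2 can address
astral = "\U0001F600"  # U+1F600, outside that range

print(utf16_code_units(bmp))     # 1 code unit (2 bytes)
print(utf16_code_units(astral))  # 2 code units (4 bytes, a surrogate pair)
```

A UCS-2-only consumer would see the second character as two separate 16-bit values rather than one character.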

llogick avatar Mar 30 '24 14:03 llogick

https://www.ibm.com/docs/en/i/7.4?topic=unicode-ucs-2-its-relationship-utf-16

The UCS-2 standard, an early version of Unicode, is limited to 65 535 characters. However, the data processing industry needs over 94 000 characters; the UCS-2 standard has been superseded by the Unicode UTF-16 standard.

So should we implement UCS-2 specific functions in the Unicode module or stick with the UTF-16 methods?

RossComputerGuy avatar Mar 30 '24 14:03 RossComputerGuy

The situation with UEFI sounds like it is the same as the situation with Windows (arbitrary sequences of u16), so I think the right way to go is WTF-16 <-> WTF-8.

(WTF-16 is an informal name for potentially ill-formed UTF-16, or UTF-16 with potentially unpaired surrogates).

See the motivation section of the WTF-8 spec: https://simonsapin.github.io/wtf-8/#motivation and the description of https://github.com/ziglang/zig/pull/19005

squeek502 avatar Mar 30 '24 19:03 squeek502

> The situation with UEFI sounds like it is the same as the situation with Windows

Yeah, well, UEFI borrows a lot from Windows, so no surprise there.

> so I think the right way to go is WTF-16 <-> WTF-8.

Has anyone tried WTF-8/16 encodings on UEFI? I've only seen UTF-8/16.

RossComputerGuy avatar Mar 30 '24 22:03 RossComputerGuy

From the linked specs, UEFI uses UCS-2, which is essentially just sequences of u16. The only difference from UTF-16 is that well-formed UTF-16 disallows unpaired surrogate codepoints, meaning the code units 0xD800 through 0xDFFF when they do not appear as a high/low pair. Because UCS-2 allows any u16 (including 0xD800 through 0xDFFF), this means that not all valid UCS-2 can be represented as UTF-16 (and also that not all valid UCS-2 can be represented as UTF-8).
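As a rough illustration of that claim, the following Python sketch (an assumption-laden stand-in, not the Zig `std.unicode` code under discussion) checks whether an arbitrary sequence of u16 values is well-formed UTF-16. The function name `is_well_formed_utf16` is hypothetical.

```python
import struct

def is_well_formed_utf16(units):
    """True if the u16 sequence decodes as strict UTF-16 (no unpaired surrogates)."""
    raw = struct.pack("<%dH" % len(units), *units)
    try:
        raw.decode("utf-16-le")  # strict mode rejects unpaired surrogates
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf16([0x0066, 0x006F, 0x006F]))  # plain "foo"
print(is_well_formed_utf16([0xD83D, 0xDE00]))          # a valid surrogate pair
print(is_well_formed_utf16([0x0066, 0xD800, 0x006F]))  # lone high surrogate
```

The first two sequences pass, while the third is a perfectly legal sequence of u16 values that a strict UTF-16 decoder refuses.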

WTF-16 is functionally equivalent to UCS-2 for our purposes, and WTF-8 is a superset of UTF-8 that allows the codepoints U+D800 to U+DFFF (surrogate codepoints) to be encoded using the normal UTF-8 encoding algorithm. Since U+D800 to U+DFFF are the only WTF-16 code units that are normally unrepresentable in UTF-8, this alone is sufficient to be able to losslessly roundtrip from WTF-8 to WTF-16.
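The lossless round trip can be sketched in Python using the `surrogatepass` error handler, which encodes lone surrogates with the normal UTF-8 algorithm. This only approximates WTF-8 for illustration and is not the Zig implementation. The function names `wtf16_to_wtf8` and `wtf8_to_wtf16` are made up here.

```python
import struct

def wtf16_to_wtf8(units):
    """Convert a list of u16 code units to bytes, keeping unpaired surrogates."""
    raw = struct.pack("<%dH" % len(units), *units)
    text = raw.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")

def wtf8_to_wtf16(data):
    """Convert the bytes back to a list of u16 code units."""
    text = data.decode("utf-8", errors="surrogatepass")
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

name = [0x0066, 0xD800, 0x006F]  # "f", lone high surrogate, "o"
encoded = wtf16_to_wtf8(name)
print(encoded)                   # b'f\xed\xa0\x80o'
print(wtf8_to_wtf16(encoded) == name)
```

The lone surrogate U+D800 becomes the three bytes ED A0 80, which strict UTF-8 forbids but which decode back to the original code unit.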

If you'd like to test whether or not WTF-16 is needed, the way to go would be to create a file with any value from 0xD800 to 0xDFFF as one of the u16s in the filename. If the filesystem supports creating a file with such a name (which I'm assuming it does), then UTF-16 <-> UTF-8 is not going to cut it.

squeek502 avatar Mar 30 '24 22:03 squeek502

> If you'd like to test this, the way to go would be to create a file with any value from 0xD800 to 0xDFFF as one of the u16s. If the filesystem supports creating a file with such a name (which I'm assuming it does), then UTF-16 <-> UTF-8 is not going to cut it.

Well, I can give it a try. I'm currently porting over @truemedian's UEFI stuff from their own fork so this PR will include file system operations.

RossComputerGuy avatar Mar 30 '24 22:03 RossComputerGuy

I've merged @truemedian's Zig branch for the uefi-rework into this one and added a couple of improvements along the way.

RossComputerGuy avatar Mar 31 '24 01:03 RossComputerGuy

Ok, so I got the compiler to build for UEFI. This PR still needs a lot of work though.

RossComputerGuy avatar Mar 31 '24 06:03 RossComputerGuy

Some progress has been made. zig env works (libc hangs) but crashes at the end; zig targets prints but crashes with an OOB error.

RossComputerGuy avatar Apr 19 '24 19:04 RossComputerGuy

draft status for > 30 days

andrewrk avatar Jun 08 '24 20:06 andrewrk