icu
icu copied to clipboard
ICU-22838 Add WebAssembly/WASI cross-compilation support
Status: Blocked by std::mutex and std::atomic availability in WASI-SDK. (https://github.com/WebAssembly/wasi-sdk/pull/498 might be a solution we need but still need some discussions)
With this patch, we can cross-compile ICU4C with wasi-sdk as follows:
$ mkdir tools
$ mkdir cross
$ cd tools
$ ../../icu/icu4c/source/configure && make
$ cd ../cross
$ ../../icu/icu4c/source/configure --host=wasm32-unknown-wasi \
--with-cross-build="$PWD/../tools" \
--enable-static --disable-shared \
--disable-tools --disable-tests \
--disable-samples --disable-extras \
--with-data-packaging=static \
--prefix / \
CC="$WASI_SDK_PATH/bin/clang" CXX="$WASI_SDK_PATH/bin/clang++" \
AR="$WASI_SDK_PATH/bin/llvm-ar" RANLIB="$WASI_SDK_PATH/bin/llvm-ranlib" \
CFLAGS="--sysroot $WASI_SDK_PATH/share/wasi-sysroot -D_WASI_EMULATED_SIGNAL" \
CXXFLAGS="-fno-exceptions --sysroot $WASI_SDK_PATH/share/wasi-sysroot -D_WASI_EMULATED_SIGNAL"
$ make PKGDATA_OPTS="--without-assembly -O $PWD/data/icupkg.inc" -j16
Sanity check
$ cat sample.c
#include <stdio.h>
#include <unicode/ucnv.h>
#include <unicode/ustring.h>
int main() {
UErrorCode status = U_ZERO_ERROR;
UConverter *cnv = ucnv_open("UTF-8", &status);
if (U_FAILURE(status)) {
fprintf(stderr, "ucnv_open failed: %s\n", u_errorName(status));
return 1;
}
char buffer[1024];
int length = fread(buffer, 1, sizeof(buffer), stdin);
if (length < 0) {
fprintf(stderr, "fread failed\n");
return 1;
}
UChar uchars[1024];
int32_t uchars_length = ucnv_toUChars(cnv, uchars, sizeof(uchars) / sizeof(UChar), buffer, length, &status);
if (U_FAILURE(status)) {
fprintf(stderr, "ucnv_toUChars failed: %s\n", u_errorName(status));
return 1;
}
int32_t char_count = u_countChar32(uchars, uchars_length);
printf("%d\n", char_count);
return 0;
}
$ $WASI_SDK_PATH/bin/clang sample.c -I ../install/include/ -L ../install/lib/ -licuuc -licudata
$ echo "Pokémon" | wasmtime ./a.out
8
Patch overview
-
Add config/mh-wasi and update configure script.
WebAssembly/WASI does not support threads and dynamic linking, so the new mh-wasi file disables those features in ICU. The new config file
mh-wasiis used when the target matches*-wasi*, which includeswasm32-unknown-wasi,wasm32-unknown-wasip1, and etc. -
Teach platform.h not to define U_TZSET, U_TIMEZONE, and U_TZNAME for WASI
WASI does not define timezone-related functionalities, and wasi-libc does not provide tzset, timezone, and tzname. This change teaches platform.h not to define U_TZSET, U_TIMEZONE, and U_TZNAME for WASI.
-
Add
U_HAVE_ATOMICSand support platforms without atomics like WASIThis change adds a new macro
U_HAVE_ATOMICSto ICU. This macro is always defined to 1 except for platforms that do not support atomic operations like WASI. Due to the lack of threading support in WASI, the wasi-sdk does not providestd::atomic,std::mutex, and related types, so the new macro is used to disable the use of these types.
Checklist
- [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22838
- [x] Required: The PR title must be prefixed with a JIRA Issue number.
- [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- [x] Required: Each commit message must be prefixed with a JIRA Issue number.
- [x] Issue accepted (done by Technical Committee after discussion)
- [ ] Tests included, if applicable
- [ ] API docs and/or User Guide docs changed or added, if applicable
Notice: the branch changed across the force-push!
- icu4c/source/common/putilimp.h is different
- icu4c/source/common/unicode/platform.h is now changed in the branch
- icu4c/source/i18n/unicode/numberrangeformatter.h is different
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/i18n/number_mapper.h is different
- icu4c/source/i18n/numrange_fluent.cpp is different
~ Your Friendly Jira-GitHub PR Checker Bot
@srl295 @sffc could you help review these changes?
Please rebase on current main and resolve conflicts. Note that the autoconf scripts have been updated in another PR. I will try to be better about nudging pull requests along...
Notice: the branch changed across the force-push!
- icu4c/source/common/unicode/platform.h is different
- icu4c/source/common/unifiedcache.cpp is different
- icu4c/source/configure is different
- icu4c/source/i18n/numrange_fluent.cpp is different
~ Your Friendly Jira-GitHub PR Checker Bot
@markusicu Thanks for updating autoconf stuff. I rebased the branch on the latest main.
Notice: the branch changed across the force-push!
- icu4c/source/common/umutex.cpp is different
- icu4c/source/common/umutex.h is different
- icu4c/source/common/unifiedcache.cpp is different
~ Your Friendly Jira-GitHub PR Checker Bot
@kateinoigakukun please don't squash any more until this PR is approved, to help GitHub track comments.
@aheninger Thank you for your feedback. They totally make sense to me. Addressed them in 008b01c3c60467699950c7e65b82e8131a74dd90
@markusicu Oops, sorry. I'll keep separate fixup commits for further changes.
Hooray! The files in the branch are the same across the force-push. 😃
~ Your Friendly Jira-GitHub PR Checker Bot
Are you aware of ICU4X, which we've built from the ground up to work well with WebAssembly? It is no_std so doesn't even need WASI in order to build.
sffc wrote:
I'm not a fan of U_HAVE_ATOMICS poking so many holes in the ICU4C code base.
Me neither. I was assuming getting ICU running in this environment was considered worth the added complexity.
Thanks @sffc for your feedback
Are you aware of ICU4X, which we've built from the ground up to work well with WebAssembly? It is no_std so doesn't even need WASI in order to build.
Yes, I'm aware of the ground-up work. For our needs, because the Foundation project heavily depends on ICU4C, the migration might not be trivial and take a long time to get all things stable (e.g. API migrations especially for APIs without no drop-in replacement like IDNA, build system integration with Rust toolchain, and so on). At least, it's hard to reason the migration decision just for WASI platform support. But I personally agree it might be a direction we should take in the long term for various other advantages of ICU4X.
It looks like wasi-threads is still stage 1; is that why you can't use it?
First, wasi-threads is a proposal to provide thread creation APIs and not to provide synchronization primitives. And also the proposal is considered legacy and to be replaced with another proposal (but no implementation for the future proposal yet)
The synchronization primitives are provided by a core spec proposal WebAssembly/threads.
Under the circumstances, WASI-SDK currently supports two platform variations:
- Only WASI Preview1
- Pre-compiled libc and libcxx are compiled without
WebAssembly/threadsfeature, no synchroniztion primitive APIs
- Pre-compiled libc and libcxx are compiled without
- WASI Preview1 + wasi-threads extension.
- Pre-compiled libc and libcxx provide synchronization primive APIs
Given that wasi-threads is today consiered legacy, both of them don't satisfy the needs here.
So what we want here is synchrnoization primitives but not thread creation (WebAssembly/threads enabled but wasi-threads disabled). However, both WASI-SDK and even libcxx do not support such platform, "synchronization primitives available but no thread creation" for now.
I understand your concern about poking holes, and it's a typical pain point when porting existing software to WebAssembly. To fill the gap, there is an ongoing effort to provide stubs for synchronization primitive APIs without actual synchronization operations (https://github.com/WebAssembly/wasi-libc/pull/518) in wasi-libc and WASI-SDK side. However, it requires some more effort to bring libcxx's std::mutex and std::atomic, so we don't have a fixed timeline in this direction.
A better solution would be to compile only a subset of ICU4C that doesn't use atomics (I know of clients already doing this), and anything that needs atomics just doesn't get compiled.
Yes, we considered that direction, however the Foundation projects depends on many ICU features including formatters, locales, and timezones and they all use umutex.h, so it's not a realistic approach for us unfortunately.
Thanks @sffc for your feedback
Are you aware of ICU4X, which we've built from the ground up to work well with WebAssembly? It is no_std so doesn't even need WASI in order to build.
Yes, I'm aware of the ground-up work. For our needs, because the Foundation project heavily depends on ICU4C, the migration might not be trivial and take a long time to get all things stable (e.g. API migrations especially for APIs without no drop-in replacement like IDNA, build system integration with Rust toolchain, and so on).
From browsing the project, it looks like most calls of ICU go through here:
https://github.com/swiftlang/swift-corelibs-foundation/blob/main/Sources/CoreFoundation/internalInclude/CFICULogging.h
The bulk of the functions in that list have ICU4X equivalents, including calendrical calculations, datetime formatting, custom datetime patterns and symbols, number formatting, relative datetime formatting (experimental in ICU4X), and list formatting. The only function in that list that sticks out where there's not yet an equivalent is number parsing. You mentioned IDNA; I don't see IDNA in that list, but ICU4X aims to export at least the primitives for IDNA (https://github.com/unicode-org/icu4x/issues/2614).
I can't comment on your concern about Rust toolchains except to observe that other projects with large amounts of legacy C/C++ code and complex build systems have adopted it in the last two years (e.g. Chromium). Having worked on ICU4C for 5 years before working on ICU4X for 3 years, I feel much more confident in Rust's ability to deliver fast, reliable code in the i18n space.
At least, it's hard to reason the migration decision just for WASI platform support. But I personally agree it might be a direction we should take in the long term for various other advantages of ICU4X.
I don't think a migration would be too difficult, though it would be most successful with a Foundation contributor driving the process in collaboration with the ICU4X team who could fill in any potential gaps through the process.
My most recent talk on ICU4X, at UTW 2023, goes over some of the metrics, showing how the team has delivered better code size, memory usage, and performance in most components, especially formatters, which seems to be one of the main components you use in Foundation.
I hope to be giving off a healthy sense of enthusiasm in being a point of contact if you decide to go this route.
I understand your concern about poking holes, and it's a typical pain point when porting existing software to WebAssembly. To fill the gap, there is an ongoing effort to provide stubs for synchronization primitive APIs without actual synchronization operations (WebAssembly/wasi-libc#518) in wasi-libc and WASI-SDK side. However, it requires some more effort to bring libcxx's
std::mutexandstd::atomic, so we don't have a fixed timeline in this direction.
I see; so it sounds like there is likely room to continue making progress in this space in WASI itself. Do you think once WASI lands support for std::mutex and std::atomic, you might be able to revert this workaround in ICU4C?
Also, it seems like there should already exist polyfills for std::mutex and std::atomic. Can we land a copy of those in ICU4C (assuming compatible licenses) to avoid reinventing the wheel?
However, it requires some more effort to bring libcxx's
std::mutexandstd::atomic, so we don't have a fixed timeline in this direction.
fwiw, if https://github.com/WebAssembly/wasi-libc/pull/518 manages to achieve consensus and get merged, the only thing that has to be done to enable std::mutex and std::atomic is changing a few lines of cmake to ON in WASI-SDK (achieving consensus on that may or may not require yet another discussion, but it's not a difficult technical problem)
fwiw, if https://github.com/WebAssembly/wasi-libc/pull/518 manages to achieve consensus and get merged, the only thing that has to be done to enable std::mutex and std::atomic is changing a few lines of cmake to ON in WASI-SDK (achieving consensus on that may or may not require yet another discussion, but it's not a difficult technical problem)
This is probably the best option as far as upstreaming changes to ICU4C.
I'll back to this once a new wasi-sdk including the pthread stub +libcxx atomic support will be shipped.
Notice: the branch changed across the force-push!
- icu4c/source/acinclude.m4 is different
- icu4c/source/common/putilimp.h is different
- icu4c/source/common/umutex.cpp is no longer changed in the branch
- icu4c/source/common/umutex.h is no longer changed in the branch
- icu4c/source/common/unicode/platform.h is no longer changed in the branch
- icu4c/source/common/unifiedcache.cpp is no longer changed in the branch
- icu4c/source/configure is different
- icu4c/source/i18n/decimfmt.cpp is no longer changed in the branch
- icu4c/source/i18n/number_mapper.h is no longer changed in the branch
- icu4c/source/i18n/numrange_fluent.cpp is no longer changed in the branch
- icu4c/source/i18n/unicode/numberrangeformatter.h is no longer changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- .github/workflows/icu4c.yml is different
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- .github/workflows/icu4c.yml is different
~ Your Friendly Jira-GitHub PR Checker Bot
Hi @sffc @srl295, I updated my patch based on recent WASI SDK improvements.
- https://github.com/unicode-org/icu/pull/3067/commits/06d4d9ed360b00c08931f6849fdedce8ddb436a9 Thanks to @ArcaneNibble and @whitequark's works,
std::mutexandstd::atomicare now available forwasm32-unknown-wasip1target, so the patch is now significantly simplified. (See https://github.com/WebAssembly/wasi-sdk/issues/546 for the background) - https://github.com/unicode-org/icu/pull/3067/commits/a76738531dc0960f13f69db831e245dbb8a79a57 Added a new CI job to ensure a build with WASI-SDK is healthy
@kateinoigakukun i tried the patches on the main, and it complains regarding U_NL_LANGINFO_CODESET in putil.cpp
However adding -DU_HAVE_NL_LANGINFO_CODESET=0 to CXXFLAGS, got rid of the usage and then everything builds fine. Thanks for the work on the PR.
@gb-jos Thank you for trying this :) I locally rebased on the top of the latest main with wasi sdk 27 but I couldn't reproduce it 🤔
@sffc @srl295 Friendly reminder in case this got buried, no rush, just wanted to keep it on your radar!
@sffc @srl295 Friendly reminder in case this got buried, no rush, just wanted to keep it on your radar!
Can you sign the cla ?
Notice: the branch changed across the force-push!
- .github/workflows/icu4c.yml is different
- icu4c/source/configure is different
~ Your Friendly Jira-GitHub PR Checker Bot
@srl295 Thanks! Signed CLA and squashed commits to pass single-commit check.