icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22838 Add WebAssembly/WASI cross-compilation support

Open kateinoigakukun opened this issue 1 year ago • 17 comments

Status: Blocked by std::mutex and std::atomic availability in WASI-SDK. (https://github.com/WebAssembly/wasi-sdk/pull/498 might be a solution we need but still need some discussions)

With this patch, we can cross-compile ICU4C with wasi-sdk as follows:

$ mkdir tools
$ mkdir cross
$ cd tools
$ ../../icu/icu4c/source/configure && make
$ cd ../cross
$ ../../icu/icu4c/source/configure --host=wasm32-unknown-wasi \
            --with-cross-build="$PWD/../tools" \
            --enable-static --disable-shared \
            --disable-tools --disable-tests \
            --disable-samples --disable-extras \
            --with-data-packaging=static \
            --prefix / \
            CC="$WASI_SDK_PATH/bin/clang" CXX="$WASI_SDK_PATH/bin/clang++" \
            AR="$WASI_SDK_PATH/bin/llvm-ar" RANLIB="$WASI_SDK_PATH/bin/llvm-ranlib" \
            CFLAGS="--sysroot $WASI_SDK_PATH/share/wasi-sysroot -D_WASI_EMULATED_SIGNAL" \
            CXXFLAGS="-fno-exceptions --sysroot $WASI_SDK_PATH/share/wasi-sysroot -D_WASI_EMULATED_SIGNAL"
$ make PKGDATA_OPTS="--without-assembly -O $PWD/data/icupkg.inc" -j16
Sanity check
$ cat sample.c
#include <stdio.h>
#include <unicode/ucnv.h>
#include <unicode/ustring.h>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *cnv = ucnv_open("UTF-8", &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "ucnv_open failed: %s\n", u_errorName(status));
        return 1;
    }

    char buffer[1024];
    int length = fread(buffer, 1, sizeof(buffer), stdin);
    if (length < 0) {
        fprintf(stderr, "fread failed\n");
        return 1;
    }

    UChar uchars[1024];
    int32_t uchars_length = ucnv_toUChars(cnv, uchars, sizeof(uchars) / sizeof(UChar), buffer, length, &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "ucnv_toUChars failed: %s\n", u_errorName(status));
        return 1;
    }

    int32_t char_count = u_countChar32(uchars, uchars_length);
    printf("%d\n", char_count);

    return 0;
}

$ $WASI_SDK_PATH/bin/clang sample.c -I ../install/include/ -L ../install/lib/ -licuuc -licudata
$ echo "Pokémon" | wasmtime ./a.out
8

Patch overview

  • Add config/mh-wasi and update configure script.

    WebAssembly/WASI does not support threads and dynamic linking, so the new mh-wasi file disables those features in ICU. The new config file mh-wasi is used when the target matches *-wasi*, which includes wasm32-unknown-wasi, wasm32-unknown-wasip1, and etc.

  • Teach platform.h not to define U_TZSET, U_TIMEZONE, and U_TZNAME for WASI

    WASI does not define timezone-related functionalities, and wasi-libc does not provide tzset, timezone, and tzname. This change teaches platform.h not to define U_TZSET, U_TIMEZONE, and U_TZNAME for WASI.

  • Add U_HAVE_ATOMICS and support platforms without atomics like WASI

    This change adds a new macro U_HAVE_ATOMICS to ICU. This macro is always defined to 1 except for platforms that do not support atomic operations like WASI. Due to the lack of threading support in WASI, the wasi-sdk does not provide std::atomic, std::mutex, and related types, so the new macro is used to disable the use of these types.

Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22838
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [ ] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

kateinoigakukun avatar Jul 25 '24 17:07 kateinoigakukun

Notice: the branch changed across the force-push!

  • icu4c/source/common/putilimp.h is different
  • icu4c/source/common/unicode/platform.h is now changed in the branch
  • icu4c/source/i18n/unicode/numberrangeformatter.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/number_mapper.h is different
  • icu4c/source/i18n/numrange_fluent.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@srl295 @sffc could you help review these changes?

markusicu avatar Aug 01 '24 16:08 markusicu

Please rebase on current main and resolve conflicts. Note that the autoconf scripts have been updated in another PR. I will try to be better about nudging pull requests along...

markusicu avatar Sep 23 '24 20:09 markusicu

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/platform.h is different
  • icu4c/source/common/unifiedcache.cpp is different
  • icu4c/source/configure is different
  • icu4c/source/i18n/numrange_fluent.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu Thanks for updating autoconf stuff. I rebased the branch on the latest main.

kateinoigakukun avatar Sep 23 '24 21:09 kateinoigakukun

Notice: the branch changed across the force-push!

  • icu4c/source/common/umutex.cpp is different
  • icu4c/source/common/umutex.h is different
  • icu4c/source/common/unifiedcache.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@kateinoigakukun please don't squash any more until this PR is approved, to help GitHub track comments.

markusicu avatar Sep 23 '24 23:09 markusicu

@aheninger Thank you for your feedback. They totally make sense to me. Addressed them in 008b01c3c60467699950c7e65b82e8131a74dd90

@markusicu Oops, sorry. I'll keep separate fixup commits for further changes.

kateinoigakukun avatar Sep 23 '24 23:09 kateinoigakukun

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Are you aware of ICU4X, which we've built from the ground up to work well with WebAssembly? It is no_std so doesn't even need WASI in order to build.

sffc avatar Sep 24 '24 03:09 sffc

sffc wrote:

I'm not a fan of U_HAVE_ATOMICS poking so many holes in the ICU4C code base.

Me neither. I was assuming getting ICU running in this environment was considered worth the added complexity.

aheninger avatar Sep 24 '24 04:09 aheninger

Thanks @sffc for your feedback

Are you aware of ICU4X, which we've built from the ground up to work well with WebAssembly? It is no_std so doesn't even need WASI in order to build.

Yes, I'm aware of the ground-up work. For our needs, because the Foundation project heavily depends on ICU4C, the migration might not be trivial and take a long time to get all things stable (e.g. API migrations especially for APIs without no drop-in replacement like IDNA, build system integration with Rust toolchain, and so on). At least, it's hard to reason the migration decision just for WASI platform support. But I personally agree it might be a direction we should take in the long term for various other advantages of ICU4X.

It looks like wasi-threads is still stage 1; is that why you can't use it?

First, wasi-threads is a proposal to provide thread creation APIs and not to provide synchronization primitives. And also the proposal is considered legacy and to be replaced with another proposal (but no implementation for the future proposal yet)

The synchronization primitives are provided by a core spec proposal WebAssembly/threads.

Under the circumstances, WASI-SDK currently supports two platform variations:

  1. Only WASI Preview1
    • Pre-compiled libc and libcxx are compiled without WebAssembly/threads feature, no synchroniztion primitive APIs
  2. WASI Preview1 + wasi-threads extension.
    • Pre-compiled libc and libcxx provide synchronization primive APIs

Given that wasi-threads is today consiered legacy, both of them don't satisfy the needs here.

So what we want here is synchrnoization primitives but not thread creation (WebAssembly/threads enabled but wasi-threads disabled). However, both WASI-SDK and even libcxx do not support such platform, "synchronization primitives available but no thread creation" for now.

I understand your concern about poking holes, and it's a typical pain point when porting existing software to WebAssembly. To fill the gap, there is an ongoing effort to provide stubs for synchronization primitive APIs without actual synchronization operations (https://github.com/WebAssembly/wasi-libc/pull/518) in wasi-libc and WASI-SDK side. However, it requires some more effort to bring libcxx's std::mutex and std::atomic, so we don't have a fixed timeline in this direction.

A better solution would be to compile only a subset of ICU4C that doesn't use atomics (I know of clients already doing this), and anything that needs atomics just doesn't get compiled.

Yes, we considered that direction, however the Foundation projects depends on many ICU features including formatters, locales, and timezones and they all use umutex.h, so it's not a realistic approach for us unfortunately.

kateinoigakukun avatar Sep 25 '24 00:09 kateinoigakukun

Thanks @sffc for your feedback

Are you aware of ICU4X, which we've built from the ground up to work well with WebAssembly? It is no_std so doesn't even need WASI in order to build.

Yes, I'm aware of the ground-up work. For our needs, because the Foundation project heavily depends on ICU4C, the migration might not be trivial and take a long time to get all things stable (e.g. API migrations especially for APIs without no drop-in replacement like IDNA, build system integration with Rust toolchain, and so on).

From browsing the project, it looks like most calls of ICU go through here:

https://github.com/swiftlang/swift-corelibs-foundation/blob/main/Sources/CoreFoundation/internalInclude/CFICULogging.h

The bulk of the functions in that list have ICU4X equivalents, including calendrical calculations, datetime formatting, custom datetime patterns and symbols, number formatting, relative datetime formatting (experimental in ICU4X), and list formatting. The only function in that list that sticks out where there's not yet an equivalent is number parsing. You mentioned IDNA; I don't see IDNA in that list, but ICU4X aims to export at least the primitives for IDNA (https://github.com/unicode-org/icu4x/issues/2614).

I can't comment on your concern about Rust toolchains except to observe that other projects with large amounts of legacy C/C++ code and complex build systems have adopted it in the last two years (e.g. Chromium). Having worked on ICU4C for 5 years before working on ICU4X for 3 years, I feel much more confident in Rust's ability to deliver fast, reliable code in the i18n space.

At least, it's hard to reason the migration decision just for WASI platform support. But I personally agree it might be a direction we should take in the long term for various other advantages of ICU4X.

I don't think a migration would be too difficult, though it would be most successful with a Foundation contributor driving the process in collaboration with the ICU4X team who could fill in any potential gaps through the process.

My most recent talk on ICU4X, at UTW 2023, goes over some of the metrics, showing how the team has delivered better code size, memory usage, and performance in most components, especially formatters, which seems to be one of the main components you use in Foundation.

I hope to be giving off a healthy sense of enthusiasm in being a point of contact if you decide to go this route.

I understand your concern about poking holes, and it's a typical pain point when porting existing software to WebAssembly. To fill the gap, there is an ongoing effort to provide stubs for synchronization primitive APIs without actual synchronization operations (WebAssembly/wasi-libc#518) in wasi-libc and WASI-SDK side. However, it requires some more effort to bring libcxx's std::mutex and std::atomic, so we don't have a fixed timeline in this direction.

I see; so it sounds like there is likely room to continue making progress in this space in WASI itself. Do you think once WASI lands support for std::mutex and std::atomic, you might be able to revert this workaround in ICU4C?

Also, it seems like there should already exist polyfills for std::mutex and std::atomic. Can we land a copy of those in ICU4C (assuming compatible licenses) to avoid reinventing the wheel?

sffc avatar Sep 25 '24 02:09 sffc

However, it requires some more effort to bring libcxx's std::mutex and std::atomic, so we don't have a fixed timeline in this direction.

fwiw, if https://github.com/WebAssembly/wasi-libc/pull/518 manages to achieve consensus and get merged, the only thing that has to be done to enable std::mutex and std::atomic is changing a few lines of cmake to ON in WASI-SDK (achieving consensus on that may or may not require yet another discussion, but it's not a difficult technical problem)

ArcaneNibble avatar Sep 25 '24 18:09 ArcaneNibble

fwiw, if https://github.com/WebAssembly/wasi-libc/pull/518 manages to achieve consensus and get merged, the only thing that has to be done to enable std::mutex and std::atomic is changing a few lines of cmake to ON in WASI-SDK (achieving consensus on that may or may not require yet another discussion, but it's not a difficult technical problem)

This is probably the best option as far as upstreaming changes to ICU4C.

srl295 avatar Nov 29 '24 21:11 srl295

I'll back to this once a new wasi-sdk including the pthread stub +libcxx atomic support will be shipped.

kateinoigakukun avatar Nov 30 '24 02:11 kateinoigakukun

Notice: the branch changed across the force-push!

  • icu4c/source/acinclude.m4 is different
  • icu4c/source/common/putilimp.h is different
  • icu4c/source/common/umutex.cpp is no longer changed in the branch
  • icu4c/source/common/umutex.h is no longer changed in the branch
  • icu4c/source/common/unicode/platform.h is no longer changed in the branch
  • icu4c/source/common/unifiedcache.cpp is no longer changed in the branch
  • icu4c/source/configure is different
  • icu4c/source/i18n/decimfmt.cpp is no longer changed in the branch
  • icu4c/source/i18n/number_mapper.h is no longer changed in the branch
  • icu4c/source/i18n/numrange_fluent.cpp is no longer changed in the branch
  • icu4c/source/i18n/unicode/numberrangeformatter.h is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • .github/workflows/icu4c.yml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Notice: the branch changed across the force-push!

  • .github/workflows/icu4c.yml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Hi @sffc @srl295, I updated my patch based on recent WASI SDK improvements.

  • https://github.com/unicode-org/icu/pull/3067/commits/06d4d9ed360b00c08931f6849fdedce8ddb436a9 Thanks to @ArcaneNibble and @whitequark's works, std::mutex and std::atomic are now available for wasm32-unknown-wasip1 target, so the patch is now significantly simplified. (See https://github.com/WebAssembly/wasi-sdk/issues/546 for the background)
  • https://github.com/unicode-org/icu/pull/3067/commits/a76738531dc0960f13f69db831e245dbb8a79a57 Added a new CI job to ensure a build with WASI-SDK is healthy

kateinoigakukun avatar Jul 28 '25 05:07 kateinoigakukun

@kateinoigakukun i tried the patches on the main, and it complains regarding U_NL_LANGINFO_CODESET in putil.cpp Screenshot 2025-10-24 at 18 47 30

However adding -DU_HAVE_NL_LANGINFO_CODESET=0 to CXXFLAGS, got rid of the usage and then everything builds fine. Thanks for the work on the PR.

gb-jos avatar Oct 24 '25 17:10 gb-jos

@gb-jos Thank you for trying this :) I locally rebased on the top of the latest main with wasi sdk 27 but I couldn't reproduce it 🤔

kateinoigakukun avatar Oct 25 '25 07:10 kateinoigakukun

@sffc @srl295 Friendly reminder in case this got buried, no rush, just wanted to keep it on your radar!

kateinoigakukun avatar Oct 25 '25 08:10 kateinoigakukun

@sffc @srl295 Friendly reminder in case this got buried, no rush, just wanted to keep it on your radar!

Can you sign the cla ?

srl295 avatar Oct 25 '25 14:10 srl295

Notice: the branch changed across the force-push!

  • .github/workflows/icu4c.yml is different
  • icu4c/source/configure is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@srl295 Thanks! Signed CLA and squashed commits to pass single-commit check.

kateinoigakukun avatar Oct 26 '25 02:10 kateinoigakukun