ICU-7747 CMake Port

Open clemenswasser opened this issue 3 years ago • 35 comments

Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-7747
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [ ] Issue accepted (done by Technical Committee after discussion)
  • [x] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable
Summary

This adds CMake and CTest support to ICU4C. I haven't yet tested this on all platforms, and there is still an issue with the generation of icudt.dll that causes "ICU Initialization returned: U_FILE_ACCESS_ERROR" when running icuinfo and most tests; hence this is marked as a draft. Since I'm not very familiar with the icudt generation, I hope you can help me fix this issue.

clemenswasser avatar Nov 06 '22 12:11 clemenswasser

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Hi @markusicu @srl295, you both seem to be involved with the Jira ticket ICU-7747. Is there something I can do to move this PR forward? As I stated in the initial description, in its current form it already builds on some platforms, but there is still an issue with the icudt DLL generation that causes most tests to fail. I tried to find what's missing, but with my limited ICU expertise I didn't get very far debugging it, so I would love to get some feedback or suggestions from you. My current guess is that some sort of binary patching of the U_ICUDATA_ENTRY_POINT variable happens during the "makedata" step, which doesn't happen in my CMake port.

clemenswasser avatar Nov 17 '22 18:11 clemenswasser

Can you add a cmake install target as well? I'm looking to use this in my own build files but I need to be able to install ICU to a prefix/sysroot.

theoparis avatar Mar 15 '23 07:03 theoparis

Hi. By the way, as I noted in https://github.com/mesonbuild/wrapdb/issues/950#issuecomment-1490972572, there are sources.txt and dependencies files, which should be used for getting the source files. My cmake branch used them: gencmake.py

srl295 avatar Mar 30 '23 21:03 srl295

@clemenswasser ^ that link to gencmake.py goes to my old branch. And by old I mean old.

srl295 avatar Mar 30 '23 21:03 srl295

My current guess is that some sort of binary patching of the U_ICUDATA_ENTRY_POINT variable happens during the "makedata" step, which doesn't happen in my CMake port.

Here's how it works.

  1. source/stubdata is built into a library. It's like a data library but with no data.
  2. common and i18n and others are built, linked against stubdata
  3. now the tools can be built, linked against above
  4. now data can be generated. This creates a new library

Now the end user can link against shared libraries: common + i18n + data (with full data).
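
For anyone porting this to another build system, the ordering above can be sketched as a small dependency graph. This is only an illustration: the target names are labels for the four steps rather than real ICU build targets, and the exact edges (i18n after common, tools after both) are an assumption about how the steps chain together.

```python
from graphlib import TopologicalSorter

# Each key must be built after every target in its value set.
# Names are illustrative labels for the bootstrap steps, not real targets.
BOOTSTRAP_DEPS = {
    "stubdata": set(),            # step 1: data library with no data
    "common": {"stubdata"},       # step 2: linked against stubdata
    "i18n": {"common"},           # step 2 (assumed to follow common)
    "tools": {"common", "i18n"},  # step 3: tools link against the libraries
    "data": {"tools"},            # step 4: tools generate the real data library
}


def bootstrap_order(deps):
    """Return a build order in which every target follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(" -> ".join(bootstrap_order(BOOTSTRAP_DEPS)))
```

The point of the graph is that the data library sits at the very end: nothing the tools need can depend on it, which is why stubdata exists.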

srl295 avatar Mar 30 '23 21:03 srl295

  1. source/stubdata is built into a library. It's like a data library but with no data.
  2. common and i18n and others are built, linked against stubdata
  3. now the tools can be built, linked against above
  4. now data can be generated. This creates a new library

Is this the reason why ICU needs a pre-compilation in the host architecture for cross-compiling? It's something I really hate about the current build system of ICU.

ceztko avatar Mar 30 '23 21:03 ceztko

@ceztko this is for a bootstrap from source (i.e. git). I hate doing rework too, but what else would you do?

If you have a release tarball, it has source/data/in/icudt72l.dat, which can be used directly (in mmap or fread mode). In that case, you can skip the tools part and just use that as your ICU data dir, or provide it another way. That may make more sense for cross compilation.

But, at least what it means is that if you have such a file, you can skip most of the tools, you only need enough to convert icudt72l.dat to libicudt.so. (That's how I have node.js do its icu build)

srl295 avatar Mar 30 '23 21:03 srl295

But, at least what it means is that if you have such a file, you can skip most of the tools, you only need enough to convert icudt72l.dat to libicudt.so. (That's how I have node.js do its icu build) [...] but what else would you do?

I would depend on a widely available scripting language (e.g. Python) to generate the compilation units. Maybe that's what you were already trying with gencmake.py?

ceztko avatar Mar 30 '23 21:03 ceztko

These are two different issues. gencmake.py is for the code structure. It's a way to slice up which .cpp's need to be included with finer granularity than the library.

The reason for the double build in the case of cross platform has to do with the original data, which needs ICU source to run. It's really a bootstrap issue.

Converting a binary .dat file to a .so with a large pure text segment could be done a number of ways. There's probably some other tool out there to do it by now. (Now as in, not 2002)

The dependencies among the data generators themselves are already implemented in python, that's BUILDRULES.py

srl295 avatar Mar 30 '23 22:03 srl295

The reason for the double build in the case of cross platform has to do with the original data, which needs ICU source to run. It's really a bootstrap issue.

Yes, I understood. As you say, it would be easier to trim the pre-compilation when the .dat is available, which would make life easier for most people who just want to use ICU. Anyway, I would very much welcome a CMake port, with or without the pre-build step, because it makes cross-compilation much easier in general.

ceztko avatar Mar 30 '23 22:03 ceztko

The reason for the double build in the case of cross platform has to do with the original data, which needs ICU source to run. It's really a bootstrap issue.

Yes, I understood. As you say, it would be easier to trim the pre-compilation when the .dat is available, which would make life easier for most people who just want to use ICU. Anyway, I would very much welcome a CMake port, with or without the pre-build step, because it makes cross-compilation much easier in general.

the .dat file is already there in the .tgz and .zip files. And for that matter, building data should be able to use the python script for the data deps.

srl295 avatar Mar 30 '23 22:03 srl295

Is this the reason why ICU needs a pre-compilation in the host architecture for cross-compiling? It's something I really hate about the current build system of ICU.

As a matter of curiosity -- any particular challenge this causes, other than the general need to possess both a host and a build compiler?

eli-schwartz avatar Mar 31 '23 10:03 eli-schwartz

Notice: the branch changed across the force-push!

  • icu4c/source/CMakeLists.txt is different
  • icu4c/source/common/CMakeLists.txt is different
  • icu4c/source/data/makedata.mak is different
  • icu4c/source/i18n/CMakeLists.txt is different
  • icu4c/source/io/CMakeLists.txt is different
  • icu4c/source/stubdata/CMakeLists.txt is different
  • icu4c/source/test/intltest/CMakeLists.txt is different
  • icu4c/source/tools/ctestfw/CMakeLists.txt is different
  • icu4c/source/tools/genbrk/CMakeLists.txt is different
  • icu4c/source/tools/genccode/CMakeLists.txt is different
  • icu4c/source/tools/gencfu/CMakeLists.txt is different
  • icu4c/source/tools/gencmn/CMakeLists.txt is different
  • icu4c/source/tools/gencnval/CMakeLists.txt is different
  • icu4c/source/tools/gendict/CMakeLists.txt is different
  • icu4c/source/tools/gennorm2/CMakeLists.txt is different
  • icu4c/source/tools/genrb/CMakeLists.txt is different
  • icu4c/source/tools/gensprep/CMakeLists.txt is different
  • icu4c/source/tools/gentest/CMakeLists.txt is different
  • icu4c/source/tools/icuexportdata/CMakeLists.txt is different
  • icu4c/source/tools/icuinfo/CMakeLists.txt is different
  • icu4c/source/tools/icupkg/CMakeLists.txt is different
  • icu4c/source/tools/icuswap/CMakeLists.txt is different
  • icu4c/source/tools/makeconv/CMakeLists.txt is different
  • icu4c/source/tools/pkgdata/CMakeLists.txt is different
  • icu4c/source/tools/toolutil/CMakeLists.txt is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

As a matter of curiosity -- any particular challenge this causes, other than the general need to possess both a host and a build compiler?

This was done at the time of ICU 59, so some time has passed and things may have changed. Some things may also be incorrect. Look at the following:

# Needed for ExternalProject_Add below
include(ExternalProject)

if (ANDROID OR IOS)
    set(ICU_CROSS_BUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/icu_cross-prefix/src/icu_cross/source")
    set(ICU_CROSS_FLAGS "--with-cross-build=${ICU_CROSS_BUILD_DIR}")
    set(ENABLE_ICU_TOOLS no)
else()
    # Non-cross builds need the tools compiled
    set(ENABLE_ICU_TOOLS yes)
endif()

ExternalProject_Add(icu
    URL "${ARCHIVE_PATH}/icu4c-59_2-src.tgz"
    PATCH_COMMAND ${PATCH_BIN} -p1 < "${PATCHES_PATH}/icu.patch"
    CONFIGURE_COMMAND cd source
        && ./configure "--prefix=${DEPS_OUTPUT_PATH}" ${CONFIGURE_FLAGS}
            --enable-static=yes
            --enable-shared=no
            --enable-extras=no
            --enable-strict=no
            --enable-icuio=no
            --enable-layout=no
            --enable-layoutex=no
            --enable-tests=no
            --enable-samples=no
            --enable-tools=${ENABLE_ICU_TOOLS}
            --enable-dyload=no
            --with-data-packaging=archive
            ${ICU_CROSS_FLAGS}
    BUILD_COMMAND cd source && make
    INSTALL_COMMAND cd source && make install
    BUILD_IN_SOURCE 1
    EXCLUDE_FROM_ALL TRUE
)

if (ANDROID OR IOS)
    # ICU cross build tree, needed just to build cross compiled archs
    ExternalProject_Add(icu_cross
        URL "${ARCHIVE_PATH}/icu4c-59_2-src.tgz"
        PATCH_COMMAND ${PATCH_BIN} -p1 < "${PATCHES_PATH}/icu.patch"
        CONFIGURE_COMMAND cd source
            && ./runConfigureICU Linux
                CFLAGS=-Os
                CXXFLAGS=--std=c++11
                --enable-static=yes
                --enable-shared=no
                --enable-extras=no
                --enable-strict=no
                --enable-icuio=no
                --enable-layout=no
                --enable-layoutex=no
                --enable-tests=no
                --enable-samples=no
                --enable-dyload=no
        BUILD_COMMAND cd source && make
        INSTALL_COMMAND "" # No need to install the cross build tree
        BUILD_IN_SOURCE 1
        EXCLUDE_FROM_ALL TRUE
    )
    add_dependencies(icu icu_cross)
endif()

Suppose ${CONFIGURE_FLAGS} already contains the flags needed for cross compiling. I had to:

  • Use a different configure script (runConfigureICU) for the bootstrap build than for the regular compilation;
  • Specify different compilation flags (e.g. CXXFLAGS=--std=c++11, otherwise compilation failed);
  • Specify --with-cross-build correctly for the cross-compilation. The exact folder to specify was not immediately clear to me;
  • Specify a different --enable-tools value depending on cross/non-cross compilation;
  • Specify a different --with-data-packaging value depending on cross/non-cross compilation.

I discovered some of these things from the documentation, and others I only got right after some trial and error. In the end, I just want to use ICU, and I don't like spending too much time building and testing deps for a multitude of mobile architectures that need cross-compilation. Looking at it today, maybe it doesn't look that terrible, but I would really be relieved if ICU provided a way to avoid the bootstrap compilation entirely when the .dat is available.

ceztko avatar Mar 31 '23 15:03 ceztko

As a matter of curiosity -- any particular challenge this causes, other than the general need to possess both a host and a build compiler?

This was done at the time of ICU 59, so some time has passed and things may have changed. Some things may also be incorrect. Look at the following:

…
if (ANDROID OR IOS)
    # ICU cross build tree, needed just to build cross compiled archs
…
            && ./runConfigureICU Linux

So the 'host' cross tree above could build a subset of the tools and libs, and data. There are probably some compile options that could be added.

Suppose ${CONFIGURE_FLAGS} already contains the flags needed for cross compiling. I had to:

  • Use a different configure script (runConfigureICU) for the bootstrap build than for the regular compilation;
  • Specify different compilation flags (e.g. CXXFLAGS=--std=c++11, otherwise compilation failed);
  • Specify --with-cross-build correctly for the cross-compilation. The exact folder to specify was not immediately clear to me;
  • Specify a different --enable-tools value depending on cross/non-cross compilation;
  • Specify a different --with-data-packaging value depending on cross/non-cross compilation.

I discovered some of these things from the documentation, and others I only got right after some trial and error. In the end, I just want to use ICU, and I don't like spending too much time building and testing deps for a multitude of mobile architectures that need cross-compilation. Looking at it today, maybe it doesn't look that terrible, but I would really be relieved if ICU provided a way to avoid the bootstrap compilation entirely when the .dat is available.

I wrote the --with-cross-build stuff and the documentation, definitely would accept any improvements.

The only reason that icudata is a .so is for user convenience (at the expense of INconvenience of us builders, sorry :-) …  so that the system library loader can find ICU's data. But if you have another mechanism, that can be used too. Lots of other ways.


I hear you, you just want to use it. I think the only way you could avoid that 2nd ICU build is if there was some other tool to build the data .so from the binary file.

The other way is that if the target build uses "archive" data packaging, the cross build isn't needed: you just pass or load the .dat file on the target architecture, and ship with the stubdata DLL. That's the only thing the "host" ICU is used for.

Windows has .rc files, which seem like they can include data files into a DLL. Qt has something similar.

Actually, what might make sense is to do some major refactoring of the ICU tools, and make just the resource tooling build.

OK. I've convinced myself that building the tools without the rest of ICU would probably be too painful. However… There might be a way to have a #define switch, something like U_DAT_TOOLS_ONLY that turns off 90% of ICU and makes that host build more snappy, but ONLY handle the .dat-to-.so case, or maybe some simple file-level repackaging.

Ideally it should include genrb/derb also, for compiling package manifests (without them you can't add or remove locales). But that by default brings in most of ICU if it encounters collator data, etc etc. Again, there could be a switch that says, no, this is a limited genrb, it can't handle all possible ICU data.

srl295 avatar Mar 31 '23 16:03 srl295

Ah right, the cmake cross compile situation ugliness, how could I forget... You can cross build the tools you need, it's just painfully manual.

I'm used to doing this -- from the meson port for ICU that @srl295 cross referenced above:

genccode_exe = executable(
  'genccode',
  sources,
  dependencies: toolutil_dep,
  install: true,
)

if meson.can_run_host_binaries()
  genccode_native_exe = genccode_exe
else
  genccode_native_exe = executable(
    'genccode-native',
    'genccode.c',
    dependencies: toolutil_native_dep,
    native: true,
  )
endif

The native: true kwarg here is key. It says that you're building a tool that must be built by the build compiler, not the cross one; meson tries to find both for you. A native target can be run by other build steps but not installed, and can be linked to dependencies that are either looked up by dependency('...', native: true) or built internally with the same kwarg. toolutil_dep and toolutil_native_dep are defined the same way this native genccode is.

In the event that meson detects you can run cross-compiled binaries anyway, it doesn't bother to build additional copies of the installable genccode. This could happen, for example, when building mingw copies of ICU under Linux with wine defined as a cross wrapper, or when cross compiling from Linux x86-64 to aarch64 with qemu-user set as a cross wrapper.

You can do the same thing with autotools although it's not as simple to author the necessary configure/make bits. The basic trick is to compile the files with $(CC_FOR_BUILD).

eli-schwartz avatar Mar 31 '23 17:03 eli-schwartz

we'd all rather forget !

Also, you only need the non-native genccode if the end user wants to build/rebuild their own packages on the target arch. I would put that in a developer tools package if I were packaging icu4c, such as for a platform, separate from the library itself.

srl295 avatar Mar 31 '23 17:03 srl295

The only reason that icudata is a .so is for user convenience (at the expense of INconvenience of us builders, sorry :-) …  so that the system library loader can find ICU's data. But if you have another mechanism, that can be used too. Lots of other ways. I hear you, you just want to use it. I think the only way you could avoid that 2nd ICU build is if there was some other tool to build the data .so from the binary file.

I don't know exactly how icudata is turned into a .so. Sorry if I am saying something very obvious, but if it's just a single blob of data that must be available from a symbol at runtime (I really don't know), then I would create a shim header and optionally generate the source of the compilation unit, with the data defined in octal form as in the following sample:

const char s_srgb_icc[] = "\
\000\000\014\110\114\151\156\157\002\020\000\000\155\156\164\162\
\122\107\102\040\130\131\132\040\007\316\000\002\000\011\000\006\
...

There's no ubiquitous scripting language that can be picked for the task, but one may assume that this could be done in Python if a Python interpreter is found and a .dat file is specified. If requested, I can write the Python script to convert a blob to a cpp file, but I suspect that's not really the problem (unless you'd welcome a little help). Other methods may not be portable (e.g. to Windows), or may require non-ubiquitous utilities.
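
A minimal sketch of such a blob-to-source script, just to make the idea concrete. The function name, the per-line width, and the extra _size constant are illustrative choices, not anything ICU provides. It emits adjacent string literals rather than backslash line continuations, and it always writes three octal digits so that a digit following an escape cannot be absorbed into it. Note that a source file generated this way can be slow to compile for a large blob.

```python
def blob_to_c_source(data: bytes, symbol: str, per_line: int = 16) -> str:
    """Return C source defining `symbol` as a char array holding `data`."""
    out = [f"const char {symbol}[] ="]
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        # Three-digit octal escapes are unambiguous even when the next
        # byte's escape (or a literal digit) follows immediately.
        escaped = "".join(f"\\{byte:03o}" for byte in chunk)
        out.append(f'    "{escaped}"')
    if not data:
        out.append('    ""')
    out[-1] += ";"
    # The string literal appends a trailing NUL, so the real payload
    # size is recorded in a separate constant.
    out.append(f"const unsigned long {symbol}_size = {len(data)}UL;")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # First four bytes of the sample above: \000\000\014\110
    print(blob_to_c_source(bytes([0x00, 0x00, 0x0C, 0x48]), "s_srgb_icc"))
```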

ceztko avatar Mar 31 '23 19:03 ceztko

Other methods may be not portable (eg. in Windows), or require non ubiquitous utilities.

The epitome of a non ubiquitous utility would be a c23 compiler with support for #embed, but we can dream...

Note: before #embed there's no such thing as a portable approach that is also fast to build.

eli-schwartz avatar Mar 31 '23 19:03 eli-schwartz

c23…  that's available in Ubuntu 14.04 LTS right??

Seriously: Thanks, filed https://unicode-org.atlassian.net/browse/ICU-22343

@ceztko what you described is what the genccode utility does.

it generates assembly where it can, because that's a lot faster and doesn't run into compiler memory issues.

and on windows it goes directly from a .dat file to a PE because there are utilities for that.

srl295 avatar Mar 31 '23 19:03 srl295

@ceztko what you described is what the genccode utility does.

and on windows it goes directly from a .dat file to a PE because there are utilities for that.

I imagined you had a utility doing exactly that task, but again, this is a matter of not falling into a chicken-and-egg situation: to skip a build you'll have to do it with something more ubiquitous. You could, for example, use genccode where it is already installed on Unix platforms, and use PE utilities directly on Windows. Maybe this could be a good compromise?

ceztko avatar Mar 31 '23 20:03 ceztko

@ceztko Are there PE utilities which will do this? Yes. you could detect genccode (and other ICU utils) in the environment. Would take some rework but that could be possible. You need the ICU utils to be able to create the target environment's format.

I did find this https://stackoverflow.com/questions/4158900/embedding-resources-in-executable-using-gcc
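
Detecting already-installed ICU tools could look roughly like the sketch below. It only checks the search path; the tool list and the function names are illustrative, and a real check would also need to confirm that the tools found can emit the target environment's format.

```python
import shutil

# Tools a data-only build might look for; this list is an assumption.
ICU_TOOLS = ("genccode", "icupkg", "pkgdata")


def find_icu_tools(names=ICU_TOOLS, path=None):
    """Map each tool name to its full path, or None if it is not found."""
    return {name: shutil.which(name, path=path) for name in names}


def missing_tools(found):
    """Return the sorted names of tools that were not found."""
    return sorted(name for name, location in found.items() if location is None)


if __name__ == "__main__":
    found = find_icu_tools()
    absent = missing_tools(found)
    if absent:
        print("Need a host build for:", ", ".join(absent))
    else:
        print("All tools found:", found)
```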

srl295 avatar Mar 31 '23 20:03 srl295

Are there PE utilities which will do this?

Maybe I misunderstood a previous message of yours where you talked about the use of PE utilities? Anyway, assuming ICU utils can be easily installed on Unix, which I think is true in most situations, an alternative on Windows may be the Windows Resource Compiler. I can have a look at whether it fits: with CMake it's certainly quite easy to use .rc files with MSVC as the compiler (I've done it multiple times), but I've never tried to use them to embed files.

ceztko avatar Mar 31 '23 20:03 ceztko

Are there PE utilities which will do this?

Maybe I misunderstood a previous message of yours where you talked about use of PE utilities?

I guess it's Windows and also ELF on linux. search icu codebase for CAN_GENERATE_OBJECTS - genccode writes directly to a .o file, which is then linkable.

Anyway, assuming ICU utils can be easily installed on Unix, which I think is true in most situations, an alternative on Windows may be the Windows Resource Compiler. I can have a look at whether it fits: with CMake it's certainly quite easy to use .rc files with MSVC as the compiler (I've done it multiple times), but I've never tried to use them to embed files.

Would it be happy with a 30Mb .dat file though? not sure

srl295 avatar Mar 31 '23 20:03 srl295

Ultimately the goal here is, I think, to explicitly eschew the use of official solutions such as producing a C source file containing octal data. Because... it can get extremely slow depending on compiler, and is never really fast. With MSVC it's a common experience to simply run out of memory. The task starts getting hopeless past like 4mb.

Instead, sideloading your own objects can be extremely advantageous for build speed. It's also why #embed is so useful (or will be once compilers implement it, now that we're guaranteed by the newest standard that it shall exist).

Here's some discussion on it by the author: https://thephd.dev/full-circle-embed#speed-and-space-is-everything https://thephd.dev/embed-the-details https://thephd.dev/finally-embed-in-c23

eli-schwartz avatar Mar 31 '23 21:03 eli-schwartz

Ultimately the goal here is, I think, to explicitly eschew the use of official solutions such as producing a C source file containing octal data. Because... it can get extremely slow depending on compiler, and is never really fast. With MSVC it's a common experience to simply run out of memory. The task starts getting hopeless past like 4mb.

but to be clear: a c source file with octal data is only done when no other mechanism is available… i.e. writing a .o file (linux elf, windows PE), generating asm (most others…)

So yes it's super slow. But it hasn't been common for maybe 15 years here.

Instead, sideloading your own objects can be extremely advantageous for build speed. It's also why #embed is so useful (or will be once compilers implement it, now that we're guaranteed by the newest standard that it shall exist).

Here's some discussion on it by the author: https://thephd.dev/full-circle-embed#speed-and-space-is-everything https://thephd.dev/embed-the-details https://thephd.dev/finally-embed-in-c23

i saw that one also. if there are any compilers which support this, then we could start to have a PR to allow it when available.

srl295 avatar Mar 31 '23 21:03 srl295

I can have a look if it fits enough: for sure with CMake it's quite easy to use .rc files using MSVC as the compiler (I did it multiple times), but I never tried to use them to embed files.

Would it be happy with a 30Mb .dat file though? not sure

I just tried and yes it handles it and it's very fast. Give me a second to craft a full example.

ceztko avatar Mar 31 '23 21:03 ceztko

I can have a look if it fits enough: for sure with CMake it's quite easy to use .rc files using MSVC as the compiler (I did it multiple times), but I never tried to use them to embed files.

Would it be happy with a 30Mb .dat file though? not sure

I just tried and yes it handles it and it's very fast. Give me a second to craft a full example.

that's great and definitely simpler.

srl295 avatar Mar 31 '23 22:03 srl295

Ok, you can find it attached here: TestResourcesWin.zip [EDIT: Updated the zip]

On Windows I just do the usual:

md build
cd build
cmake ..

The resource is declared with a simple Resources.rc

#include <winres.h>

MyRes UserResources data.res

That loads data.res, which is just the string "Hello World" followed by 30 MB of null characters.
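
For anyone who wants to recreate a comparable test file without the zip, a sketch like the following should do. The function name and the chunked writing are illustrative choices; only the "Hello World" payload and the roughly 30 MB of padding come from the description above.

```python
def write_test_resource(path, payload=b"Hello World", padding=30 * 1024 * 1024):
    """Write `payload` followed by `padding` NUL bytes; return the total size."""
    chunk = bytes(1024 * 1024)  # 1 MiB of zeros, reused for every write
    remaining = padding
    with open(path, "wb") as f:
        f.write(payload)
        # Write the padding in chunks so it is never held in memory at once.
        while remaining > 0:
            n = min(remaining, len(chunk))
            f.write(chunk[:n])
            remaining -= n
    return len(payload) + padding


if __name__ == "__main__":
    size = write_test_resource("data.res")
    print(f"wrote data.res ({size} bytes)")
```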

I test it with the following source:

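#include <Windows.h>
#include <cassert>
#include <string_view>

using namespace std;

int main()
{
    // Look up the "MyRes" resource of custom type "UserResources" in this executable
    HRSRC hResInfo = FindResourceA(NULL, "MyRes", "UserResources");
    assert(hResInfo != NULL);
    HGLOBAL hRes = LoadResource(NULL, hResInfo);
    assert(hRes != NULL);
    // LockResource returns a pointer to the resource bytes
    LPVOID lpRes = LockResource(hRes);
    DWORD dwSize = SizeofResource(NULL, hResInfo);
    assert(lpRes != NULL && dwSize > 0);
    // data.res starts with "Hello World" followed by null bytes, so the
    // C-string constructor stops right after the greeting
    string_view view((const char*)lpRes);
    assert(view == "Hello World");
    return 0;
}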

It was much easier than I expected, but this was aided by CMake, which knows exactly how to deal with .rc files. I have no idea how this should be implemented for the current build system of ICU on Windows, but after all we are discussing this in a PR about a CMake port, which should definitely ease builds on Windows as well.

ceztko avatar Mar 31 '23 22:03 ceztko