
Arrow Integration + Test in CI

Avni2000 opened this issue 6 months ago • 15 comments

Description

Quick note: it seems like the CI pipeline run from develop (on my fork, where develop == upstream/develop) shows exactly the same failing tests as upstream.

With that said, I'll try to explain some of the changes made here, because beyond the top-level ones it gets murky and difficult to follow at times.

Top level changes:

  • Updated the GitHub Actions workflows to reflect the Arrow integration, i.e., changed the macOS (brew) and Ubuntu (apt) jobs to install the libarrow and Parquet dependencies.
  • Created Arrow_test, much like the existing HDF5 test, to exercise Arrow functionality by mimicking an example use case from the Apache Arrow website (see the sketch after this list).
  • Added an option in CMakeLists.txt to look for the Arrow/Parquet dependencies as well (if one is building from source).
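
A minimal sketch of what such a round-trip test can look like, adapted from the Apache Arrow C++ getting-started examples (the column name, values, and file path are illustrative, not the actual test contents):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <memory>

arrow::Status RunRoundTrip()
{
  // Build a single int32 column, as in the Arrow "getting started" example.
  arrow::Int32Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 12, 17, 23, 28}));
  std::shared_ptr<arrow::Array> days;
  ARROW_RETURN_NOT_OK(builder.Finish(&days));

  auto schema = arrow::schema({arrow::field("days", arrow::int32())});
  std::shared_ptr<arrow::Table> table = arrow::Table::Make(schema, {days});

  // Write the table out as Parquet ...
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("days.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), outfile, /*chunk_size=*/1024));

  // ... and read it back, so both the write and the read paths are exercised.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("days.parquet"));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> roundtrip;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&roundtrip));
  return arrow::Status::OK();
}
```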

The rationale behind conditional JSON + explicit conversion + conditional Parquet: so far, I've yet to figure out how to integrate Arrow (and its subdirectories) into OpenMS' contrib, since it doesn't follow the simple lib + headers + src structure well -- instead it contains src/parquet + src/gandiva + src/arrow + ... Since the Windows CI depends on contrib for its dependencies, I had to create a few extra conditionals for WITH_PARQUET and -DUSE_EXTERNAL_JSON.
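
For context, a rough CMake sketch of that conditional wiring; WITH_PARQUET is the option this PR adds, the Arrow::arrow_shared and Parquet::parquet_shared targets are what Arrow's own CMake packages export, and the OpenMS target name is illustrative:

```cmake
option(WITH_PARQUET "Enable Apache Arrow/Parquet support" OFF)

if(WITH_PARQUET)
  # Arrow ships CMake config packages; Parquet is looked up separately.
  find_package(Arrow CONFIG REQUIRED)
  find_package(Parquet CONFIG REQUIRED)
  target_link_libraries(OpenMS PRIVATE Arrow::arrow_shared Parquet::parquet_shared)
  # Expose the flag to C++ so Arrow-dependent code can be compiled out.
  target_compile_definitions(OpenMS PRIVATE WITH_PARQUET)
endif()
```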

Re: USE_EXTERNAL_JSON: it should be fairly well commented, but essentially, the brew Arrow on macOS DOESN'T bundle nlohmann_json while the apt Arrow on Ubuntu does. Reconciling the two necessitates either a) changing the Ubuntu package to not bundle nlohmann_json in the first place, which doesn't seem doable, or b) making it so that if we're on Ubuntu, we must use Arrow's version of nlohmann_json. I elected for (b).

This then brought its own set of issues, the largest being that implicit conversion somehow differed across versions. Even though the nlohmann_json docs specify that it's automatically turned on via JSON_USE_IMPLICIT_CONVERSIONS (I believe), it seems as if Arrow's JSON only allows explicit conversion. So I turned implicit conversion off in its CMake (for readability) and added a comment where Arrow's JSON is used, to specify that the file must be written with explicit handling of nlohmann_json methods.
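
To illustrate what the explicit handling means in practice (the field name here is hypothetical):

```cpp
#include <nlohmann/json.hpp>
#include <string>

using json = nlohmann::json;

void parse(const json& entry)
{
  // Implicit conversion: only compiles when JSON_USE_IMPLICIT_CONVERSIONS is on.
  // std::string formula = entry["formula"];

  // Explicit conversion: compiles regardless of the flag.
  std::string formula = entry["formula"].get<std::string>();
}
```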

I recently attempted taking the explicit conversion logic out (https://github.com/Avni2000/OpenMS/actions/runs/15864899276/job/44729833640), and the build failed because of said implicit conversion errors with Arrow's JSON.

Here's an example of where everything succeeds, at least as well as upstream/develop does: https://github.com/Avni2000/OpenMS/actions/runs/15865087724/job/44730344367

Checklist

  • [ ] Make sure that you are listed in the AUTHORS file -> not yet
  • [ ] Add relevant changes and new features to the CHANGELOG file -> nothing too big for non-developers
  • [X] I have commented my code, particularly in hard-to-understand areas
  • [X] New and existing unit tests pass locally with my changes -> old unit tests fail without my changes, I believe.
  • [X] Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)

How can I get additional information on failed tests during CI?

If your PR is failing, you can check out:
  • The details of the action statuses at the end of the PR or the "Checks" tab.
  • http://cdash.seqan.de/index.php?project=OpenMS, where you can look for your PR. Use the "Show filters" capability on the top right to search for your PR number. If you click on the column that lists the failed tests, you will get detailed error messages.

Advanced commands (admins / reviewer only)

  • /reformat (experimental) applies the clang-format style changes as an additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
  • Setting the label "NoJenkins" will skip tests for this PR on Jenkins (saves resources, e.g., on edits that do not affect tests).
  • Commenting with rebuild jenkins will retrigger Jenkins-based CI builds.

:warning: Note: Once you have opened a PR, try to minimize the number of pushes to it, as every push triggers CI (automated builds and tests) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).

Summary by CodeRabbit

  • New Features

    • Added optional support for Apache Arrow and Parquet file formats, controllable via a build option.
    • Introduced comprehensive tests for reading and writing Arrow and Parquet files.
  • Bug Fixes

    • Improved type safety and JSON (de)serialization in ribonucleotide parsing.
  • Chores

    • Updated CI and build scripts to handle platform-specific dependencies for Arrow and Parquet.
    • Enhanced .gitignore to exclude build directories.
    • Updated dependency installation scripts for macOS and Ubuntu to include Apache Arrow and Parquet.
  • Documentation

    • Clarified comments and error messages for better readability.

Avni2000 avatar Jun 25 '25 13:06 Avni2000

"""

Walkthrough

This update introduces conditional build and test support for Apache Arrow and Parquet libraries in the OpenMS project, primarily via new CMake options and CI workflow logic. It adds new tests for Arrow/Parquet IO, updates dependency installation scripts, and enhances JSON serialization for EmpiricalFormula. Several configuration and ignore files are also updated.

Changes

| Files/Paths | Change Summary |
| --- | --- |
| .github/workflows/openms_ci_matrix_full.yml, tools/ci/cibuild.cmake | Add environment variables WITH_PARQUET and USE_EXTERNAL_JSON; set them conditionally in CI and load them in the build script. |
| .gitignore | Add OpenMS-build/* to ignored paths. |
| CMakeLists.txt, cmake/OpenMSConfig.cmake.in, cmake/cmake_findExternalLibs.cmake | Add WITH_PARQUET option; update comments; add Arrow/Parquet package detection if WITH_PARQUET is ON. |
| src/openms/CMakeLists.txt, src/tests/class_tests/openms/CMakeLists.txt, executables.cmake | Conditionally link the Arrow and Parquet libraries and test executables if WITH_PARQUET is enabled. |
| src/openms/extern/CMakeLists.txt | Set JSON_USE_IMPLICIT_CONVERSIONS to OFF when using an external nlohmann_json. |
| src/openms/source/CHEMISTRY/RibonucleotideDB.cpp | Add custom JSON (de)serialization for EmpiricalFormula; update parsing logic for type safety. |
| src/tests/class_tests/openms/source/Arrow_test.cpp | New test file for Arrow and Parquet IO: creation, reading, and writing in multiple formats. |
| doc/doxygen/test/Doxygen_Warning_Checker.cpp | Change output message from "Doxygen warnings" to "Doxygen errors." |
| tools/ci/deps-macos.sh, tools/ci/deps-ubuntu.sh | Add Apache Arrow (and Parquet on Ubuntu) to the dependency installation scripts. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as GitHub Actions CI
    participant CMake as CMake Build System
    participant OS as OS Environment
    participant Arrow as Apache Arrow/Parquet

    CI->>OS: Set WITH_PARQUET/USE_EXTERNAL_JSON based on OS
    CI->>CMake: Trigger build with env vars
    CMake->>CMake: Check WITH_PARQUET option
    alt WITH_PARQUET=ON
        CMake->>Arrow: find_package(Arrow, Parquet)
        CMake->>CMake: Link Arrow/Parquet to targets
    end
    CMake->>CMake: Build OpenMS with/without Arrow/Parquet
```

```mermaid
sequenceDiagram
    participant Test as Arrow_test.cpp
    participant Arrow as Apache Arrow
    participant Parquet as Apache Parquet

    Test->>Arrow: Create Arrow Table (days, months, years)
    Test->>Arrow: Write to IPC, CSV
    Test->>Parquet: Write to Parquet
    Test->>Arrow: Read IPC, write new IPC
    Test->>Arrow: Read CSV, write new CSV
    Test->>Parquet: Read Parquet, write new Parquet
```

Suggested labels

Review effort 2/5

Suggested reviewers

  • timosachsenberg

Poem

In the land where data streams and flows,
Parquet and Arrow now compose—
With CMake flags and JSON neat,
Our builds and tests are quite complete!
Ubuntu, Mac, and Windows too,
The bunny hops through tasks anew.
🐇✨

coderabbitai[bot] avatar Jun 25 '25 13:06 coderabbitai[bot]

@timosachsenberg @jpfeuffer I would love to get your feedback on this, particularly on the somewhat convoluted nlohmann_json part. Let me know if I can clarify anything.

Avni2000 avatar Jun 25 '25 13:06 Avni2000

First batch of general comments:

Arrow does not necessarily need to be in contrib. You could just install it from vcpkg or conda. But even if you put it in contrib, the structure with subfolders should not be a problem. Our contrib basically calls their CMake commands and installs everything into a specific output folder, in a way that find_package scripts find the output artifacts there. See Boost, which has a very complicated build system (not even CMake) with multiple subpackages.

Yes, implicit conversions should definitely go! They will be deprecated in the next json version anyway. It should be as easy as adding a conversion at the points where we fill string fields from json entry objects and vice versa. Just use entry.get<std::string>() everywhere.

Yes, nlohmann_json should always be the same as the one used by Arrow (if used). Therefore nlohmann_json should always be looked up first with find_package, and only be built together with OpenMS if it is not available. If Arrow has a good CMake config, it will just call find_dependency for you and you will get the nlohmann_json target for free anyway.
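
A minimal CMake sketch of that lookup order (illustrative, not the actual OpenMS code):

```cmake
# Prefer an installed nlohmann_json (e.g., the one Arrow was built against) ...
find_package(nlohmann_json QUIET)
if(NOT nlohmann_json_FOUND)
  # ... and only fall back to the copy bundled with OpenMS (path illustrative).
  add_subdirectory(src/openms/extern/nlohmann_json)
endif()
# Both paths provide the same imported target.
target_link_libraries(OpenMS PRIVATE nlohmann_json::nlohmann_json)
```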

jpfeuffer avatar Jun 25 '25 13:06 jpfeuffer

I don't really understand whether the JSON explicit conversion is working now or not. You have changes here that apparently pass on Linux. So what's working and what is not?

jpfeuffer avatar Jun 25 '25 18:06 jpfeuffer

I don't really understand whether the JSON explicit conversion is working now or not. You have changes here that apparently pass on Linux. So what's working and what is not?

@jpfeuffer @timosachsenberg We have two versions of JSON involved and two versions of Arrow involved. JSON explicit conversion was always working; the problem was with implicit conversions.

I should've clarified more; I'm a maths major, so I'll use some more specific terminology.

Let there be two versions of Arrow: Brew_Arrow and Apt_Arrow, corresponding to the macOS and Linux installations respectively.

Brew_Arrow does NOT bundle ANY nlohmann_json dependencies with it. Apt_Arrow does, likely because Apt_Arrow comes from Apache's latest Arrow version (note the wget).

So, how do we reconcile these differences? On one end, we just introduced an external JSON with Apt_Arrow, but ONLY for Apt_Arrow. On the other, we have Brew_Arrow, which is probably a bit behind and lacks said bundled nlohmann_json. And as you mentioned, we'd prefer to support explicit conversion regardless.

I'm unsure why (see my original comment at the top, where it threw a bunch of implicit conversion errors even though implicit conversion was turned on; see the git log for IMPLICIT_CONVERSION), but adding nlohmann_json on Linux did require me to a) turn on USE_EXTERNAL_JSON (which is intuitive and a given) and also b) convert implicit conversions to explicit ones in RibonucleotideDB.

Avni2000 avatar Jun 25 '25 23:06 Avni2000

First batch of general comments:

Arrow does not necessarily need to be in contrib. You could just install it from vcpkg or conda. But even if you put it in contrib, the structure with subfolders should not be a problem. Our contrib basically calls their CMake commands and installs everything into a specific output folder, in a way that find_package scripts find the output artifacts there. See Boost, which has a very complicated build system (not even CMake) with multiple subpackages.

Yes, implicit conversions should definitely go! They will be deprecated in the next json version anyway. It should be as easy as adding a conversion at the points where we fill string fields from json entry objects and vice versa. Just use entry.get<std::string>() everywhere.

Yes, nlohmann_json should always be the same as the one used by Arrow (if used). Therefore nlohmann_json should always be looked up first with find_package, and only be built together with OpenMS if it is not available. If Arrow has a good CMake config, it will just call find_dependency for you and you will get the nlohmann_json target for free anyway.

On the subfolders part, I'm trying out a new approach with CMake, like here: cmake config + utilization. I'll open up an issue by end of Friday (CST) if I can't figure it out and show you my approach.

Avni2000 avatar Jun 25 '25 23:06 Avni2000

I get it now. The reason you saw the errors on Linux is most likely that your CMake setting had no effect on the already-built external json lib. You need to set a compile macro, as explained in the comment above. If everything is working with explicit conversions now, you should remove your attempt to set implicit conversions to ON for the external build, and at the same time set that option to OFF in our internal build as well, to keep things consistent!
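
Concretely, since nlohmann_json is header-only, that boils down to something like the following on the consuming side (the target name is illustrative):

```cmake
# Setting the option in the external project's CMake has no effect on an
# already-installed copy; the macro has to be defined where the headers
# are compiled, i.e., on the consuming targets.
target_compile_definitions(OpenMS PRIVATE JSON_USE_IMPLICIT_CONVERSIONS=0)
```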

jpfeuffer avatar Jun 26 '25 08:06 jpfeuffer

I get it now. The reason you saw the errors on Linux is most likely that your CMake setting had no effect on the already-built external json lib. You need to set a compile macro, as explained in the comment above. If everything is working with explicit conversions now, you should remove your attempt to set implicit conversions to ON for the external build, and at the same time set that option to OFF in our internal build as well, to keep things consistent!

Agreed; when you mentioned the target compile definitions yesterday, I thought this was the case as well. Thanks for the help!

Avni2000 avatar Jun 26 '25 12:06 Avni2000

@jpfeuffer I have the option to export an Arrow config file with -DARROW_JSON=ON. Should I opt to use this instead and get rid of nlohmann_json through extern entirely?

Avni2000 avatar Jun 27 '25 02:06 Avni2000

Hi! So, if we go the contrib way (remember there might be other ways to get Arrow on Windows), Arrow should be built with as few optional dependencies as possible. We have to find out which those are. Ideally, though, since we depend on nlohmann_json ourselves already, nlohmann_json should be refactored to be built in contrib as well, built before Arrow, and then used during the Arrow build. Lastly, we link to both in OpenMS.

Having said that: we are basically building our own dependency manager here, and we are SO CLOSE to finishing external dependency management in BOTH conda AND vcpkg that I personally think now would be a great time to finally get one of those running instead.

jpfeuffer avatar Jun 27 '25 08:06 jpfeuffer

The way we do it in contrib is particularly convoluted, I agree. Does that mean we nix contrib altogether and fix the corresponding bugs in the CMakeLists? Or does it just mean replacing contrib with conda/vcpkg on Windows, and then continuing with contrib for the actual production version of OpenMS?

I'm curious how that would be handled by the current CMakeLists in OpenMS (excluding the contrib submodule). Are you also shipping an environment.yml for conda? Or, since the CMakeLists handle finding dependencies rather than the actual installation, does that imply we're just leaving them alone?

TLDR: I'm confused about how conda/vcpkg fits into the build system. I'm more than open to trying anything, and if you want to point me towards a branch/fork where you're working on this, I'd love to take a look and adapt my code with the expectation of it being merged soon.

Avni2000 avatar Jun 27 '25 12:06 Avni2000

Conda or vcpkg will be the thing that installs/builds the dependencies. You just point OpenMS' CMake to where they install the libs, and our CMake should pick them up. The only problems are usually that the CMake configs or naming schemes of libraries often differ from those installed via brew or apt, so you need some tweaks in CMake.

You could actually install just Arrow with conda, but you have to hope that conda builds the library with a compatible environment (C compiler, runtime). On Windows it might actually work.

```sh
conda env create -p ./.conda -f env.yml
cmake -DCMAKE_PREFIX_PATH="./.conda/Library;./.conda" ../OpenMS
```

vcpkg has full integration with CMake. You just need one line or so in CMake to activate a vcpkg.json manifest in the repo (which already exists but is probably not working completely). There is a PR here somewhere, but maybe one should start from scratch with their documentation.
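
For reference, activating manifest mode is standard vcpkg usage and amounts to pointing CMake at the vcpkg toolchain file, roughly:

```cmake
# Standard vcpkg integration (not OpenMS-specific): set the toolchain file
# before the first project() call, or pass it on the command line as
# -DCMAKE_TOOLCHAIN_FILE=<vcpkg-root>/scripts/buildsystems/vcpkg.cmake.
# With a vcpkg.json manifest in the repo root, dependencies are then
# built/installed automatically at configure time.
set(CMAKE_TOOLCHAIN_FILE "$ENV{VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake"
    CACHE STRING "vcpkg toolchain")
```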

jpfeuffer avatar Jun 27 '25 18:06 jpfeuffer

@Avni2000 I think you could give installing Arrow on Windows with conda a try. In contrast to vcpkg, you get already-compiled binaries. vcpkg is often more compatible, but you need to actually build the library. I don't know if this is too much for our CI.

timosachsenberg avatar Jun 30 '25 07:06 timosachsenberg

For vcpkg you could also enable binary caching, like we do for contrib, and just publish to GH Packages. Then register this as a source: https://learn.microsoft.com/en-us/vcpkg/consume/binary-caching-github-packages?pivots=windows-runner

jpfeuffer avatar Jun 30 '25 07:06 jpfeuffer

For vcpkg you could also enable binary caching, like we do for contrib, and just publish to GH Packages. Then register this as a source: https://learn.microsoft.com/en-us/vcpkg/consume/binary-caching-github-packages?pivots=windows-runner

ah didn't know that this now works through GH packages

Then I would actually prefer to get rid of contrib and use vcpkg with caching - if (!) this works for ARM etc.

@Avni2000 but maybe a quick check with conda to get the PR merged would be what I would prioritize now. Switching to vcpkg could be done in another PR.

timosachsenberg avatar Jun 30 '25 08:06 timosachsenberg

failing test unrelated... let's merge

timosachsenberg avatar Jul 07 '25 14:07 timosachsenberg