Arrow Integration + Test in CI
Description
Quick note: the CI pipeline run from develop on my fork (where develop == upstream/develop) shows exactly the same failing tests as upstream.
With that out of the way, I'll try to explain the changes made here, because beyond the top-level ones it gets murky and hard to follow at times.
Top level changes:
- Updated the GitHub Actions workflows to reflect the Arrow integration, i.e. changed the macOS (brew) and Ubuntu (apt) scripts to add the libarrow and parquet dependencies.
- Created Arrow_test, much like the hdf5 test, to exercise Arrow functionality by mimicking an example use case from the Apache Arrow website.
- Added an option in CMakeLists.txt to also look for the arrow/parquet dependencies (if one is building from source).
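The CMake side of these changes can be sketched roughly like this. This is a hedged sketch, not the exact diff: the option name matches the description above, but the target and package-config names are assumptions.

```cmake
# Sketch only: illustrates the shape of the WITH_PARQUET option described
# above; the exact option/target names in the PR may differ.
option(WITH_PARQUET "Build with Apache Arrow/Parquet support" OFF)

if(WITH_PARQUET)
  # Both packages ship CMake config files when installed via brew or apt.
  find_package(Arrow REQUIRED)
  find_package(Parquet REQUIRED)
  # "OpenMS" here is a placeholder target name.
  target_link_libraries(OpenMS PRIVATE Arrow::arrow_shared Parquet::parquet_shared)
endif()
```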
The rationale behind conditional JSON + explicit conversion + conditional parquet: so far I have not figured out how to integrate Arrow (and its subdirectories) into OpenMS' contrib, since it doesn't follow the simple lib + headers + src structure well -- instead it contains src/parquet, src/gandiva, src/arrow, and so on. Because the Windows CI depends on contrib for its dependencies, I had to add a few extra conditionals for WITH_PARQUET and -DUSE_EXTERNAL_JSON.
Re: USE_EXTERNAL_JSON: it should be reasonably well commented in the code, but essentially, the brew Arrow package on macOS does NOT bundle nlohmann_json, while the apt package on Ubuntu does. Reconciling the two requires either a) changing the Ubuntu package so it doesn't bundle nlohmann_json in the first place, which doesn't seem doable, or b) using Arrow's bundled nlohmann_json when building on Ubuntu. I elected for (b).
This, in turn, brought its own set of issues, the largest being that implicit conversion somehow differed across versions. Even though the nlohmann_json docs specify that implicit conversion is enabled by default (via JSON_USE_IMPLICIT_CONVERSIONS, I believe), Arrow's bundled JSON appears to allow only explicit conversion. So I turned implicit conversion off in its CMake (for readability) and added a comment where Arrow's JSON is used, to specify that the file must be written with explicit handling of nlohmann_json methods.
I recently attempted removing the explicit conversion logic here: https://github.com/Avni2000/OpenMS/actions/runs/15864899276/job/44729833640 and the build failed with exactly those implicit conversion errors against Arrow's JSON.
Here's an example of where everything succeeds, at least as well as upstream/develop does: https://github.com/Avni2000/OpenMS/actions/runs/15865087724/job/44730344367
Checklist
- [ ] Make sure that you are listed in the AUTHORS file -> not yet
- [ ] Add relevant changes and new features to the CHANGELOG file -> nothing too big for non-developers
- [X] I have commented my code, particularly in hard-to-understand areas
- [X] New and existing unit tests pass locally with my changes -> old unit tests fail without my changes, I believe.
- [X] Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)
How can I get additional information on failed tests during CI?
Click to expand
If your PR is failing you can check out:
- The details of the action statuses at the end of the PR or the "Checks" tab.
- http://cdash.seqan.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number. If you click in the column that lists the failed tests you will get detailed error messages.
Advanced commands (admins / reviewer only)
Click to expand
- /reformat (experimental) applies the clang-format style changes as an additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
- Setting the label "NoJenkins" will skip tests for this PR on Jenkins (saves resources, e.g., on edits that do not affect tests).
- Commenting with `rebuild jenkins` will retrigger Jenkins-based CI builds.
:warning: Note: Once you have opened a PR, try to minimize the number of pushes to it, as every push triggers CI (automated builds and tests) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).
Summary by CodeRabbit
New Features
- Added optional support for Apache Arrow and Parquet file formats, controllable via a build option.
- Introduced comprehensive tests for reading and writing Arrow and Parquet files.
Bug Fixes
- Improved type safety and JSON (de)serialization in ribonucleotide parsing.
Chores
- Updated CI and build scripts to handle platform-specific dependencies for Arrow and Parquet.
- Enhanced .gitignore to exclude build directories.
- Updated dependency installation scripts for macOS and Ubuntu to include Apache Arrow and Parquet.
Documentation
- Clarified comments and error messages for better readability.
"""
Walkthrough
This update introduces conditional build and test support for Apache Arrow and Parquet libraries in the OpenMS project, primarily via new CMake options and CI workflow logic. It adds new tests for Arrow/Parquet IO, updates dependency installation scripts, and enhances JSON serialization for EmpiricalFormula. Several configuration and ignore files are also updated.
Changes
| Files/Paths | Change Summary |
|---|---|
| .github/workflows/openms_ci_matrix_full.yml, tools/ci/cibuild.cmake | Add environment variables WITH_PARQUET and USE_EXTERNAL_JSON; set conditionally in CI and load in build script. |
| .gitignore | Add OpenMS-build/* to ignored paths. |
| CMakeLists.txt, cmake/OpenMSConfig.cmake.in, cmake/cmake_findExternalLibs.cmake | Add WITH_PARQUET option; update comments; add Arrow/Parquet package detection if WITH_PARQUET is ON. |
| src/openms/CMakeLists.txt, src/tests/class_tests/openms/CMakeLists.txt, executables.cmake | Conditionally link Arrow and Parquet libraries and test executables if WITH_PARQUET is enabled. |
| src/openms/extern/CMakeLists.txt | Set JSON_USE_IMPLICIT_CONVERSIONS OFF when using external nlohmann_json. |
| src/openms/source/CHEMISTRY/RibonucleotideDB.cpp | Add custom JSON (de)serialization for EmpiricalFormula; update parsing logic for type safety. |
| src/tests/class_tests/openms/source/Arrow_test.cpp | New test file for Arrow and Parquet IO: creation, reading, and writing in multiple formats. |
| doc/doxygen/test/Doxygen_Warning_Checker.cpp | Change output message from "Doxygen warnings" to "Doxygen errors." |
| tools/ci/deps-macos.sh, tools/ci/deps-ubuntu.sh | Add Apache Arrow (and Parquet on Ubuntu) to required dependencies installation scripts. |
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant CI as GitHub Actions CI
    participant CMake as CMake Build System
    participant OS as OS Environment
    participant Arrow as Apache Arrow/Parquet
    CI->>OS: Set WITH_PARQUET/USE_EXTERNAL_JSON based on OS
    CI->>CMake: Trigger build with env vars
    CMake->>CMake: Check WITH_PARQUET option
    alt WITH_PARQUET=ON
        CMake->>Arrow: find_package(Arrow, Parquet)
        CMake->>CMake: Link Arrow/Parquet to targets
    end
    CMake->>CMake: Build OpenMS with/without Arrow/Parquet
```

```mermaid
sequenceDiagram
    participant Test as Arrow_test.cpp
    participant Arrow as Apache Arrow
    participant Parquet as Apache Parquet
    Test->>Arrow: Create Arrow Table (days, months, years)
    Test->>Arrow: Write to IPC, CSV
    Test->>Parquet: Write to Parquet
    Test->>Arrow: Read IPC, write new IPC
    Test->>Arrow: Read CSV, write new CSV
    Test->>Parquet: Read Parquet, write new Parquet
```
Suggested labels
Review effort 2/5
Suggested reviewers
- timosachsenberg
Poem
In the land where data streams and flows,
Parquet and Arrow now compose—
With CMake flags and JSON neat,
Our builds and tests are quite complete!
Ubuntu, Mac, and Windows too,
The bunny hops through tasks anew.
🐇✨
@timosachsenberg @jpfeuffer would love to get your feedback on this, particularly the somewhat convoluted nlohmann_json part. Let me know if I can clarify anything.
First batch of general comments:
Arrow does not necessarily need to be in contrib. You could just install it from vcpkg or conda. But even if you put it in contrib, the structure with subfolders should not be a problem. Our contrib basically calls their CMake commands and installs it in a specific output folder in a way that find_package scripts find the output artifacts there. See boost, which is a very complicated build system (not even Cmake) with multiple subpackages.
Yes, implicit conversions should definitely go! It will be deprecated in the next json version anyway.
Should be as easy as adding a conversion at the points where we are filling string fields with JSON entry objects and vice versa. Just use entry.get<std::string>() everywhere.
Yes, nlohmann_json should always be the same as the one used by Arrow (if used). Therefore nlohmann_json should always be looked up first with find_package, and only built together with OpenMS if it is not available. If Arrow has a good CMake config, it will just call find_dependency for you and you will get the nlohmann_json target for free anyway.
I don't really understand if the json explicit conversion is working now or not? You have changes here that apparently pass on Linux. So what's working and what is not?
@jpfeuffer @timosachsenberg We have two versions of JSON involved and two versions of Arrow involved. JSON explicit conversion was always working; the problem was with implicit conversions.
I should've clarified more and I'm a maths major so I'll use some more specific terminology.
Let there be two versions of Arrow: Brew_Arrow and Apt_Arrow, corresponding to mac and linux installations respectively.
Brew_Arrow does NOT bundle ANY nlohmann_json dependencies with it. Apt_Arrow does, likely because Apt_Arrow comes from Apache's latest Arrow release (note the wget).
So, how do we reconcile these differences? On one end, we just introduced an external JSON with Apt_Arrow, but ONLY for Apt_Arrow. On the other end, we have Brew_Arrow, which is probably a bit behind and comes without said bundled nlohmann_json. And as you mentioned, we'd prefer to support explicit conversion regardless.
I'm unsure why (see my original comment at the top, where it threw a bunch of implicit conversion errors even though implicit conversion was turned on; see the git logs for IMPLICIT_CONVERSION), but adding nlohmann_json on Linux required me to a) turn on USE_EXTERNAL_JSON (which is intuitive and a given) but also b) convert implicit conversions to explicit ones in RibonucleotideDB.
On that subfolders part, I'm trying out a new approach with cmake like here: cmake config + utilization and I'll open up an issue by end of Friday (CST) if I can't figure it out and show you my approach.
I get it now. The reason why you saw the errors on Linux is most likely that your CMake setting had no effect on the already built external json lib. You need to set a compile macro as explained in the comment above. If everything is working with explicit conversions now, you should remove your attempt to set implicit conversions to ON for external. And at the same time set that option to OFF in our internal build as well! To keep things consistent.
Agreed, when you mentioned the target compile thing yesterday, I thought this was the case as well - thanks for the help
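Setting the compile macro mentioned above is a one-liner on the consuming target. A sketch under the assumption that the real nlohmann_json macro JSON_USE_IMPLICIT_CONVERSIONS is what needs disabling; "OpenMS" is a placeholder target name:

```cmake
# Sketch: disable nlohmann_json implicit conversions for a consuming target,
# even when the json library itself was built elsewhere (e.g. Arrow's
# vendored copy). The macro is evaluated in the header at compile time.
target_compile_definitions(OpenMS PRIVATE JSON_USE_IMPLICIT_CONVERSIONS=0)
```

This keeps internal and external builds consistent, since the setting travels with the target rather than with whichever copy of the library was built first.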
@jpfeuffer I have the option to export an arrow config file with -DARROW_JSON=ON - should I opt to use this instead and get rid of Nlohmann json through extern entirely?
Hi! So, if we go the contrib way (remember there might be other ways to get arrow on Win), arrow should be built with as few optional dependencies as possible. We have to find out which those are. However, ideally, since we depend on nlohmann_json ourselves already, nlohmann_json should be refactored to be built in contrib as well, built before arrow, and then used during the arrow build. Lastly we link to both in OpenMS.
Having said that: we are basically building our own dependency manager here, and we are SO CLOSE to finishing external dependency management in BOTH conda AND vcpkg. I personally think that now would be a great time to finally get one of those running instead.
The way we do it in contrib is particularly convoluted, I agree. Does that mean we nix contrib altogether and fix the corresponding bugs in the CMakeLists? Or does it just mean replacing contrib with conda/vcpkg on Windows, and continuing with contrib on the actual prod version of OpenMS?
I'm curious how that would be handled by the current CMakeLists in OpenMS (excluding the contrib submodule). Are you also shipping an environment.yml for conda? Or, since the CMakeLists handle finding dependencies rather than the actual installation, does that imply we're just leaving them alone?
TL;DR: I'm confused about how conda/vcpkg fits into the build system. I'm more than open to trying anything, and if you want to point me to a branch/fork where you're working on this, I'd love to take a look and adapt my code, with the expectation of it being merged soon.
Conda or vcpkg will be the thing that installs/builds the dependencies. You just point OpenMS' CMake to where they install the libs, and our CMake should pick them up. The only problems are usually that CMake configs or naming schemes of libraries often differ from installing via brew or apt, and you need some tweaks in CMake.
You could actually install just Arrow with conda, but you have to hope that conda builds the library with a compatible environment (C compiler, runtime). On Windows it might actually work.
```shell
conda env create -p ./.conda -f env.yml
cmake -DCMAKE_PREFIX_PATH="./.conda/Library;./.conda" ../OpenMS
```
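The env.yml referenced above isn't shown; a hypothetical minimal version might look like the following. The package names are assumptions (conda-forge ships the Arrow C++ libraries as libarrow/libparquet), not a tested manifest:

```yaml
# Hypothetical env.yml for the conda commands above.
name: openms-arrow-deps
channels:
  - conda-forge
dependencies:
  - libarrow   # Arrow C++ libraries and CMake config
  - libparquet # Parquet C++ libraries
```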
vcpkg has full integration with CMake. You just need one line or so in CMake to activate a vcpkg.json manifest in the repo (one already exists but is probably not fully working). There is a PR here somewhere, but maybe one should start from scratch following their documentation.
@Avni2000 I think you could give installing Arrow on Windows with conda a try. In contrast to vcpkg, you get precompiled binaries. vcpkg is often more compatible, but you need to actually build the library. I don't know if this is too much for our CI.
For vcpkg you could also enable binary caching, like we do for contrib, and just publish to GH Packages. Then register this as a source: https://learn.microsoft.com/en-us/vcpkg/consume/binary-caching-github-packages?pivots=windows-runner
ah didn't know that this now works through GH packages
Then I would actually prefer to get rid of contrib and use vcpkg with caching - if (!) this works for arm etc.
@Avni2000 but maybe a quick check with conda to get the PR merged would be what I would prioritize now. Switching to vcpkg could be done in another PR.
failing test unrelated... let's merge