Move yyjson dependency to PAX storage directory
Move yyjson from top-level dependency directory to contrib/pax_storage/src/cpp/contrib/yyjson since it's only used by the PAX storage component. Update all references in CMake build files, .gitmodules, and documentation to reflect the new location. This change improves the project structure by keeping PAX-specific dependencies together and removing the unnecessary top-level dependency directory.
The move includes:
- Moving yyjson directory to contrib/pax_storage/src/cpp/contrib/yyjson
- Updating CMakeLists.txt and pax.cmake to reference the new location
- Updating .gitmodules to reflect the new submodule path
- Updating README.md documentation to show the new location
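As a quick check that no reference was missed, the sketch below lists build and documentation files that still mention the old path. It assumes the old location was dependency/yyjson; the function name find_stale_refs and the set of file patterns are illustrative and not part of this PR.

```shell
#!/bin/sh
# Print every build/doc file under ROOT that still mentions OLD_PATH.
# Hypothetical helper: the name and the file patterns are assumptions.
find_stale_refs() {
    root="$1"
    old_path="$2"
    # -r: recurse, -l: print file names only, -F: treat the path as a fixed string.
    # "|| true" keeps the exit status 0 when nothing matches.
    grep -rlF \
        --include='CMakeLists.txt' --include='*.cmake' \
        --include='.gitmodules' --include='README.md' \
        -- "$old_path" "$root" || true
}
```

Running `find_stale_refs . dependency/yyjson` from the repository root should print nothing once every reference has been updated.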
Fixes #ISSUE_Number
What does this PR do?
Type of Change
- [ ] Bug fix (non-breaking change)
- [ ] New feature (non-breaking change)
- [ ] Breaking change (fix or feature with breaking changes)
- [ ] Documentation update
Breaking Changes
Test Plan
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Passed make installcheck
- [ ] Passed make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
- [ ] Followed contribution guide
- [ ] Added/updated documentation
- [ ] Reviewed code for security implications
- [ ] Requested review from cloudberry committers
Additional Context
CI Skip Instructions
Hi @tuhaihe !
I reviewed your PR and wanted to share an alternative implementation that eliminates the need for git submodules entirely. This approach uses CMake's FetchContent module, which I've seen work well in other Apache projects like MADlib.
Alternative: CMake FetchContent
I've implemented this on branch fetchcontent-yyjson for your consideration.
Key change: Replace the git submodule with automatic dependency fetching during CMake configuration.
Implementation
contrib/pax_storage/CMakeLists.txt:
if(USE_MANIFEST_API AND NOT USE_PAX_CATALOG)
  include(FetchContent)

  # Try system package first
  find_package(yyjson QUIET)

  if(NOT yyjson_FOUND)
    message(STATUS "yyjson not found in system, fetching from GitHub...")
    FetchContent_Declare(
      yyjson
      GIT_REPOSITORY https://github.com/ibireme/yyjson.git
      GIT_TAG 0.12.0
      GIT_SHALLOW TRUE
    )
    set(SAVED_BUILD_SHARED_LIBS ${BUILD_SHARED_LIBS})
    set(BUILD_SHARED_LIBS ON)
    FetchContent_MakeAvailable(yyjson)
    set(BUILD_SHARED_LIBS ${SAVED_BUILD_SHARED_LIBS})
  else()
    message(STATUS "Using system yyjson package")
  endif()
endif()
contrib/pax_storage/src/cpp/cmake/pax.cmake:
# Update include path to use FetchContent variable
set(pax_target_include ${pax_target_include} ${yyjson_SOURCE_DIR}/src)
Remove from .gitmodules - No submodule entry needed.
Benefits
- Simpler workflow - No git submodule update --init --recursive required
- Automatic handling - CMake downloads yyjson when needed during configuration
- Version explicit - GIT_TAG 0.12.0 (latest release) is clearer than a commit hash
  - Submodules point to opaque commit SHAs - without checking the yyjson repo, you can't tell if it's a release, a random commit, or how old it is
  - FetchContent uses semantic versions - immediately clear you're using the latest stable release
- System package support - Respects system-installed yyjson if available
- Pattern consistency - Aligns with how protobuf and zstd are already handled via find_package()
Important Note
During my review, I noticed that yyjson is only used when both USE_MANIFEST_API=ON and USE_PAX_CATALOG=OFF, which is not the default configuration. This means:
- Default builds (./configure --enable-pax) don't use yyjson at all
- Your reorganization and this alternative have zero impact on normal builds
- The dependency is only relevant for the JSON manifest implementation mode
Testing
To verify this works when yyjson IS needed:
cd contrib/pax_storage/build
cmake .. -DUSE_MANIFEST_API=ON -DUSE_PAX_CATALOG=OFF
# Should see: "yyjson not found in system, fetching from GitHub..."
make
Recommendation
Your PR's organizational improvement is valuable regardless. This FetchContent approach is simply an alternative that:
- Reduces the dependency management burden (no submodule commands)
- Follows patterns used in other Apache projects
- Makes the build more self-contained
Both approaches work - this is just offered as food for thought. Happy to discuss!
Branch for reference: fetchcontent-yyjson
Hi @edespino, thanks for your great idea. There are other submodules needed for PAX building, such as tabulate. I prefer to keep the same behavior across yyjson and the other submodules so that users can download all required submodules with git submodule update --init --recursive once, rather than downloading yyjson separately during the build.
More:
I searched and found that yyjson is available via the EPEL repository on Rocky Linux 9, but it is not available in Ubuntu 22.04 (though it can be found in Ubuntu 25.04). Given that it is not a commonly provided package across popular Linux distros, we can continue using the submodule approach for now.
Would like to have more voices from the core PAX developers on this. cc @gfphoenix78 @jiaqizho @gongxun0928
At least, it's not good to place the yyjson separately at the top of the dir.
Subject: Release Engineering View – Using EPEL vs. Building from Source
When building on Rocky Linux 9, EPEL can be convenient for missing dependencies, but it introduces several trade-offs worth noting:
Key Issues
- Reproducibility: EPEL is rolling; rebuilds can silently pick up new versions. Exact, bit-reproducible releases become hard to guarantee.
- Supply-chain control: Packages are maintained outside ASF governance. Building from source keeps provenance and signatures within our audit trail.
- Licensing clarity: ASF policy expects verifiable source for all distributed code. Relying on EPEL binaries delegates license vetting to Fedora.
- Runtime portability: Binaries linked against EPEL libraries may fail on systems where EPEL isn’t enabled or where ABI flags differ.
- Longevity: EPEL mirrors move forward; old package versions aren’t preserved, complicating long-term rebuilds.
Recommended Approach
- Build third-party libraries from verified upstream source whenever they appear in our release artifacts.
- Use EPEL only for transient developer tools, not for runtime dependencies.
- If EPEL is required in CI, snapshot or mirror it to fix versions.
- Record all external sources and hashes in DEPENDENCIES or release notes.
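A minimal sketch of recording a source archive and its hash is shown below. The helper name record_dependency, the one-line entry format, and the use of sha256sum are all assumptions for illustration; the thread does not define a DEPENDENCIES file format.

```shell
#!/bin/sh
# Append "<name> <url> sha256:<digest>" for a downloaded source archive.
# Hypothetical helper: name, arguments, and entry format are assumptions.
record_dependency() {
    name="$1"
    url="$2"
    archive="$3"
    manifest="${4:-DEPENDENCIES}"
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    digest=$(sha256sum "$archive" | awk '{print $1}')
    # Refuse to write an entry if hashing failed (e.g. missing file).
    [ -n "$digest" ] || return 1
    printf '%s %s sha256:%s\n' "$name" "$url" "$digest" >> "$manifest"
}
```

A later rebuild can recompute the digest of the same archive and compare it against the recorded entry before building.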
Bottom Line
EPEL is fine for development convenience, but from a release-engineering and ASF-compliance standpoint, source builds give stronger reproducibility, auditability, and long-term stability.
> Hi @edespino, thanks for your great idea. There are other submodules needed for PAX building, such as tabulate. I prefer to keep the same behavior across yyjson and the other submodules so that users can download all required submodules with git submodule update --init --recursive once, rather than downloading yyjson separately during the build.
One thing I'm curious about is why the selected submodule commit shas were used. It is not important for the relocation work. But it wasn't obvious at all why the yyjson submodule commit sha was used.
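For reference, one way to see which upstream release a pinned submodule commit corresponds to is git describe. This is only a sketch: the helper name is hypothetical, and it assumes the submodule has been initialized and that upstream tags have been fetched.

```shell
#!/bin/sh
# Print the nearest tag for the commit checked out in a submodule directory.
# Hypothetical helper; assumes the submodule is initialized and tags are present.
describe_submodule_commit() {
    path="$1"
    # --tags: consider lightweight tags too.
    # --always: fall back to the abbreviated SHA if no tag is reachable.
    git -C "$path" describe --tags --always
}
```

For example, `describe_submodule_commit contrib/pax_storage/src/cpp/contrib/yyjson` prints the bare tag name when the commit is exactly a tagged release, and a `<tag>-<n>-g<sha>` form when it is n commits past that tag.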
> Would like to have more voices from the core PAX developers on this. cc @gfphoenix78 @jiaqizho @gongxun0928
> At least, it's not good to place the yyjson separately at the top of the dir.
The reason why dependency/yyjson is separated and stored instead of being placed under PAX is because not only PAX depends on yyjson. Does the core of cloud also rely on yyjson?
@gfphoenix78 do you still have any context about this module?
> Would like to have more voices from the core PAX developers on this. cc @gfphoenix78 @jiaqizho @gongxun0928
> At least, it's not good to place the yyjson separately at the top of the dir.
>
> The reason why dependency/yyjson is separated and stored instead of being placed under PAX is because not only PAX depends on yyjson. Does the core of cloud also rely on yyjson?
Thanks @jiaqizho for your reply. I remember our private conversation on this topic. ❤️
Now, from the open-source side, I believe we can move the yyjson to the correct place. If some downstream vendors refer to it as a dependency, the downstream should have a patch for it or clone it separately.