
Protobuf version mismatch

Open jinsun-yoo opened this issue 6 months ago • 5 comments

Describe the Bug

Python scripts fail with a protobuf version mismatch between the gencode (protoc) and runtime (protobuf) versions.

Steps to Reproduce

docker run -it --rm  -w /tmp ubuntu:22.04 bash

#Within the container
apt -y update; apt -y upgrade
apt -y install coreutils wget vim git make cmake python3.10 python3-pip
pip install --upgrade pip
git clone https://github.com/mlcommons/chakra.git
cd chakra/
pip install .
chakra_generator

#Error message
root@28df32360644:/tmp/chakra# chakra_generator
Traceback (most recent call last):
  File "/usr/local/bin/chakra_generator", line 5, in <module>
    from chakra.src.generator.generator import main
  File "/usr/local/lib/python3.10/dist-packages/chakra/src/generator/generator.py", line 3, in <module>
    from ...schema.protobuf.et_def_pb2 import (
  File "/usr/local/lib/python3.10/dist-packages/chakra/schema/protobuf/et_def_pb2.py", line 12, in <module>
    _runtime_version.ValidateProtobufRuntimeVersion(
  File "/usr/local/lib/python3.10/dist-packages/google/protobuf/runtime_version.py", line 106, in ValidateProtobufRuntimeVersion
    _ReportVersionError(
  File "/usr/local/lib/python3.10/dist-packages/google/protobuf/runtime_version.py", line 50, in _ReportVersionError
    raise VersionError(msg)
google.protobuf.runtime_version.VersionError: Detected mismatched Protobuf Gencode/Runtime major versions when loading et_def.proto: gencode 6.31.0 runtime 5.29.5. Same major version is required. See Protobuf version guarantees at https://protobuf.dev/support/cross-version-runtime-guarantee.

Possible Solutions

The two possible fixes are:

  1. Edit pyproject.toml to accept protobuf==6.*
  2. Find a way to configure setuptools-grpc to use a specific protoc version (this looks very unlikely from my reading of the source code, but it is still an option)

Detailed explanation:

Copied and pasted from https://github.com/astra-sim/astra-sim/pull/314

TL;DR: As long as Chakra uses setuptools-grpc in the build process to compile .proto files, the generated pb2.py file will always be compiled with whatever protoc the grpc repo points to (v6.31.0 as of today), regardless of what is written in pyproject.toml.

  • gencode 6.31.0 refers to the protoc version used to convert et_def.proto into et_def_pb2.py.
  • runtime 5.29.5 refers to the pip protobuf package, which is loaded when running Python scripts such as python gen_chakra_traces.py.
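
A minimal sketch of how the two numbers can be compared programmatically, assuming the error message keeps the "gencode X runtime Y" wording shown in the traceback above. The helper names (parse_mismatch, same_major, installed_protobuf) are hypothetical and are not part of Chakra or protobuf.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# Matches the "gencode X runtime Y" part of the VersionError message.
_MISMATCH = re.compile(r"gencode (\d+\.\d+\.\d+) runtime (\d+\.\d+\.\d+)")


def parse_mismatch(message: str) -> Tuple[str, str]:
    """Return (gencode, runtime) version strings found in the message."""
    match = _MISMATCH.search(message)
    if match is None:
        raise ValueError("no 'gencode X runtime Y' pair found")
    return match.group(1), match.group(2)


def same_major(a: str, b: str) -> bool:
    """True when two dotted version strings share the same major number."""
    return a.split(".")[0] == b.split(".")[0]


def installed_protobuf() -> Optional[str]:
    """Version of the pip-installed protobuf runtime, or None if absent."""
    try:
        return version("protobuf")
    except PackageNotFoundError:
        return None
```

Calling installed_protobuf() inside the container reports the runtime side; the gencode side comes from whichever protoc produced et_def_pb2.py.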

We know that pip install . installs and uses a protobuf 5.x runtime (5.29.5 in the traceback above), because that is the dependency declared in Chakra's pyproject.toml file. So where does 6.31.0 come from?

When running pip install ., Chakra's pyproject.toml declares that the build uses setuptools, specifically a plugin called setuptools-grpc (repo: https://github.com/CZ-NIC/setuptools-grpc). setuptools-grpc imports and uses grpc_tools.protoc. That is, setuptools-grpc uses neither the system-installed nor the pip-installed protobuf/protoc to compile et_def.proto into et_def_pb2.py. This can be verified by running pip install . without installing protoc either system-wide (apt-get -y ...) or through pip: the pb2.py files are created nonetheless.

Then which protoc version does setuptools-grpc use? If you go into the grpc repo, which holds grpc_tools, you can see that it imports the Protocol Buffers source code as a submodule and compiles protoc from source. That is, any code calling grpc_tools.protoc is fixed to whichever protoc version is submoduled in the grpc repo.

Looking into the code, we see that currently (2025-JUN-12), the grpc repo points to commit 3d4adad of the protobuf repo. This commit has a tag of v31.0, which corresponds to v6.31.0 in Python. That's where the gencode 6.31.0 comes from.
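
The bundled version can also be checked locally rather than by reading the grpc repo. The sketch below assumes that python -m grpc_tools.protoc --version forwards the flag to the bundled protoc and prints a line of the form "libprotoc 31.0", as a standalone protoc does. The function names are hypothetical.

```python
import re
import subprocess
import sys
from typing import Optional


def parse_protoc_version(output: str) -> Optional[str]:
    """Extract '31.0' from a line such as 'libprotoc 31.0'."""
    match = re.search(r"libprotoc\s+(\d+(?:\.\d+)+)", output)
    return match.group(1) if match else None


def bundled_protoc_version() -> Optional[str]:
    """Ask the protoc bundled with grpcio-tools for its version.

    Returns None when grpcio-tools is not installed or the output
    does not have the assumed 'libprotoc X.Y' shape.
    """
    result = subprocess.run(
        [sys.executable, "-m", "grpc_tools.protoc", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_protoc_version(result.stdout + result.stderr)
```

Note that this reports the repo-wide version (for example 31.0), not the Python-specific one (6.31.0).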

More notes: Why does v31.0 correspond to v6.31.0 "in Python"?

Ref: https://protobuf.dev/support/version-support/

The protobuf repo has a repo-wide version, which consists only of minor/patch numbers (v31.0, v29.5, etc.). For each repo-wide version, each language may use a different major version. For example, for the repo-wide version v31.0, C++ and Python call it v6.31.0, while Ruby calls it v4.31.0. This is because the same repo-wide update may be a breaking change in one language while having minimal effect in another. That is, C++ v6.31.0 and Ruby v4.31.0 point to the same commit in the protobuf repo.

So, once we know the repo-wide version v31.0, we look up the table in the link above to find the major number for Python.
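
The lookup can be sketched as a small table, shown here only for the versions mentioned in this issue. The dictionary and function names are hypothetical, and any other release should be taken from the version-support table linked above rather than guessed.

```python
# Per-language major numbers for the repo-wide releases named in this issue.
# Keys are (repo-wide version, language); extend only from the official table.
KNOWN_MAJORS = {
    ("31.0", "python"): 6,
    ("31.0", "cpp"): 6,
    ("31.0", "ruby"): 4,
    ("29.5", "python"): 5,
}


def language_version(repo_version: str, language: str) -> str:
    """Map a repo-wide version such as 'v31.0' to a language-specific one."""
    bare = repo_version.lstrip("v")
    key = (bare, language.lower())
    if key not in KNOWN_MAJORS:
        raise KeyError(f"major version for {key} is not in the local table")
    return f"{KNOWN_MAJORS[key]}.{bare}"
```

For example, language_version("v31.0", "python") yields "6.31.0", while the same release for Ruby yields "4.31.0".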

jinsun-yoo avatar Jun 12 '25 06:06 jinsun-yoo

@jinsun-yoo thanks for raising this. Yesterday I struggled a lot with this issue and really did not know whether it was something on my end. As you have highlighted, changing the proto compiler version embedded in setuptools-grpc seems like a nightmare. Ultimately, I started using protobuf 6.31.0 (that is, your proposed solution 1) and did not face any issues.

theodorbadea avatar Jun 13 '25 12:06 theodorbadea

The gencode version is picked up based on the version of Python you are using. Python 3.8 worked fine without any errors for me (3.8.20 to be exact).

sanalcc1 avatar Jun 17 '25 04:06 sanalcc1

I have a question. If I want to build my own Dockerfile, where should I place pip install protobuf==6.31.0? Below is my Dockerfile.

ARG VERSION
FROM rocm/pytorch:${VERSION}

# Update package list and install OpenMPI
RUN apt-get update && apt-get install -y \
    cmake \
    # OpenMPI
    openmpi-bin openmpi-common libopenmpi-dev \
    # Dependencies required by ASTRA-sim
    libprotobuf-dev protobuf-compiler \
    libboost-dev libboost-program-options-dev \
    graphviz

# Set Horovod environment variables
ENV HOROVOD_GPU=ROCM \
    HOROVOD_ROCM_HOME=/opt/rocm \
    HOROVOD_GPU_OPERATIONS=NCCL \
    HOROVOD_WITHOUT_TENSORFLOW=1 \
    HOROVOD_WITH_PYTORCH=1 \
    HOROVOD_WITH_MPI=1 \
    HOROVOD_WITHOUT_MXNET=1

# Install Horovod and other dependencies
RUN git clone --recursive https://github.com/horovod/horovod.git /workspace/horovod
RUN ln -s $ROCM_PATH/lib/cmake/hip/FindHIP* /workspace/horovod/cmake/Modules/
RUN sed -i 's/rccl\.h/rccl\/rccl\.h/' /workspace/horovod/horovod/common/ops/nccl_operations.h
RUN pip install --no-cache-dir /workspace/horovod/.

# Install Python protobuf package
RUN pip install protobuf==5.29.0

# Install ASTRA-sim
ENV ASTRA_SIM="/workspace/astra-sim"
RUN git clone --recursive https://github.com/astra-sim/astra-sim.git ${ASTRA_SIM}
WORKDIR ${ASTRA_SIM}
RUN ./build/astra_analytical/build.sh
ENV ASTRA_SIM_BIN=${ASTRA_SIM}/build/astra_analytical/build/bin/AstraSim_Analytical_Congestion_Aware

# https://github.com/mlcommons/chakra/issues/195
RUN pip install protobuf==6.31.0

# Compile ASTRA-sim's ns3 module
# C++ 20 introduced a new fmt library, which may cause compilation issues with some older ASTRA-sim code.
WORKDIR /workspace/astra-sim/extern/helper/spdlog_setup
RUN sed -i 's/\bformat(/fmt::format(/g' conf.h details/conf_impl.h details/template_impl.h \
    && sed -i 's/fmt::fmt::format(/fmt::format(/g' conf.h details/conf_impl.h details/template_impl.h \
    && echo '#include <spdlog/fmt/fmt.h>' | cat - conf.h > temp && mv temp conf.h \
    && echo '#include <spdlog/fmt/fmt.h>' | cat - details/conf_impl.h > temp && mv temp details/conf_impl.h \
    && echo '#include <spdlog/fmt/fmt.h>' | cat - details/template_impl.h > temp && mv temp details/template_impl.h
WORKDIR ${ASTRA_SIM}
RUN ./build/astra_ns3/build.sh -c

# Install Chakra
RUN pip install --no-cache-dir \
    ${ASTRA_SIM}/extern/graph_frontend/chakra

# Install other Chakra dependencies: Param
# REF https://github.com/mlcommons/chakra/blob/main/USER_GUIDE.md
# Install PARAM (Chakra dependency)
WORKDIR /workspace
RUN git clone https://github.com/facebookresearch/param.git && \
    cd param/et_replay && \
    git checkout 7b19f586dd8b267333114992833a0d7e0d601630 && \
    pip install .

# Install Holistic Trace Analysis (for trace analysis)
WORKDIR /workspace
RUN git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git && \
    cd HolisticTraceAnalysis && \
    git checkout d731cc2e2249976c97129d409a83bd53d93051f6 && \
    git submodule update --init && \
    pip install -r requirements.txt && \
    pip install -e .

# Set working directory
WORKDIR /workspace

I’ve tried installing different versions, but I still get the same error.

jjasoncool avatar Jun 19 '25 07:06 jjasoncool

I found the solution: I moved the RUN pip install protobuf==6.31.0 command to after the Chakra installation in the Dockerfile, and it now works fine.

jjasoncool avatar Jun 19 '25 08:06 jjasoncool

@jinsun-yoo I noticed you've addressed the protobuf version in PR #202. Can this issue be closed?

theodorbadea avatar Sep 26 '25 07:09 theodorbadea