
Unable to build Docker image for TFServing for TPU (`filesystem error: cannot make canonical path` with empty library path)

Open arpitagarwal-meesho opened this issue 2 months ago • 0 comments

Issue Summary

TensorFlow Serving built with TPU support fails to run in Docker containers with the error:

tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95] Opening library: 
terminate called after throwing an instance of 'std::filesystem::filesystem_error'
  what():  filesystem error: cannot make canonical path: Invalid argument []

The TPU library path is empty despite setting TPU_LIBRARY_PATH, LD_LIBRARY_PATH, and LD_PRELOAD environment variables correctly. This makes TPU-based TensorFlow Serving deployments impossible in containerized environments (Docker, Kubernetes).

Environment

  • TensorFlow Serving version: master branch (commit: 5fb3b1fefda9320202da184752a3366fbeddfeac)
  • Platform: Google Cloud TPU VM (v4)
  • Base OS: Ubuntu 22.04
  • Python: 3.10.18
  • Bazel: 7.4.1
  • libtpu.so: Installed via pip install libtpu (version from PyPI)
  • Deployment: Docker container (intended for Kubernetes)

Expected Behavior

TensorFlow Serving should:

  1. Read TPU_LIBRARY_PATH environment variable to locate libtpu.so
  2. Successfully initialize TPU support in containerized environments
  3. Serve models using TPU hardware for inference

Actual Behavior

The initializer at tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95 attempts to open a library with an empty path string, causing an immediate crash. The TPU initialization code appears to bypass the environment variables entirely.

Reproduction Steps

  1. Build a custom TensorFlow Serving image with TPU support (`Dockerfile.devel-tpu`):
FROM ubuntu:22.04 as base_build

ENV DEBIAN_FRONTEND=noninteractive

# Install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    automake build-essential ca-certificates curl git \
    libcurl4-openssl-dev libfreetype6-dev libpng-dev libtool \
    libzmq3-dev openjdk-11-jdk pkg-config python3.10 \
    python3.10-dev python3-pip swig unzip wget zip zlib1g-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip3 install --no-cache-dir --upgrade pip setuptools && \
    pip3 install --no-cache-dir future grpcio h5py keras_applications \
    keras_preprocessing mock numpy portpicker requests

# Install Bazel 7.4.1
ENV BAZEL_VERSION=7.4.1
RUN mkdir /bazel && cd /bazel && \
    curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
    chmod +x bazel-*.sh && ./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
    rm -rf /bazel

# Copy TensorFlow Serving source
WORKDIR /tensorflow-serving
COPY . .

# Patch .bazelrc to disable GCE-specific TPU flags
RUN sed -i 's/^build:tpu --copt=-DLIBTPU_ON_GCE/#build:tpu --copt=-DLIBTPU_ON_GCE/' .bazelrc && \
    echo "build:tpu-docker --define=with_tpu_support=true" >> .bazelrc && \
    echo "build:tpu-docker --define=framework_shared_object=false" >> .bazelrc

# Build with TPU support
RUN bazel build --config=release --config=tpu-docker \
    --verbose_failures \
    tensorflow_serving/model_servers:tensorflow_model_server && \
    cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/bin/

CMD ["/bin/bash"]
  2. Create a runtime image with libtpu.so (`Dockerfile.tpu`):
FROM tensorflow-serving-devel-tpu as build_image
FROM ubuntu:22.04

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates libgomp1 python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Copy TensorFlow Serving binary
COPY --from=build_image /usr/bin/tensorflow_model_server /usr/bin/tensorflow_model_server

# Install libtpu via pip
RUN pip3 install --no-cache-dir libtpu

# Create symlinks to standard locations (fail the build if libtpu.so is missing,
# rather than silently creating a dangling symlink)
RUN LIBTPU_PATH=$(find /usr/local/lib -name "libtpu.so" | head -1) && \
    test -n "$LIBTPU_PATH" && \
    mkdir -p /lib/libtpu /usr/lib && \
    ln -sf "$LIBTPU_PATH" /lib/libtpu/libtpu.so && \
    ln -sf "$LIBTPU_PATH" /usr/lib/libtpu.so

# Set environment variables
ENV TPU_LIBRARY_PATH=/lib/libtpu/libtpu.so
ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/lib/libtpu
ENV LD_PRELOAD=/lib/libtpu/libtpu.so

EXPOSE 8500 8501
CMD ["/usr/bin/tensorflow_model_server"]
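
As an aside, the `find /usr/local/lib` step above assumes a particular pip install prefix. A more robust way to locate the library is to ask Python where the `libtpu` package actually lives; a minimal sketch (the file layout inside the package is an assumption):

```python
# Sketch: locate libtpu.so via the installed package's location instead of
# a hard-coded /usr/local/lib search. The package name "libtpu" matches the
# `pip3 install libtpu` step above; the exact layout inside the package
# directory is an assumption and may vary between releases.
import importlib.util
import pathlib


def find_libtpu(package="libtpu"):
    """Return the path to libtpu.so inside the pip package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    pkg_dir = pathlib.Path(list(spec.submodule_search_locations)[0])
    return next(pkg_dir.rglob("libtpu.so"), None)


if __name__ == "__main__":
    print(find_libtpu())
```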
  3. Build and run the containers:
docker build -t tensorflow-serving-devel-tpu -f Dockerfile.devel-tpu .
docker build -t tensorflow-serving-tpu -f Dockerfile.tpu .

docker run -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow-serving-tpu

Result: Immediate crash with filesystem error.

What Has Been Tried

  1. Environment Variables: Set TPU_LIBRARY_PATH, LD_LIBRARY_PATH, LD_PRELOAD
  2. Multiple Library Locations: Symlinked libtpu.so to /lib/libtpu/, /usr/lib/, /usr/local/lib/
  3. Removed GCE-Specific Flags: Commented out -DLIBTPU_ON_GCE in .bazelrc
  4. Custom Build Configs: Tried --config=tpu-docker and direct --define=with_tpu_support=true
  5. Different libtpu Sources:
  • Copied from TPU VM host (/usr/local/lib/python3.10/dist-packages/libtpu/libtpu.so)
  • Installed via pip install libtpu
  • Attempted torch-xla[tpu] installation
  6. Verified Binary: Confirmed TPU support is compiled in (checked with strings)
  7. Verified Library: Confirmed libtpu.so exists and is accessible (339MB, valid ELF)
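
The "valid ELF" check in item 7 can be scripted; a minimal sketch (the path checked inside the container is the one set up by the Dockerfile above):

```python
# Sketch of the manual verification in item 7: a shared object is a valid
# ELF file iff its first four bytes are the magic b"\x7fELF".
def is_valid_elf(path):
    try:
        with open(path, "rb") as f:
            return f.read(4) == b"\x7fELF"
    except OSError:
        return False


if __name__ == "__main__":
    # Path from the Dockerfile above; adjust for your layout.
    print(is_valid_elf("/lib/libtpu/libtpu.so"))
```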

Root Cause Analysis

The tpu_api_dlsym_initializer.cc code in TensorFlow core appears to have hardcoded logic that:

  1. Only works in Google Cloud Engine (GCE) VM environments
  2. Returns an empty string when not in GCE context
  3. Does NOT properly fall back to checking TPU_LIBRARY_PATH environment variable
  4. Fails during static initialization before environment variables can take effect
  5. The relevant code path (tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95) logs an empty path:

Opening library: 

This suggests GetLibraryPath() returns "" instead of reading it from the environment.
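
For comparison, the fallback this issue is asking for can be sketched as follows. This is a hypothetical illustration in Python, not TensorFlow's actual implementation; the lookup order and default paths are assumptions:

```python
import ctypes
import os

# Hypothetical sketch of the requested fallback: consult TPU_LIBRARY_PATH
# first, then well-known locations, and fail with a clear error instead of
# dlopen-ing an empty string. The default paths below are the ones used in
# the Dockerfile in this report.
_DEFAULT_PATHS = ("/lib/libtpu/libtpu.so", "/usr/lib/libtpu.so")


def load_tpu_library():
    path = os.environ.get("TPU_LIBRARY_PATH", "")
    if not path:
        path = next((p for p in _DEFAULT_PATHS if os.path.exists(p)), "")
    if not path:
        raise FileNotFoundError(
            "libtpu.so not found: set TPU_LIBRARY_PATH or install libtpu")
    return ctypes.CDLL(path)
```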

Workarounds Attempted (All Failed)

  1. Mounting host libtpu.so into container
  2. Using LD_PRELOAD to force library loading
  3. Multiple symlinks in every possible path
  4. Patching .bazelrc to remove GCE-specific compilation flags

Questions

  1. Is TPU support in Docker containers officially supported?
  2. If not, are there plans to add support?
  3. Can tpu_api_dlsym_initializer.cc be updated to properly check TPU_LIBRARY_PATH in non-GCE environments?
  4. Are there internal Google builds/configurations that work in containers?

arpitagarwal-meesho · Oct 09 '25 17:10