
Unable to build Docker image for TFServing for TPU (`filesystem error: cannot make canonical path` with empty library path)

Open arpitagarwal-meesho opened this issue 2 months ago • 0 comments

Issue Summary

TensorFlow Serving built with TPU support fails to run in Docker containers with the error:

tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95] Opening library: 
terminate called after throwing an instance of 'std::filesystem::filesystem_error'
  what():  filesystem error: cannot make canonical path: Invalid argument []

The TPU library path is empty despite setting TPU_LIBRARY_PATH, LD_LIBRARY_PATH, and LD_PRELOAD environment variables correctly. This makes TPU-based TensorFlow Serving deployments impossible in containerized environments (Docker, Kubernetes).

Environment

  • TensorFlow Serving version: master branch (commit: 5fb3b1fefda9320202da184752a3366fbeddfeac)
  • Platform: Google Cloud TPU VM (v4)
  • Base OS: Ubuntu 22.04
  • Python: 3.10.18
  • Bazel: 7.4.1
  • libtpu.so: Installed via pip install libtpu (version from PyPI)
  • Deployment: Docker container (intended for Kubernetes)

Expected Behavior

TensorFlow Serving should:

  1. Read TPU_LIBRARY_PATH environment variable to locate libtpu.so
  2. Successfully initialize TPU support in containerized environments
  3. Serve models using TPU hardware for inference

Actual Behavior

The initializer at tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95 attempts to open a library with an empty path string, causing an immediate crash. The TPU initialization code appears to bypass the environment variables entirely.

Reproduction Steps

  1. Build a custom TensorFlow Serving image with TPU support (`Dockerfile.devel-tpu`):
FROM ubuntu:22.04 as base_build

ENV DEBIAN_FRONTEND=noninteractive

# Install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    automake build-essential ca-certificates curl git \
    libcurl4-openssl-dev libfreetype6-dev libpng-dev libtool \
    libzmq3-dev openjdk-11-jdk pkg-config python3.10 \
    python3.10-dev python3-pip swig unzip wget zip zlib1g-dev \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip3 install --no-cache-dir --upgrade pip setuptools && \
    pip3 install --no-cache-dir future grpcio h5py keras_applications \
    keras_preprocessing mock numpy portpicker requests

# Install Bazel 7.4.1
ENV BAZEL_VERSION=7.4.1
RUN mkdir /bazel && cd /bazel && \
    curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/${BAZEL_VERSION}/bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
    chmod +x bazel-*.sh && ./bazel-${BAZEL_VERSION}-installer-linux-x86_64.sh && \
    rm -rf /bazel

# Copy TensorFlow Serving source
WORKDIR /tensorflow-serving
COPY . .

# Patch .bazelrc to disable GCE-specific TPU flags
RUN sed -i 's/^build:tpu --copt=-DLIBTPU_ON_GCE/#build:tpu --copt=-DLIBTPU_ON_GCE/' .bazelrc && \
    echo "build:tpu-docker --define=with_tpu_support=true" >> .bazelrc && \
    echo "build:tpu-docker --define=framework_shared_object=false" >> .bazelrc

# Build with TPU support
RUN bazel build --config=release --config=tpu-docker \
    --verbose_failures \
    tensorflow_serving/model_servers:tensorflow_model_server && \
    cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/bin/

CMD ["/bin/bash"]
  2. Create a runtime image with libtpu.so (`Dockerfile.tpu`):
FROM tensorflow-serving-devel-tpu as build_image
FROM ubuntu:22.04

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates libgomp1 python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Copy TensorFlow Serving binary
COPY --from=build_image /usr/bin/tensorflow_model_server /usr/bin/tensorflow_model_server

# Install libtpu via pip
RUN pip3 install --no-cache-dir libtpu

# Create symlinks to standard locations (fail the build if libtpu.so is missing,
# rather than silently creating a dangling symlink)
RUN LIBTPU_PATH=$(find /usr/local/lib -name "libtpu.so" | head -1) && \
    test -n "$LIBTPU_PATH" && \
    mkdir -p /lib/libtpu /usr/lib && \
    ln -sf "$LIBTPU_PATH" /lib/libtpu/libtpu.so && \
    ln -sf "$LIBTPU_PATH" /usr/lib/libtpu.so

# Set environment variables
ENV TPU_LIBRARY_PATH=/lib/libtpu/libtpu.so
ENV LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/lib/libtpu
ENV LD_PRELOAD=/lib/libtpu/libtpu.so

EXPOSE 8500 8501
CMD ["/usr/bin/tensorflow_model_server"]
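
As an aside, the `find /usr/local/lib` step above assumes a particular pip install prefix. A more robust way to locate the library is to ask Python where the `libtpu` package actually lives; a minimal sketch (the file layout inside the package is an assumption):

```python
# Sketch: locate libtpu.so via the installed package's location instead of
# a hard-coded /usr/local/lib search. The package name "libtpu" matches the
# `pip3 install libtpu` step above; the exact layout inside the package
# directory is an assumption and may vary between releases.
import importlib.util
import pathlib


def find_libtpu(package="libtpu"):
    """Return the path to libtpu.so inside the pip package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    pkg_dir = pathlib.Path(list(spec.submodule_search_locations)[0])
    return next(pkg_dir.rglob("libtpu.so"), None)


if __name__ == "__main__":
    print(find_libtpu())
```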
  3. Build and run the containers:
docker build -t tensorflow-serving-devel-tpu -f Dockerfile.devel-tpu .
docker build -t tensorflow-serving-tpu -f Dockerfile.tpu .

docker run -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/model,target=/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow-serving-tpu

Result: Immediate crash with filesystem error.

What Has Been Tried

  1. Environment Variables: Set TPU_LIBRARY_PATH, LD_LIBRARY_PATH, LD_PRELOAD
  2. Multiple Library Locations: Symlinked libtpu.so to /lib/libtpu/, /usr/lib/, /usr/local/lib/
  3. Removed GCE-Specific Flags: Commented out -DLIBTPU_ON_GCE in .bazelrc
  4. Custom Build Configs: Tried --config=tpu-docker and direct --define=with_tpu_support=true
  5. Different libtpu Sources:
  • Copied from TPU VM host (/usr/local/lib/python3.10/dist-packages/libtpu/libtpu.so)
  • Installed via pip install libtpu
  • Attempted torch-xla[tpu] installation
  6. Verified Binary: Confirmed TPU support is compiled in (checked with strings)
  7. Verified Library: Confirmed libtpu.so exists and is accessible (339MB, valid ELF)
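
The "valid ELF" check in item 7 can be scripted; a minimal sketch (the path checked inside the container is the one set up by the Dockerfile above):

```python
# Sketch of the manual verification in item 7: a shared object is a valid
# ELF file iff its first four bytes are the magic b"\x7fELF".
def is_valid_elf(path):
    try:
        with open(path, "rb") as f:
            return f.read(4) == b"\x7fELF"
    except OSError:
        return False


if __name__ == "__main__":
    # Path from the Dockerfile above; adjust for your layout.
    print(is_valid_elf("/lib/libtpu/libtpu.so"))
```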

Root Cause Analysis

The tpu_api_dlsym_initializer.cc code in TensorFlow core appears to have hardcoded logic that:

  1. Only works in Google Cloud Engine (GCE) VM environments
  2. Returns an empty string when not in GCE context
  3. Does NOT properly fall back to checking TPU_LIBRARY_PATH environment variable
  4. Fails during static initialization before environment variables can take effect
  5. The relevant code path (tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95) logs an empty path:

Opening library: 

This suggests GetLibraryPath() returns "" instead of reading it from the environment.
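
For comparison, the fallback this issue is asking for can be sketched as follows. This is a hypothetical illustration in Python, not TensorFlow's actual implementation; the lookup order and default paths are assumptions:

```python
import ctypes
import os

# Hypothetical sketch of the requested fallback: consult TPU_LIBRARY_PATH
# first, then well-known locations, and fail with a clear error instead of
# dlopen-ing an empty string. The default paths below are the ones used in
# the Dockerfile in this report.
_DEFAULT_PATHS = ("/lib/libtpu/libtpu.so", "/usr/lib/libtpu.so")


def load_tpu_library():
    path = os.environ.get("TPU_LIBRARY_PATH", "")
    if not path:
        path = next((p for p in _DEFAULT_PATHS if os.path.exists(p)), "")
    if not path:
        raise FileNotFoundError(
            "libtpu.so not found: set TPU_LIBRARY_PATH or install libtpu")
    return ctypes.CDLL(path)
```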

Workarounds Attempted (All Failed)

  1. Mounting host libtpu.so into container
  2. Using LD_PRELOAD to force library loading
  3. Multiple symlinks in every possible path
  4. Patching .bazelrc to remove GCE-specific compilation flags

Questions

  1. Is TPU support in Docker containers officially supported?
  2. If not, are there plans to add support?
  3. Can tpu_api_dlsym_initializer.cc be updated to properly check TPU_LIBRARY_PATH in non-GCE environments?
  4. Are there internal Google builds/configurations that work in containers?

arpitagarwal-meesho · Oct 09 '25 17:10