Strange memory issue raster tile rendering

Open MichielMortier opened this issue 8 months ago • 15 comments

Memory leak when serving raster tiles

Description

When using tileserver-gl (v5.3.1) for serving raster tiles, there appears to be a memory leak. The server's memory usage steadily increases when loading raster tiles, unlike with vector tiles where memory usage remains stable.

Observed Behavior

  1. When using vector tiles, memory usage is stable regardless of the number of requests
  2. When switching to raster tiles (by changing vector to raster in the URL), memory usage spikes with each new request
  3. Memory increases when viewing different areas/zoom levels that require loading new tiles
  4. Memory appears to stabilize after most tiles have been loaded, but is never released
  5. When requesting tiles from a new data source, memory usage spikes again

Evidence

Memory usage graph

Environment

  • tileserver-gl version: v5.3.1
  • Running in Docker: maptiler/tileserver-gl:v5.3.1
  • Command used: docker run --rm -it -v "$SCRIPT_DIR/tileserver-gl:/data" -p 8080:8080 maptiler/tileserver-gl:v5.3.1

Configuration

tileserver-gl config.json:

{
  "options": {
    "paths": {
      "fonts": "fonts",
      "styles": "styles"
    }
  },
  "styles": {
    "tiles-4": {
      "style": "tiles-4/style.json"
    },
    "tiles-5": {
      "style": "tiles-5/style.json"
    }
  },
  "data": {
  }
}

Example style.json:

{
  "version": 8,
  "name": "Style",
  "sources": {
    "tiles": {
      "type": "vector",
      "tiles": ["https://XXX/{z}/{x}/{y}.pbf"],
      "maxzoom": 14,
      "minzoom": 5
    }
  },
  "sprite": "https://XXX/sprites/osm-bright/icons",
  "glyphs": "https://XXX/fonts/{fontstack}/{range}.pbf",
  "layers": [
    {
      "id": "roads-casing",
      "type": "line",
      "source": "tiles",
      "source-layer": "default",
      "filter": [
        "all", 
        ["has", "testValue"]
      ],
      "layout": {
        "line-cap": "round",
        "line-join": "round",
        "visibility": "visible"
      },
      "paint": {
        "line-color": "#000000",
        "line-width": [
          "interpolate", ["linear"], ["zoom"],
          5, 4,
          8, 5,
          11, 5.5,
          14, 6,
          18, 7
        ],
        "line-opacity": 1
      }
    },
    {
      "id": "roads-fill",
      "type": "line",
      "source": "tiles",
      "source-layer": "default",
      "filter": [
        "all", 
        ["has", "testValue"]
      ],
      "layout": {
        "line-cap": "round",
        "line-join": "round",
        "visibility": "visible"
      },
      "paint": {
        "line-color": [
          "step",
          ["get", "testValue"],
          "#660000", 0,
          "#FF7700", 50,
          "#FFFF00", 75,
          "#00FF00", 100,
          "#0099FF", 150,
          "#00FF00"
        ],
        "line-width": [
          "interpolate", ["linear"], ["zoom"],
          5, 2,
          8, 3,
          11, 3.5,
          14, 4,
          18, 5
        ],
        "line-opacity": 1
      }
    }
  ]
}

Steps to Reproduce

  1. Set up tileserver-gl with the above configuration
  2. Load vector tiles for a while - observe stable memory usage
  3. Switch to raster tiles (changing vector to raster in the URL)
  4. Pan/zoom around the map to load different tiles - observe memory increasing
  5. Switch to a different style/data source - observe another memory spike
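
The reproduction steps above can also be scripted, so that memory growth is compared against a known number of tile requests. The sketch below is illustrative only: the `/styles/{id}/{z}/{x}/{y}.png` path is an assumption about the raster endpoint, and the helper names (`lonLatToTile`, `rasterTileUrls`, `requestAll`) and the bounding box are hypothetical. It assumes Node 18 or later for the global `fetch`.

```javascript
// Convert longitude/latitude in degrees to XYZ tile indices (Web Mercator).
function lonLatToTile(lon, lat, z) {
  const n = 2 ** z;
  const latRad = (lat * Math.PI) / 180;
  const x = Math.floor(((lon + 180) / 360) * n);
  const y = Math.floor(
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
  );
  // Clamp to the valid index range for this zoom level.
  const clamp = (v) => Math.min(n - 1, Math.max(0, v));
  return { x: clamp(x), y: clamp(y) };
}

// Build one raster tile URL per tile covering bbox = [west, south, east, north].
// ASSUMPTION: the URL pattern below; adjust it to match your server.
function rasterTileUrls(baseUrl, styleId, z, bbox) {
  const [west, south, east, north] = bbox;
  const topLeft = lonLatToTile(west, north, z);
  const bottomRight = lonLatToTile(east, south, z);
  const urls = [];
  for (let x = topLeft.x; x <= bottomRight.x; x++) {
    for (let y = topLeft.y; y <= bottomRight.y; y++) {
      urls.push(`${baseUrl}/styles/${styleId}/${z}/${x}/${y}.png`);
    }
  }
  return urls;
}

// Request the tiles one at a time so memory growth is easy to follow.
async function requestAll(urls) {
  for (const url of urls) {
    const res = await fetch(url);
    await res.arrayBuffer(); // read the whole body before the next request
  }
}
```

For example, `requestAll(rasterTileUrls("http://localhost:8080", "tiles-4", 8, [2.5, 50.7, 5.9, 51.5]))` requests every zoom 8 tile in that (hypothetical) box while you watch the container's memory in `docker stats`.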

Expected Behavior

Memory usage should remain stable for raster tiles just as it does for vector tiles, or the memory should be released after tiles are no longer needed.

MichielMortier avatar May 16 '25 09:05 MichielMortier

Just noticed the memory issue is not present when running it on mac when using npm install tileserver-gl, but I have it when running the docker image locally.

MichielMortier avatar May 16 '25 12:05 MichielMortier

Does this happen in 5.3.0? In 5.3.1 the only real difference for raster tiles is that maplibre-native was updated to a new version. That new version, maplibre-native v6.1.0, removed the legacy renderer and switched to the drawable renderer, so I wonder if it could be related to that. For example, I saw this issue on performance: https://github.com/maplibre/maplibre-native/issues/3438

What kind of data files are you using? Is that just an HTTPS XYZ endpoint, not a local tile source? That would mostly be handled by this code: https://github.com/maptiler/tileserver-gl/blob/master/src/serve_rendered.js#L1170-L1200

What was the last version this did not happen?

acalcutt avatar May 16 '25 12:05 acalcutt

Hey @acalcutt

First I was using v4.11.1, but that version had the same issue. I never rendered raster tiles before, so I have no idea in which version this issue was introduced. I indeed just use an XYZ link from another tile server. Is there something in my config that I forgot to add? Is it something you can reproduce locally when using the Dockerfile? Or are there some stats I can provide to you? When I run 'top' I see all the memory is allocated by the node process. When I try to take heap dumps, they only contain about 23 MB of it. So I really have no idea what the issue could be...
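
A small heap dump next to a large figure in 'top' is expected when the memory sits outside the JavaScript heap: heap snapshots only cover V8's heap, not memory allocated by native addons. A minimal sketch for comparing the two from inside a Node process follows; `memoryBreakdown` is a hypothetical helper, and running it inside tileserver-gl itself would require a locally patched copy.

```javascript
// Summarise where a Node.js process holds its memory, in MiB.
// `usage` defaults to the current process; pass an object to format saved numbers.
function memoryBreakdown(usage = process.memoryUsage()) {
  const toMiB = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return {
    rssMiB: toMiB(usage.rss), // what 'top' reports as resident memory
    heapUsedMiB: toMiB(usage.heapUsed), // roughly what a heap dump can show
    externalMiB: toMiB(usage.external), // C++ objects bound to JS objects
    // Rough estimate of everything outside the V8 heap (native allocations,
    // allocator overhead, code, thread stacks).
    outsideHeapMiB: toMiB(usage.rss - usage.heapTotal),
  };
}

console.log(memoryBreakdown());
```

If `outsideHeapMiB` keeps growing while `heapUsedMiB` stays flat, the growth is in native memory, and JavaScript heap tooling will not show it.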

MichielMortier avatar May 16 '25 13:05 MichielMortier

I've been using the Docker version for a while (both for vector and raster) and I have had, and still have, plenty of crashes, but I have no evidence to identify what is going wrong.

I ended up using the command below to run the container:

docker run --name tileserver-gl -d -m 16GB --memory-reservation 10GB --cpus=7 --ulimit nofile=65536:65536 --restart=always -e NODE_ENV=production -e NODE_OPTIONS="--max-old-space-size=16384" -it -v /var/data:/data -p 8080:8080 maptiler/tileserver-gl

with the restart=always option.

Now the container fails from time to time but it restarts automatically.

I have not found anything in the container logs that explains the crash... (I would be interested to know where to look...)

utagawal avatar May 16 '25 13:05 utagawal

I've been looking into this the whole time. Could this be related to https://github.com/lovell/sharp/issues/955? I've also been trying to change the Dockerfile to an Alpine version, but that is not working. Any solutions? Or is this unrelated?

MichielMortier avatar May 16 '25 15:05 MichielMortier

> Now the container fails from time to time but it restarts automatically.
>
> I have not found anything in the container logs that explains the crash... (I would be interested to know where to look...)

If it is memory related, you can check the journal for OOM killer messages (something like ... kernel: [2269670.312226] xxxx invoked oom-killer: ...).

> Just noticed the memory issue is not present when running it on mac when using npm install tileserver-gl, but I have it when running the docker image locally.

This may hint at system dependencies. I'm not sure, because I have no experience with Apple hardware, but which architecture does your system have? It may then be using a different Docker image (arm vs x86?).

PS: I will re-check my system in the next few days (version, uptime and memory usage), but AFAIK there were no memory issues or unexpected restarts.

okimiko avatar May 18 '25 19:05 okimiko

Hey @okimiko

My Mac has an arm64 architecture.

> I've been looking into this the whole time. Could this be related to https://github.com/lovell/sharp/issues/955? I've also been trying to change the Dockerfile to an Alpine version, but that is not working. Any solutions? Or is this unrelated?

The issue I posted here could explain the difference between the architectures. Debian uses another memory allocator, which makes the reported RSS high, while the memory is simply not being released in the right way. I think we could switch to jemalloc, and that might fix the issue?

MichielMortier avatar May 19 '25 06:05 MichielMortier

At least I can confirm the issue in version 5.2.0-pre.0 (docker, x86 based):

  • The process had been running for 8 weeks with ~13.8 GB resident memory
  • After a restart the service starts with ~4 GB
  • I just queried some raster data and the memory consumption increased (quite fast, in fact)

@MichielMortier: ~I will test the jemalloc integration in local setup.~ I tested the jemalloc integration (here) but have not seen any improvement. Maybe I have forgotten something or we need another approach.

okimiko avatar May 19 '25 19:05 okimiko

@okimiko Did you get any errors saying that it failed to load? I got it working somehow, and I see the memory decreasing again after the load was applied. But I still see a difference in memory between before and after the requests. My Dockerfile looks like this (as I am building on a Mac, I cannot just use /usr/lib/x86_64-linux-gnu/libjemalloc.so.2):

FROM ubuntu:jammy AS builder

ENV NODE_ENV="production"

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

RUN export DEBIAN_FRONTEND=noninteractive && \
    apt-get update && \
    apt-get install -y --no-install-recommends --no-install-suggests \
      build-essential \
      ca-certificates \
      curl \
      gnupg \
      pkg-config \
      xvfb \
      libglfw3-dev \
      libuv1-dev \
      libjpeg-turbo8 \
      libicu70 \
      libcairo2-dev \
      libpango1.0-dev \
      libjpeg-dev \
      libgif-dev \
      librsvg2-dev \
      gir1.2-rsvg-2.0 \
      librsvg2-2 \
      librsvg2-common \
      libcurl4-openssl-dev \
      libpixman-1-dev \
      libpixman-1-0 \
      libjemalloc2 \
      libjemalloc-dev && \
    mkdir -p /etc/apt/keyrings && \
    curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg && \
    echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_22.x nodistro main" | tee /etc/apt/sources.list.d/nodesource.list && \
    apt-get -qq update && \
    apt-get install -y --no-install-recommends --no-install-suggests nodejs && \
    npm i -g npm@latest && \
    apt-get -y remove curl gnupg && \
    apt-get -y --purge autoremove && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN mkdir -p /usr/src/app

WORKDIR /usr/src/app

COPY package.json /usr/src/app
COPY package-lock.json /usr/src/app

RUN npm config set maxsockets 1 && \
    npm config set fetch-retries 5 && \
    npm config set fetch-retry-mintimeout 100000 && \
    npm config set fetch-retry-maxtimeout 600000 && \
    npm ci --omit=dev && \
    chown -R root:root /usr/src/app

FROM ubuntu:jammy AS final

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

RUN export DEBIAN_FRONTEND=noninteractive && \
    groupadd -r node && \
    useradd -r -g node node && \
    apt-get -qq update && \
    apt-get install --force-yes -yy --no-install-recommends --no-install-suggests \
      ca-certificates \
      curl \
      gnupg \
      xvfb \
      libglfw3 \
      libuv1 \
      libjpeg-turbo8 \
      libicu70 \
      libcairo2 \
      libgif7 \
      libopengl0 \
      libpixman-1-0 \
      libcurl4 \
      librsvg2-2 \
      libpango-1.0-0 \
      libjemalloc2 && \
    mkdir -p /etc/apt/keyrings && \
    curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg && \
    echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_22.x nodistro main" | tee /etc/apt/sources.list.d/nodesource.list && \
    apt-get -qq update && \
    apt-get install -y --no-install-recommends --no-install-suggests nodejs && \
    npm i -g npm@latest && \
    # Create appropriate symlinks if needed
    ln -sf $(find /usr -name "libjemalloc.so*" | head -n 1) /usr/lib/libjemalloc.so && \
    apt-get -y remove curl gnupg && \
    apt-get -y --purge autoremove && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set environment variables after we've confirmed the library path
ENV \
    NODE_ENV="production" \
    CHOKIDAR_USEPOLLING=1 \
    CHOKIDAR_INTERVAL=500 \
    LD_PRELOAD="/usr/lib/libjemalloc.so" \
    MALLOC_CONF="background_thread:true,metadata_thp:auto,dirty_decay_ms:5000,muzzy_decay_ms:5000"

COPY --from=builder /usr/src/app /usr/src/app

COPY . /usr/src/app

RUN mkdir -p /data && chown node:node /data
VOLUME /data
WORKDIR /data

EXPOSE 8080

USER node:node

ENTRYPOINT ["/usr/src/app/docker-entrypoint.sh"]

HEALTHCHECK CMD node /usr/src/app/src/healthcheck.js
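
One way to confirm that the LD_PRELOAD setting in a Dockerfile like the one above took effect is to look for the library in the process's memory map. This is a sketch, assuming a Linux container where /proc is available; `mapsMentionJemalloc` and `jemallocLoadedIn` are hypothetical helper names, and the sample line is made up for illustration.

```javascript
// Return true when the text of a /proc/<pid>/maps file lists a jemalloc library.
function mapsMentionJemalloc(mapsText) {
  return mapsText.split("\n").some((line) => line.includes("libjemalloc"));
}

// Read the memory map of a process (Linux only) and check it.
// pid defaults to "self", i.e. the Node process running this function.
function jemallocLoadedIn(pid = "self") {
  const fs = require("node:fs");
  return mapsMentionJemalloc(fs.readFileSync(`/proc/${pid}/maps`, "utf8"));
}

// Illustrative input only: one line shaped like a /proc/<pid>/maps entry.
const sampleMaps =
  "7f1c2a000000-7f1c2a021000 r--p 00000000 08:01 123 /usr/lib/x86_64-linux-gnu/libjemalloc.so.2";
console.log(mapsMentionJemalloc(sampleMaps)); // true for this sample
```

Inside the running container, calling `jemallocLoadedIn(pid)` with the PID of the tileserver's node process checks the server itself rather than the checking script.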

MichielMortier avatar May 20 '25 11:05 MichielMortier

Some caching is normal. For example, if you were using pmtiles, there is a tile cache of 100 tiles for each data source by default, so if you browse around a map it will build up that cache. I am guessing mbtiles/sqlite does something similar, but I am not sure about that. I wouldn't really expect it with an HTTP endpoint though.
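
To illustrate why a capped cache shows up as memory that grows and then levels off, here is a minimal sketch of a least-recently-used cache with a fixed number of entries. It is not the pmtiles library's implementation; `BoundedCache` and its methods are hypothetical names, and only the default cap of 100 comes from the comment above.

```javascript
// A least-recently-used cache with a fixed number of entries.
// A Map iterates in insertion order, so the first key is always the oldest.
class BoundedCache {
  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}

// Memory held by the cache grows for the first 100 tiles, then stays flat.
const cache = new BoundedCache(100);
for (let i = 0; i < 250; i++) cache.set(`tile-${i}`, new Uint8Array(1024));
console.log(cache.size);
```

Such a cache never shrinks on its own, which matches memory that stabilizes but is not released.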

I don't use Docker here; I just have an Ubuntu 22.04 virtual server with npm. When I test there, the memory seems to build up when browsing the map, but seems to eventually lower. That was with limited testing though. I will say I have not seen many crashes, though as part of my project the service does get restarted every night. I also probably don't use rasters to the extent some of you do, but I am in there testing things a lot to make sure stuff works.

acalcutt avatar May 20 '25 16:05 acalcutt

> @okimiko Did you get any errors saying that it failed to load? I got it working somehow, and I see the memory decreasing again after the load was applied. But I still see a difference in memory between before and after the requests. My Dockerfile looks like this (as I am building on a Mac, I cannot just use /usr/lib/x86_64-linux-gnu/libjemalloc.so.2):

No, I didn't get any error output. I'm not sure, maybe I was just too impatient (with the default settings). I ran the same image again today and after I waited longer, the memory started to decrease /o\

@acalcutt: Maybe the caching, and the memory size of the cache, depends on the detail displayed on the tile itself?

Nevertheless, I applied your changes (here) because of the more generic integration and the more "aggressive" values of MALLOC_CONF, and it worked too, or even better (faster). @MichielMortier do you want to create a PR?

okimiko avatar May 20 '25 20:05 okimiko

@okimiko I've created a pull request with the changes.

@acalcutt Is there a way to disable this caching, or can you show me the place where it is done? The tiles I'm serving change every minute, so I want the tileserver's idle memory to be as low as possible when it is not in use. Currently I am fetching my tiles from a tile URL, so I don't see where the caching could be done.

With jemalloc, I see that the memory is freed, which is nice, but it would be better if we could go back to the level at restart (100 MiB) instead of keeping 400 MiB.

MichielMortier avatar May 21 '25 07:05 MichielMortier

For pmtiles, caching is a default of the pmtiles library. I think it gets set at https://github.com/protomaps/PMTiles/blob/main/js/src/index.ts#L761 , not something we are setting right now (though that may be possible to add). The cache is on the pmtiles data sources and not the rendered tiles.

I mainly know about it because the pmtiles object gets put into the data array, and I ran into issues when that cache built up, making the index page take a long time to load. I had to switch that to a shallow copy so those cache values didn't get copied when loading the index page.

acalcutt avatar May 21 '25 12:05 acalcutt

After some profiling I found that the maplibre-gl-native object does keep some memory, even when not used. It gets cleared when the acquired pool object is destroyed (and so is the renderer). When keeping minRendererPoolSizes at [1,1,1], there is always some maplibre object that holds some memory. That's why I get the difference of 300 MiB. Setting it to [0,0,0] makes sure all memory is released.
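
As a sketch of that setting, the snippet below adds minRendererPoolSizes to a config object shaped like the config.json from the issue description and prints the result. Placing the key under "options" is an assumption taken from the tileserver-gl configuration docs, so check it against your version; `withMinRendererPoolSizes` is a hypothetical helper.

```javascript
// Return a copy of a tileserver-gl config with options.minRendererPoolSizes set.
// ASSUMPTION: the option lives under "options", as in the tileserver-gl docs.
function withMinRendererPoolSizes(config, minSizes) {
  return {
    ...config,
    options: { ...(config.options ?? {}), minRendererPoolSizes: minSizes },
  };
}

// Same shape as the config.json in the issue description.
const originalConfig = {
  options: { paths: { fonts: "fonts", styles: "styles" } },
  styles: {
    "tiles-4": { style: "tiles-4/style.json" },
    "tiles-5": { style: "tiles-5/style.json" },
  },
  data: {},
};

// [0, 0, 0] keeps no idle renderer per scale factor, so nothing is held while idle.
const patchedConfig = withMinRendererPoolSizes(originalConfig, [0, 0, 0]);
console.log(JSON.stringify(patchedConfig, null, 2));
```

The likely trade-off is that, with no idle renderers kept, a renderer has to be created again for the first raster request after an idle period.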

MichielMortier avatar May 21 '25 14:05 MichielMortier

At least that matches the (general) behavior described in the docs: https://tileserver.readthedocs.io/en/latest/config.html#minrendererpoolsizes :)

okimiko avatar May 21 '25 17:05 okimiko