
NxImage.resize memory leak?

Open stocks29 opened this issue 1 year ago • 12 comments

I'm transforming images to run them through an Nx.Serving. One of the transformation steps is a resize, and I was originally using the resize function in this module. When we first put it under heavy load, memory usage shot up quickly and never came back down. Swapping out the NxImage.resize call for StbImage.resize resolved the memory issue, so I suspect there is a memory leak in the resize function of this library.

stocks29 avatar Nov 28 '24 01:11 stocks29

Are you queuing a lot of stuff for inference at once? It's possible the job is falling behind and holding a never-ending queue of tensors.

seanmor5 avatar Nov 28 '24 02:11 seanmor5

Yes, and it is worth mentioning that EXLA can only execute one operation at a time per device. So if you are using a serving, you want to be confident you are batching requests together and that you are not racing it with other Nx operations.

The best use of this library is when you batch it together with other parts of your ML graph, so the resizing is squeezed into your model inside the serving; otherwise StbImage will indeed be better.
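For illustration, a minimal sketch of that approach (not from this thread): a trivial mean-pool stands in for a real model just so the snippet runs on its own. The point is that the resize and the model compile into one EXLA graph inside the serving:

```elixir
# Hedged sketch: fuse NxImage.resize into the serving's computation so resize
# and inference run as a single compiled EXLA graph.
# `model_fn` is a placeholder; a mean-pool stands in for the actual model.
model_fn = fn batch -> Nx.mean(batch, axes: [1, 2]) end

serving =
  Nx.Serving.new(fn opts ->
    Nx.Defn.jit(
      fn batch ->
        batch
        |> NxImage.resize({224, 224})
        |> model_fn.()
      end,
      opts
    )
  end)

batch = Nx.Batch.stack([Nx.iota({496, 950, 3}, type: :u8)])
Nx.Serving.run(serving, batch)
```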

josevalim avatar Nov 28 '24 09:11 josevalim

Hey guys, thanks for the quick responses and the tips!

I'm using an Oban queue that is limited to one concurrent job. Observing the logs I can confirm that only one job is processing at a time.

Sorry for the confusion, by heavy load I meant pushing a lot of images through it vs just testing a handful of images. These images were all still pushed through serially.

I should also mention that memory usage continued to grow while using NxImage.resize. On our cloud provider we went from 350 MB of memory usage to 4 GB (the max on the box). On my local Mac it hit 15 GB before I stopped it.

stocks29 avatar Nov 28 '24 13:11 stocks29

Oh, thank you! So I think there is indeed something going wrong here, but if I had to guess, it would be more on Nx.Serving side. Can you provide an example that allows us to reproduce it? You don't need to use Oban. Perhaps a script that starts the serving and sends the same file to it for resizing over and over again?

josevalim avatar Nov 28 '24 13:11 josevalim

In my use case I was performing the NxImage.resize outside of a serving; the serving was built with Ortex, since I was loading an ONNX model. Given that, I put together a Livebook which mimics that setup. I can see about setting up a serving if that's still desirable; just let me know.
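For reference, a rough sketch of that shape of setup. This is hedged: the model path, target size, and placeholder image are illustrative, and the exact batch container depends on the model's inputs (Ortex's examples wrap each entry in a tuple):

```elixir
# Hedged sketch of the described setup: the resize happens outside the
# serving, and inference runs through an Ortex-backed serving loaded from ONNX.
model = Ortex.load("model.onnx")           # placeholder path
serving = Nx.Serving.new(Ortex.Serving, model)

image = Nx.iota({496, 950, 3}, type: :u8)  # stand-in for a decoded image

input =
  image
  |> NxImage.resize({224, 224})
  |> then(&Nx.Batch.stack([{&1}]))         # tuple-wrapped, per Ortex's examples

Nx.Serving.run(serving, input)
```

In the real app the serving would be started under a supervisor and called with Nx.Serving.batched_run/2 rather than run/2.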

Here's the livebook:

https://gist.github.com/stocks29/4f51df8a1e0dce46505f770faf83fb1d

For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application.

stocks29 avatar Nov 28 '24 17:11 stocks29

@stocks29 to make sure we are on the same page, you just load one image at a time, do NxImage.resize and then call Nx.Serving.batched_run? Are the images of certain sizes, or are the sizes entirely arbitrary?

> For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application.

The notebook is actually using Nx.BinaryBackend; the config should be:

```diff
  config: [
-    config: [nx: [default_backend: EXLA.Backend]]
+    nx: [default_backend: EXLA.Backend]
  ]
```
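That is, the full Mix.install/2 call would look roughly like this (versions taken from the repro further down; adjust as needed):

```elixir
Mix.install(
  [
    {:exla, "~> 0.9.2"},
    {:nx, "~> 0.9.2"},
    {:nx_image, "~> 0.1.2"}
  ],
  config: [nx: [default_backend: EXLA.Backend]]
)
```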

jonatanklosko avatar Nov 29 '24 14:11 jonatanklosko

> On our cloud provider we went from 350 MB of memory usage to 4 GB (the max on the box). On my local Mac it hit 15 GB before I stopped it.

Also, just for more context, did it happen quickly, or over a longer period of time?

jonatanklosko avatar Nov 29 '24 14:11 jonatanklosko

Here's an updated example. I ran it for a while, and the memory usage does go up slowly but reliably. I tried explicit GC and running the resize jitted; in both cases it seems to go up anyway. Interestingly, it doesn't go up every iteration; sometimes it actually takes a while to change (at least as reported by ps).

```elixir
Mix.install([
  {:exla, "~> 0.9.2"},
  {:nx, "~> 0.9.2"},
  {:nx_image, "~> 0.1.2"}
])

Nx.global_default_backend(EXLA.Backend)

defmodule Test do
  def run() do
    # Jitted variant, for comparison:
    # fun = EXLA.jit(&NxImage.resize(&1, {224, 224}))

    tensor = Nx.iota({496, 950, 3}, type: :u8)

    Enum.each(1..1_000_000, fn i ->
      if rem(i, 100) == 0, do: IO.puts("before: #{get_process_memory()} KB")

      NxImage.resize(tensor, {224, 224})
      # fun.(tensor)
      # :erlang.garbage_collect(self())

      if rem(i, 100) == 0, do: IO.puts("after:  #{get_process_memory()} KB")
    end)
  end

  # OS-level resident set size of the whole VM, in KB (as reported by ps).
  defp get_process_memory() do
    pid = System.pid()
    {result, 0} = System.cmd("ps", ~w(-o rss= -p #{pid}))
    result |> String.trim() |> String.to_integer()
  end
end

Test.run()
```

jonatanklosko avatar Nov 29 '24 14:11 jonatanklosko

I managed to observe the same behaviour in JAX, so this may be something in XLA. I opened an issue at https://github.com/jax-ml/jax/issues/25184.

jonatanklosko avatar Nov 29 '24 15:11 jonatanklosko

> @stocks29 to make sure we are on the same page, you just load one image at a time, do NxImage.resize and then call Nx.Serving.batched_run? Are the images of certain sizes, or are the sizes entirely arbitrary?
>
> > For some reason the resize is taking much longer (100+ seconds vs sub-second) in the livebook than it did in my actual application.
>
> The notebook is actually using Nx.BinaryBackend; the config should be:
>
> ```diff
>   config: [
> -    config: [nx: [default_backend: EXLA.Backend]]
> +    nx: [default_backend: EXLA.Backend]
>   ]
> ```

Yes, that is correct: we load one image at a time, do the resize, and then call batched_run. The images are of varying sizes.

Thanks for pointing out my Livebook config issue. I guess I had been staring at the screen too long.

> Also, just for more context, did it happen quickly, or over a longer period of time?

It happened steadily over a long period of time. On average it increased by a few megabytes per image.

stocks29 avatar Nov 29 '24 21:11 stocks29

I would keep this open, to track the upstream issue :)

jonatanklosko avatar Dec 02 '24 14:12 jonatanklosko

Oh, sorry, I didn't mean to close this. I merged a related PR in a private repo and GitHub automatically closed this one.

stocks29 avatar Dec 02 '24 15:12 stocks29