
Potential memory leak during inference?

Open olinesn opened this issue 11 months ago • 14 comments

Bug description

When running inference, the GPU starts using system memory (as seen in the task manager) until there's none left, and then inference crashes. You can see that the GPU's "Shared GPU memory usage" climbs, but the on-GPU memory is hardly used at all.

Expected behaviour

Inference completes without issue, the way that it does for shorter videos.

Actual behaviour

[Screenshot: Task Manager showing shared GPU memory climbing while dedicated GPU memory stays low]

Your personal set up

Threadripper PRO (24 cores), 128 GB RAM, NVIDIA RTX 6000

Environment packages

```
# paste output of `pip freeze` or `conda list` here
```

Logs

```
# paste relevant logs here, if any
```

[Screenshot: terminal log output]

Screenshots

[Screenshot]

olinesn avatar Jan 06 '25 00:01 olinesn

Oh wow, that is a lot of GPU memory.

With your sleap environment activated, can you share the output of the command nvidia-smi? I'd like to see your GPU setup.

Thanks!

Elizabeth

eberrigan avatar Jan 06 '25 18:01 eberrigan

> Oh wow, that is a lot of GPU memory.
>
> With your sleap environment activated, can you share the output of the command nvidia-smi? I'd like to see your GPU setup.
>
> Thanks!
>
> Elizabeth

[Screenshot: nvidia-smi output]

Sure, here you go. Just to clarify, it doesn't look like the RTX 6000's 48 GB of GPU memory is being used; rather, it looks like this is getting swapped to system memory.

olinesn avatar Jan 06 '25 19:01 olinesn

With the first screenshot you sent, it looks like it is trying to use the CPU. In the second screenshot I can partially see that it is using GPU 0, which is the one we want to use.

Do you mind just copying and pasting the entire command and output from the terminal instead of the screenshots?

Thanks!

eberrigan avatar Jan 06 '25 21:01 eberrigan

Hi @eberrigan , thanks for helping me work on this puzzle.

Here's a zipped file with the models (centroid and centered instance), a demo video to run inference on, and my logs when I run sleap-track.

You can see that even for a short video, when I run sleap-track, the GPU's dedicated memory is relatively unused, but the GPU swap memory keeps increasing until the rig is out of RAM.

At that point, in the logs, you can see that something changes (it stalls at 57% completed, when all 128 GB of RAM get saturated), and then something gets adjusted and it completes. Also, it gets stuck at 100% with the green bar for several minutes, with all 24 of my cores running at 50%; I'm not sure what that means...

Thanks!

Zipped file: https://drive.google.com/file/d/1NpfDJHKSh9Sv_ycMrNLf5lpOCn9giwQI/view?usp=sharing

[Screenshot: memory usage while running sleap-track]

olinesn avatar Jan 08 '25 22:01 olinesn

[Screenshot]

olinesn avatar Jan 08 '25 22:01 olinesn

Hey @olinesn, is this a single-animal experiment? I noticed you have --tracking.clean_instance_count set to 1.

If that is the case, let's just go ahead and do inference without tracking. You can set -n MAX_INSTANCES, --max_instances MAX_INSTANCES, which "Limit maximum number of instances in multi-instance models. Not available for ID models. Defaults to None.", to 1 in order to run the top-down model with the number of animals fixed at 1. Since you do not have more than one animal, there is no need to do tracking, and constraining the instance count should improve the model predictions.

https://sleap.ai/develop/guides/cli.html
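As a rough sketch of what that could look like (the video and model paths here are hypothetical placeholders; substitute your own):

```bash
# Hypothetical paths. Top-down inference capped at one instance;
# with no tracking flags given, no tracker is run.
sleap-track demo_video.mp4 \
    --max_instances 1 \
    -m models/centroid_model \
    -m models/centered_instance_model \
    -o demo_video.predictions.slp
```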

eberrigan avatar Jan 13 '25 18:01 eberrigan

Hi @eberrigan ,

OK, this seems to be running without complaint. It's strange that tracking would be causing this problem. Thanks for the suggestion!

A fraction of this dataset is two animals, so I'm going to need to do tracking eventually. Is there a better way that I can set up the flow tracking? Does this behavior indicate a memory leak issue in sleap?

Thanks, Stefan

olinesn avatar Jan 13 '25 22:01 olinesn

You should be able to do tracking, but there isn't a reason to when there is only one animal.

I believe the issue is that inference doesn't know how many animals are in the new data, so the shape of the tensor changes between calls, causing retracing (https://github.com/tensorflow/tensorflow/issues/34025). This might not be an issue with the bottom-up method, which doesn't rely on the centroid model.

If you have some data with different numbers of animals you might get better results running inference separately and specifying the number of animals per dataset using -n. Am I understanding that correctly, or are you saying you have data where the animals go off-camera and then come back?

eberrigan avatar Jan 13 '25 23:01 eberrigan

Thanks, OK, that's helpful to understand. Sometimes I place one mouse in the box, and sometimes I place two.

Could you advise on some sample syntax for the sleap-track command when there are two animals in the box? The logic of the tensor changing size makes sense, but I want to make sure I nail the syntax the way you're recommending.

olinesn avatar Jan 13 '25 23:01 olinesn

I see. Can you separate the videos so that when there is only one animal you run inference with -n set to 1 on that video, and when there are two animals you run inference with -n set to 2? This should help the model know when there are one or two animals and eliminate any shape discrepancies. If you cannot do that, then setting the max instances to 2 in all cases should suffice.
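A sketch of the two invocations (filenames and model paths are hypothetical placeholders):

```bash
# Video with a single animal: cap instances at 1.
sleap-track single_mouse.mp4 -n 1 \
    -m models/centroid_model -m models/centered_instance_model

# Video with two animals: cap instances at 2.
sleap-track two_mice.mp4 -n 2 \
    -m models/centroid_model -m models/centered_instance_model
```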

It will also improve tracking a lot: if everything is one video and tracking is run, a new track will be made whenever an animal reappears, so if you are swapping animals or removing and replacing them, you may end up with a lot of tracks by the end of the video. Do you have some sort of pipeline for dealing with that?

eberrigan avatar Jan 14 '25 00:01 eberrigan

Sorry I got slammed at the end of last week. Thanks for your thoughts.

> I see. Can you separate the videos so that when there is only one animal you run inference with -n set to 1 on that video, and when there are two animals you run inference with -n set to 2? This should help the model know when there are one or two animals and eliminate any shape discrepancies. If you cannot do that, then setting the max instances to 2 in all cases should suffice.

Yes, it's perfectly doable for me to pre-determine the number of animals for the majority of these experiments. What would a reasonable sleap-track command look like? I just want to make sure I'm interpreting this correctly:

> You can set -n MAX_INSTANCES, --max_instances MAX_INSTANCES, which "Limit maximum number of instances in multi-instance models. Not available for ID models. Defaults to None."

are "-n" and "--max_instances" synonymous, or do you have to use both to get this effect? I've never used "-n."

> It will also improve tracking a lot: if everything is one video and tracking is run, a new track will be made whenever an animal reappears, so if you are swapping animals or removing and replacing them, you may end up with a lot of tracks by the end of the video. Do you have some sort of pipeline for dealing with that?

We actually have to think about this a lot because of reflections. Sometimes, if there are 3 animals plus reflections, one of the reflections can get picked up as an instance and occasionally has a higher score than the 3 real animals. If we set max instances to 3, we then drop a real animal's instance on that frame, so usually we set it to n+1 or n+2.
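To make the headroom idea concrete, a hypothetical sketch (filenames and model paths are placeholders):

```bash
# Three real animals, but reflections occasionally out-score a real instance,
# so leave headroom at n + 1 and filter out spurious instances afterwards.
sleap-track three_mice.mp4 --max_instances 4 \
    -m models/centroid_model -m models/centered_instance_model
```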

olinesn avatar Jan 21 '25 17:01 olinesn

Hi @olinesn,

Yes, -n and --max_instances are synonymous; the latter is just more verbose and possibly more readable for others.
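For example, these two hypothetical invocations behave identically (video and model paths are placeholders):

```bash
# Short form
sleap-track two_mice.mp4 -n 2 \
    -m models/centroid_model -m models/centered_instance_model

# Long form, same effect
sleap-track two_mice.mp4 --max_instances 2 \
    -m models/centroid_model -m models/centered_instance_model
```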

Thanks, Liezl

roomrys avatar Jan 22 '25 23:01 roomrys

@eberrigan Unfortunately, this doesn't seem to solve the problem. Are you able to give it a go? If you download the zipped file and try running sleap-track, I'm curious whether it runs for you or crashes.

olinesn avatar Feb 07 '25 21:02 olinesn

OK @eberrigan @roomrys, I've isolated this to every tracker except 'simple'. It's very reproducible; hope this helps! Memory just keeps climbing until the PC crashes or inference finishes. If I switch to --tracking.tracker simple, the memory stays flat at ~10 GB and inference runs fine.
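For reference, a sketch of a working invocation (video and model paths are hypothetical placeholders; the tracker flag is the part that matters):

```bash
# Same inference, but with the 'simple' tracker:
# memory stays flat (~10 GB here) instead of climbing.
sleap-track two_mice.mp4 --max_instances 2 \
    -m models/centroid_model -m models/centered_instance_model \
    --tracking.tracker simple
```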

olinesn avatar Mar 12 '25 18:03 olinesn