
Docker GPU instructions for BirdNET-Analyzer?

Open ghost opened this issue 2 years ago • 30 comments

I have been a huge user of the original BirdNET - thank you for creating it!!! I want to run this new version over our existing collection. Having a similar "Dockerfile-GPU" set of instructions (like the original BirdNET's) for the new version would speed that up immensely.

ghost avatar Apr 15 '22 15:04 ghost

This is probably going to be tricky. We're using TFLite this time and as of now, that only runs on CPU. However, the repo also contains the Keras model (well, it contains the protobuf version of the Keras model) and that should be able to run on GPU. But I got mixed feedback from people and it seems that doesn't work for everyone. We'd have to do a bit more experimenting on that. In light of that, a GPU container might not make as much sense as it did for the original BirdNET repo. What do you think?

kahst avatar Apr 20 '22 08:04 kahst

This would be very useful for me! I have ~10,000 hours of recordings that I want to run BirdNET over, and a much better GPU than CPU. Even if formal support is not in the cards, I'd be curious to hear a rough outline from anyone who has gotten it working on the GPU. I'll probably give it a shot myself this week.

brendanwallison avatar Apr 25 '22 04:04 brendanwallison

I'll see if I can do some experiments; in the meantime, if anyone has relevant experience, please let us know.

kahst avatar Apr 25 '22 17:04 kahst

This is probably going to be tricky. We're using TFLite this time and as of now, that only runs on CPU. However, the repo also contains the Keras model (well, it contains the protobuf version of the Keras model) and that should be able to run on GPU. But I got mixed feedback from people and it seems that doesn't work for everyone. We'd have to do a bit more experimenting on that. In light of that, a GPU container might not make as much sense as it did for the original BirdNET repo. What do you think?

Stefan - Thanks so much for the reply. Bloated containers are not a limitation for my situation, so if the 10:1 speed increase of a GPU container (relative to CPU) still holds, I would welcome a Docker GPU implementation. Nonetheless, I recognize that having the full TensorFlow/Keras version might present a maintenance challenge for this project.

drdougwelch avatar Apr 25 '22 19:04 drdougwelch

I seem to have gotten it working outside of Docker. Here is my housekeeping so far, including random speedbumps not expressly related to the GPU.

  1. I didn't want to deal with Docker yet, so I just pip-installed everything. Assuming you already have the NVIDIA drivers, cuDNN, etc. installed on your system and on your path, the only relevant change is to install tensorflow-gpu instead of tensorflow. I assume it's not dramatically more complicated with the Dockerfile, but I just wanted a proof of concept before mucking about with Dockerfiles, which I'm not very good at.

  2. The config file already has a line for the protobuf model path. Uncomment that and comment out the TFLite model path.

  3. Unrelated to GPU computing, I wanted to use the config file for other settings. It turns out you have to comment out a bunch of stuff in the main section of analyze.py. The basic flow is that the config is imported and then overwritten by command-line arguments as relevant. However, the arg parser also has default values for some items like batch_size, which means your config values get overwritten even if you don't specify a command-line argument. The fastest fix is to comment out everything to do with "arg", though obviously this breaks command-line parsing in the process.

  4. I was puzzled that loading the protobuf model would also break batch processing. Line 211 of analyze.py converts your samples list into a numpy array of the correct size -- e.g. for two samples of typical length, your array has shape (2, 144000). But when you feed it into model.predict on line 212, you get back a single result vector of shape (2434,). In contrast, the TFLite model returns (2, 2434), which is the desired behavior. It's an easy fix: on line 140 of model.py, you'll find

prediction = PBMODEL.predict(sample)[0]

In other words, it was actually batch processing, but then only returning the very first result. Change it to prediction = PBMODEL.predict(sample) and batch processing works fine.
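For reference, a minimal sketch of the corrected predict path (the surrounding function body is illustrative; the actual change is just dropping the trailing "[0]"):

# Sketch of the fix described above (model.py, around line 140).
def predict(sample):
    # sample: numpy array of shape (batch_size, 144000) -- one row per 3-second chunk
    # Old: prediction = PBMODEL.predict(sample)[0]  -> only the first row, shape (2434,)
    prediction = PBMODEL.predict(sample)  # shape (batch_size, 2434), one row per chunk
    return prediction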


I haven't tested much or benchmarked at all, but my GPU utilization spikes as expected and the outputs are all the right shape. The minimal testing I've done beyond this is to feed in duplicate samples and confirm I'm getting identical predictions back out.

brendanwallison avatar Apr 25 '22 19:04 brendanwallison

Thanks @brendanwallison for testing. I have a few thoughts: 1) pip install tensorflow should also install GPU support, so there's no need for pip install tensorflow-gpu; 2) you're right, using command-line args overwrites config settings, which is not ideal at the moment; 3) I'll fix batch prediction like you mentioned, it seems like an easy fix.

What we also might want to do is add --gpu True as a command-line argument and then make the switch in the config to load the protobuf model. It would only work for analyze.py though. A dedicated GPU container would then set this flag to True by default. What do you think?
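A rough sketch of what that could look like in analyze.py; the config attribute names (cfg.PB_MODEL_PATH in particular) are illustrative rather than the repo's actual names, and using default=None also avoids the defaults-overwrite-the-config problem mentioned above:

import argparse
import config as cfg

parser = argparse.ArgumentParser()
# default=None means the config value is only overwritten when the user actually
# passes the argument, instead of being clobbered by the parser's default.
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument("--gpu", action="store_true", help="use the protobuf model on the GPU")
args = parser.parse_args()

if args.batch_size is not None:
    cfg.BATCH_SIZE = args.batch_size
if args.gpu:
    cfg.MODEL_PATH = cfg.PB_MODEL_PATH  # switch from the TFLite model to the SavedModel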

kahst avatar Apr 26 '22 07:04 kahst

Happy to help, this is a great resource.

I usually use PyTorch, so I'm sure you're right about not needing tensorflow-gpu.

For the rest, it seems like that is simple and gets the job done.

The only thing I'd add to your plan is a print statement somewhere to confirm that the GPU is indeed being used (or, alternatively, that no GPU was found).
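Something along these lines would do it (standard TensorFlow API; where exactly the message gets printed is up to you):

import tensorflow as tf

# Print once at startup so it is obvious whether inference will run on a GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    print("GPU(s) detected:", [gpu.name for gpu in gpus])
else:
    print("No GPU detected -- falling back to CPU inference.")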

brendanwallison avatar Apr 27 '22 15:04 brendanwallison

Shifting to large-batch prediction as opposed to a proof of concept, I discovered I would eventually run out of RAM after a certain number of files were processed. I believe this was due to a Keras memory leak rather than an error in your code. Even when I distilled it down to calling PBMODEL.predict(samples) in a loop, memory usage kept accumulating. The steps in this post fixed it for me:

https://github.com/keras-team/keras/issues/13118#issuecomment-749455410

Can you verify whether or not you get the same memory leak with the protobuf model as I did? Also, I'm less fluent with Keras -- are the fixes in the linked post reasonable, or would you recommend another fix?

brendanwallison avatar May 05 '22 00:05 brendanwallison

That's an issue that has been bugging me for a long time now, and it still exists. There are ways to mitigate the memory leak a bit, but I was not able to find a way to completely prevent it. It seems that using predict_on_batch instead of predict might help delay the out-of-memory error.
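For reference, that mitigation is a one-line swap (sketch; PBMODEL and sample as used elsewhere in this thread):

# predict() sets up per-call data handling and callbacks, which appears to be where memory
# accumulates in long loops; predict_on_batch() skips most of that machinery.
prediction = PBMODEL.predict_on_batch(sample)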

kahst avatar May 10 '22 10:05 kahst

Does it happen with the base models as well, or just the Protobuf model?

I have my fingers crossed that the memory leak has been plugged. Following the advice from the link above, I'm converting the input array to a tensor beforehand and calling the forward method of the model directly (e.g. PBMODEL(sample)). No apparent memory leak when run in a loop or on a test folder of ~100 GB of data. No guarantees, as my code has been slashed down for simplicity, but if you hadn't tried that specific combination of steps, it might be worth a shot.
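For anyone wanting to try the same combination, a minimal sketch of that workaround (assuming PBMODEL is the loaded Keras model and sample is the numpy batch):

import tensorflow as tf

# Convert the numpy batch to a tensor once, then call the model directly instead of going
# through model.predict(); the direct call avoids the per-call bookkeeping that appeared
# to be leaking memory across many files.
data = tf.convert_to_tensor(sample, dtype=tf.float32)
prediction = PBMODEL(data, training=False).numpy()  # shape (batch_size, 2434)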

One unrelated GPU RAM usage issue is sparked by CPU multithreading. I noticed that each thread calls the load_model function, with a corresponding spike in GPU RAM usage. I don't know for sure, but I assume each thread is loading its own copy of the model into GPU memory, which ends up creating a huge bottleneck. My short-term fix is to load very large batches with a single thread. This might have brought things to the point where it's good enough for my purposes, as I'm getting a substantial speedup despite being obviously IO-bound. The better solution would be a multi-threaded data loader feeding moderately sized batches to the GPU; a sketch of that pattern follows.
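Sketch of the loader pattern, with hypothetical helpers (load_and_chunk, save_results, split_files): reader threads do the IO-bound decoding, while a single thread owns the model so it is loaded onto the GPU exactly once.

import queue
import threading

import numpy as np

audio_queue = queue.Queue(maxsize=8)  # bounded, so readers cannot outrun the GPU

def reader(file_list):
    # IO-bound work: decode and chunk audio on CPU threads.
    for path in file_list:
        chunks = load_and_chunk(path)  # hypothetical helper -> np.ndarray of shape (n, 144000)
        audio_queue.put((path, chunks))
    audio_queue.put(None)  # sentinel: this reader is finished

def gpu_worker(n_readers):
    # Only this thread touches PBMODEL, so the model sits in GPU memory exactly once.
    finished = 0
    while finished < n_readers:
        item = audio_queue.get()
        if item is None:
            finished += 1
            continue
        path, chunks = item
        preds = PBMODEL(np.asarray(chunks, dtype=np.float32), training=False)
        save_results(path, preds.numpy())  # hypothetical helper

file_batches = split_files(all_files, 4)  # hypothetical helper: split your recording list 4 ways
readers = [threading.Thread(target=reader, args=(batch,)) for batch in file_batches]
for t in readers:
    t.start()
gpu_worker(n_readers=len(readers))
for t in readers:
    t.join()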

brendanwallison avatar May 11 '22 05:05 brendanwallison

It seems this issue only occurs with the protobuf model; the TFLite version does not show this kind of behavior. Feel free to open a pull request with your changes if that indeed fixes the memory leak.

For now, each thread will load its own model; I think because of the Python thread lock (the GIL), we can't share one model across multiple threads. Good to see that increasing the batch size does actually help though.

kahst avatar May 11 '22 07:05 kahst

Hey, @kahst @brendanwallison , I've been hoping to get GPU processing working for a project.

Any ideas on what is causing this warning and error? Is it as simple as keras_metadata.pb missing from the v2.2 model?

WARNING:tensorflow:SavedModel saved prior to TF 2.5 detected when loading Keras model. Please ensure that you are saving the model with model.save() or tf.keras.models.save_model(), NOT tf.saved_model.save(). To confirm, there should be a file named "keras_metadata.pb" in the SavedModel directory.

File "C:\Users\Troy\OneDrive\Projects\BirdNET-Analyzer\analyze.py", line 268, in analyzeFile p = predict(samples) File "C:\Users\Troy\OneDrive\Projects\BirdNET-Analyzer\analyze.py", line 217, in predict prediction = model.predict(data) File "C:\Users\Troy\OneDrive\Projects\BirdNET-Analyzer\model.py", line 139, in predict prediction = PBMODEL.predict(sample) AttributeError: '_UserObject' object has no attribute 'predict'

tgruetzm avatar Dec 13 '22 06:12 tgruetzm

@kahst, this seems to be the same as issue #57.

tgruetzm avatar Dec 13 '22 17:12 tgruetzm

One unrelated GPU RAM usage issue is sparked by CPU multithreading. I noticed that each thread calls the load_model function with a corresponding spike in GPU RAM usage. I don't know, but I assume that each thread is loading the model itself into GPU memory, which ends up creating a huge bottleneck. My short-term fix is to load very large batches with a single thread. This might have brought things to the point where it's good enough for my purposes, as I'm getting a substantial speedup despite being obviously IO-bound. The better solution would be a multi-threaded data loader feeding moderately sized batches to the GPU.

This is a really good point. I got my project running on the GPU, as long as I use the v2.1 model, due to the missing metadata file. I've been working on a multi-threaded pre-loader to feed the GPU. I don't think it's very feasible to run multiple instances of the model on the GPU; the GPU is still fairly lightly loaded just processing what the loader feeds it. The next problem is that the files are fully loaded into memory, so I may need to implement file streaming or buy a ton of memory for my application.

tgruetzm avatar Dec 16 '22 18:12 tgruetzm

What's the latest status of running BirdNET-Analyzer with GPU? Assuming I install the dependencies rather than use Docker, is there a way to use NVIDIA GPU for inference acceleration? Is it still necessary to use the Protobuf model rather than the tflite model? Thanks

sammlapp avatar Dec 29 '23 21:12 sammlapp

TensorFlow should automatically detect and use your GPU.

Josef-Haupt avatar Feb 23 '24 19:02 Josef-Haupt

@Josef-Haupt I am trying to run the model on the GPU. I have installed tensorflow-gpu (pip install tensorflow-gpu==2.7.0), but when I run analyze.py it only uses the CPU. Could you please let me know if there is a flag or something similar I need to change in order to run the model on the GPU?

Guruprasadhegde avatar Mar 18 '24 12:03 Guruprasadhegde

I also cannot use GPU. My environment has a working tensorflow installation and I don't have problems using the GPU with other tensorflow models such as Perch. My environment has:

tensorflow                   2.13.0
tensorflow-estimator         2.13.0
tensorflow-hub               0.14.0
tensorflow-io-gcs-filesystem 0.34.0

and Python 3.9.18

sammlapp avatar Apr 16 '24 13:04 sammlapp

@Josef-Haupt I am trying to run the model on the GPU. I have installed tensorflow-gpu (pip install tensorflow-gpu==2.7.0), but when I run analyze.py it only uses the CPU. Could you please let me know if there is a flag or something similar I need to change in order to run the model on the GPU?

tensorflow-gpu is outdated; you can just use normal tensorflow.

Josef-Haupt avatar Apr 16 '24 15:04 Josef-Haupt

I also cannot use GPU. My environment has a working tensorflow installation and I don't have problems using the GPU with other tensorflow models such as Perch. My environment has:

tensorflow                   2.13.0
tensorflow-estimator         2.13.0
tensorflow-hub               0.14.0
tensorflow-io-gcs-filesystem 0.34.0

and Python 3.9.18

Hard to say; how are you using the repository? It could also be a version mismatch; we use TF 2.15 to train BirdNET.

Josef-Haupt avatar Apr 16 '24 15:04 Josef-Haupt

Line 12 in model.py prevents the GPU working:

os.environ["CUDA_VISIBLE_DEVICES"] = ""

Change it to

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

And you should be good to go.

Mattk70 avatar Apr 16 '24 16:04 Mattk70

"Change it to

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

And you should be good to go."

sadly, this didn't work for me :/ Has it worked for others?

MZiegenhorn avatar Apr 18 '24 21:04 MZiegenhorn

@MZiegenhorn, there could be many reasons it doesn't work. Most importantly, my comment assumes a working CUDA environment. Type nvidia-smi in your terminal. If you have CUDA, you should see something like this (note the CUDA version in the top right):

[screenshot of nvidia-smi output, with the CUDA version shown in the top right]

If you have this, you will need to inspect logs in more detail to trace why the program is not using CUDA. Above the line:

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

There is this line:

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

If you change that to

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

and then re-run python analyze.py,

you will see a bunch of warnings about unsaved custom gradients. They can be ignored. When I change CUDA_VISIBLE_DEVICES to "0", this line, which was printed before all those warnings:

E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

disappears from the output, and inference is faster. Not much faster, but I also see python process activity on the GPU if I inspect nvidia-smi while the program is running, so I know the GPU is being used.
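If nvidia-smi looks fine but it's still unclear whether TensorFlow is actually placing ops on the GPU, device placement logging gives a definitive answer (standard TF API; put this before the model is loaded):

import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # log "/GPU:0" vs "/CPU:0" for each op as it runs
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))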

Mattk70 avatar Apr 19 '24 18:04 Mattk70

@Mattk70 thanks for responding :)

Yeah, the computer says I have a working CUDA environment, but still no luck. I tried changing CUDA_VISIBLE_DEVICES to "0" and TF_CPP_MIN_LOG_LEVEL to "2". I've decided performance on the CPU is fast enough for now that it's not worth the additional debugging headache. Maybe after the field season...

but, seriously, I appreciate the help!

MZiegenhorn avatar Apr 23 '24 01:04 MZiegenhorn

Anyone have any updates on this? Our workplace runs BirdNET at a large scale, and GPU acceleration would be a big help.

carlos-serrouya avatar May 16 '24 16:05 carlos-serrouya

I'm not convinced GPU acceleration will be a BIG help, but if it's not working for you, you'll need to paste your console output / errors to get further assistance.

Mattk70 avatar May 17 '24 08:05 Mattk70

I don't have any errors. I run BirdNET in a CUDA-enabled environment that uses GPU acceleration with other TF neural networks, but I haven't managed to get GPU acceleration for BirdNET. I just want to know if it's even possible: when you run it, is there GPU acceleration?

carlos-serrouya avatar May 17 '24 15:05 carlos-serrouya

So, if I understand correctly, you have run BirdNET on the GPU but are not seeing an improvement in speed.

That does not surprise me. I see a ~20% speed-up (IIRC), but I have an NVIDIA 3090 to throw at it. The BirdNET model architecture is such that a GPU cannot run many of the ops on its tensor cores.

Fixing this would require re-architecting the model from the ground up and training a new model from scratch - not quite back to square one, but a significant undertaking. In any event, it would take a lot of time and effort, and might even result in a model that is faster but less accurate.

Mattk70 avatar May 17 '24 16:05 Mattk70

Silly me, I forgot to mention that BirdNET running under Node.js is about 5x faster than using the BirdNET-Analyzer CLI or GUI.

If you want a faster BirdNET, why not give Chirpity a trial.

Mattk70 avatar May 17 '24 16:05 Mattk70

I appreciate your reply, it saves me possibly weeks of rabbit holing. I'll look into Chirpity.

carlos-serrouya avatar May 17 '24 16:05 carlos-serrouya