
GPU is not being used

Open derritter88 opened this issue 4 years ago • 19 comments

Hello @marcelklehr ,

I have enabled GPU support in the admin GUI, but when I start a manual run via `occ recognize:classify`, I can see that a process is started and uses 100 % of one CPU core. The GPU is not being used.

I have installed all the specified Nvidia applications/libraries.
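One way to confirm this observation while the classifier is running is to poll `nvidia-smi`. The following is only a minimal sketch, assuming an NVIDIA driver with `nvidia-smi` on the PATH; `parseGpuUsage` and `readGpuUsage` are hypothetical helper names and are not part of Recognize.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: parses the CSV printed by
//   nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits
// which emits one "utilization, memory" line per GPU.
function parseGpuUsage(csv) {
  return csv
    .trim()
    .split('\n')
    .filter(Boolean) // drop empty lines, e.g. when the output is empty
    .map((line, index) => {
      const [utilization, memory] = line.split(',').map((s) => Number(s.trim()));
      return { index, utilizationPercent: utilization, memoryUsedMiB: memory };
    });
}

// Hypothetical helper: runs nvidia-smi once and returns the parsed usage.
// Throws if nvidia-smi is not installed or no driver is loaded.
function readGpuUsage() {
  const out = execFileSync(
    'nvidia-smi',
    ['--query-gpu=utilization.gpu,memory.used', '--format=csv,noheader,nounits'],
    { encoding: 'utf8' }
  );
  return parseGpuUsage(out);
}
```

Calling `readGpuUsage()` a few times during `occ recognize:classify` should show utilization staying near zero if only the CPU is doing the work.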

derritter88 avatar Aug 26 '21 06:08 derritter88

Hi!

Are there any messages in the nextcloud log?

marcelklehr avatar Aug 26 '21 10:08 marcelklehr

Unfortunately not - the only "warning" I can see in my log would be: [recognize] Warning: Classifying photos of user 3A60C52D-9415-4F28-A2B7-71A8CBD7A9E3 at 2021-08-26T08:37:57+02:00

The only thing I can see on my shell is that www-data is running node-v14.17.4-linux-x64. This process cannot be stopped or killed - even a reboot does not solve it. I need to reset the whole VM to get the process killed.

derritter88 avatar Aug 26 '21 10:08 derritter88

What I additionally see in the log (though it is not linked to my manual start of the classifying process) is: `[recognize] Warning: Classifier process output: 2021-08-26 07:59:08.434295: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. []

at 2021-08-26T07:59:08+02:00`

derritter88 avatar Aug 26 '21 10:08 derritter88

and:

[index] Error: Call to a member function getOwner() on null

GET /index.php/apps/recognize/admin/countMissed from 192.168.10.2 by 3A60C52D-9415-4F28-A2B7-71A8CBD7A9E3 at 2021-08-26T08:19:11+02:00

But I am not sure if this is linked to this issue or not.

derritter88 avatar Aug 26 '21 10:08 derritter88

Okay, so during the night the new version became available for download, and I installed it this morning: Nextcloud 22.1.1, Recognize 1.6.3.

When manually starting the process, I get the following error message:

Classifying photos of user ED17CAA4-EC2F-4457-95AB-A5980927C9C8
Failed to classify images
Classifier process error

My log would say:

[recognize] Warning: Classifier process output:
2021-08-27 06:42:20.937775: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Error: Cannot find module '@tensorflow/tfjs-node-gpu'
Require stack:
- /var/www/cloud/apps/recognize/src/efficientnet/EfficientnetModel.js
- /var/www/cloud/apps/recognize/src/classifier_imagenet.js
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:889:15)
    at Function.Module._load (internal/modules/cjs/loader.js:745:27)
    at Module.require (internal/modules/cjs/loader.js:961:19)
    at require (internal/modules/cjs/helpers.js:92:18)
    at Object.<anonymous> (/var/www/cloud/apps/recognize/src/efficientnet/EfficientnetModel.js:11:9)
    at Module._compile (internal/modules/cjs/loader.js:1072:14)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1101:10)
    at Module.load (internal/modules/cjs/loader.js:937:32)
    at Function.Module._load (internal/modules/cjs/loader.js:778:12)
    at Module.require (internal/modules/cjs/loader.js:961:19) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [
    '/var/www/cloud/apps/recognize/src/efficientnet/EfficientnetModel.js',
    '/var/www/cloud/apps/recognize/src/classifier_imagenet.js'
  ]
}
Trying js-only mode
internal/modules/cjs/loader.js:892
  throw err;
  ^

Error: Cannot find module '@tensorflow/tfjs-backend-wasm'
Require stack:
- /var/www/cloud/apps/recognize/src/efficientnet/EfficientnetModel.js
- /var/www/cloud/apps/recognize/src/classifier_imagenet.js
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:889:15)
    at Function.Module._load (internal/modules/cjs/loader.js:745:27)
    at Module.require (internal/modules/cjs/loader.js:961:19)
    at require (internal/modules/cjs/helpers.js:92:18)
    at Object.<anonymous> (/var/www/cloud/apps/recognize/src/efficientnet/EfficientnetModel.js:19:3)
    at Module._compile (internal/modules/cjs/loader.js:1072:14)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1101:10)
    at Module.load (internal/modules/cjs/loader.js:937:32)
    at Function.Module._load (internal/modules/cjs/loader.js:778:12)
    at Module.require (internal/modules/cjs/loader.js:961:19) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [
    '/var/www/cloud/apps/recognize/src/efficientnet/EfficientnetModel.js',
    '/var/www/cloud/apps/recognize/src/classifier_imagenet.js'
  ]
}

at 2021-08-27T06:42:20+02:00

derritter88 avatar Aug 27 '21 04:08 derritter88

So it looks like @tensorflow/tfjs-node-gpu and @tensorflow/tfjs-backend-wasm are not included in the NC app.
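The log suggests a fallback order: the native GPU binding is tried first, then "js-only mode". Which packages are actually present can be checked by hand with Node's resolver. This is a minimal sketch, assuming it is run with Node from the app directory. `firstResolvable` is a hypothetical helper, not part of Recognize, and listing `@tensorflow/tfjs-node` as a candidate is an assumption (only the other two package names appear in the log).

```javascript
// Hypothetical helper, not part of Recognize: walks the candidate list in
// order and reports the first package Node can resolve, plus those missing.
function firstResolvable(candidates, resolve = (name) => require.resolve(name)) {
  const missing = [];
  for (const name of candidates) {
    try {
      resolve(name); // throws with code 'MODULE_NOT_FOUND' when absent
      return { found: name, missing };
    } catch (err) {
      if (err.code !== 'MODULE_NOT_FOUND') throw err; // unrelated failure
      missing.push(name);
    }
  }
  return { found: null, missing };
}

const candidates = [
  '@tensorflow/tfjs-node-gpu',     // named in the first error above
  '@tensorflow/tfjs-node',         // assumption: CPU-only native binding
  '@tensorflow/tfjs-backend-wasm', // named in the second error above
];
```

Running `console.log(firstResolvable(candidates))` from `/var/www/cloud/apps/recognize` would list the GPU and WASM packages under `missing` if they really are absent from the bundle.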

derritter88 avatar Aug 27 '21 05:08 derritter88

I've had to disable GPU for now, because the bundle would exceed the bundle size limit :/

marcelklehr avatar Aug 27 '21 11:08 marcelklehr

I've had to disable GPU for now, because the bundle would exceed the bundle size limit :/

The limitation from the Nextcloud appstore?

derritter88 avatar Aug 27 '21 11:08 derritter88

Yeah

marcelklehr avatar Aug 27 '21 12:08 marcelklehr

Okay, would it be possible that you create a "Github-only" version of it (e.g. xxx-RC1) so I can download and test it?

derritter88 avatar Aug 27 '21 12:08 derritter88

I'll definitely try to make something available. Currently, my problem is that I have to develop that blindly, as I don't have a GPU machine available.

marcelklehr avatar Aug 28 '21 14:08 marcelklehr

If you want, you can package it for me and I will act as your alpha/beta tester.

derritter88 avatar Aug 28 '21 14:08 derritter88

I'm testing it with my NVIDIA GeForce GTX 1660 Super (CUDA is supported, even though I couldn't find it on the list).

First I have to set up another instance. I'm using an older version where it is still integrated.

arch-user-france1 avatar Sep 21 '21 15:09 arch-user-france1

lol nextcloud apps is down :(

Now I can wait even longer

arch-user-france1 avatar Sep 21 '21 16:09 arch-user-france1

GPU support has to wait until other issues are sorted out, sorry.

marcelklehr avatar Oct 12 '21 20:10 marcelklehr

Okay so for the moment I can remove all necessary Nvidia libraries (except driver)?

derritter88 avatar Oct 13 '21 13:10 derritter88

Okay so for the moment I can remove all necessary Nvidia libraries (except driver)?

For the moment no NVIDIA drivers and libraries are needed, but they won't hurt either, so it's up to you.

marcelklehr avatar Oct 13 '21 13:10 marcelklehr

It's just a bit complex to install different CUDA libraries/versions - that's why I am asking :-) At the moment I am sticking with CUDA 11.2, as you mentioned it in a previous version.

derritter88 avatar Oct 13 '21 13:10 derritter88

@marcelklehr just in case: Windows now supports NVIDIA GPUs within WSL, which I am using. So if you have any tests I could run, just let me know.

derritter88 avatar Nov 18 '21 06:11 derritter88

@derritter88, did you get it working? I run NC in Docker and have been able to give containers access to the GPU, e.g. with the TensorFlow example: `docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"`

The NVIDIA examples give similar results:

# docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Maxwell" with compute capability 5.0

> Compute 5.0 CUDA device: [NVIDIA GeForce GTX 960M]
5120 bodies, total time for 10 iterations: 6.155 ms
= 42.591 billion interactions per second
= 851.816 single-precision GFLOP/s at 20 flops per interaction

I have a big archive of photos to process, and running it on the CPU is overkill.

Thanks for any hints on how to get it working - I'm not shy about customizing the NC container or whatever else is needed.

bugsyb avatar Dec 01 '22 22:12 bugsyb

Hello @bugsyb ,

thanks for sharing this with me/us. It might be useful information for some people, but unfortunately I do not run Nextcloud as a Docker container. I "just" have a regular dedicated Nextcloud VM. I also had around 100k photos/images to classify, but my CPU handled that over the last couple of weeks.

I had some discussions with @marcelklehr about it, and the major problem would be to have an AI library like TensorFlow which could handle both Nvidia and AMD GPUs.

derritter88 avatar Dec 02 '22 07:12 derritter88

Hi @derritter88 ,

Thanks for the swift response.

I took a quick look at what gets installed as part of Recognize, and it looks like tensorflow-webgl ends up in there. There is also a flag in the code which suggests it should be possible even today: `process.env.RECOGNIZE_GPU`
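To illustrate how an environment flag like this usually gates the choice of backend, here is a minimal sketch. It is not taken from the Recognize source: the accepted value `'true'`, the helper names `gpuRequested` and `backendPackage`, and the CPU package name `@tensorflow/tfjs-node` are all assumptions made for the example.

```javascript
// Assumption: the flag counts as enabled only for the literal string 'true'.
// How Recognize actually interprets RECOGNIZE_GPU is not shown in this thread.
function gpuRequested(env = process.env) {
  return String(env.RECOGNIZE_GPU || '').toLowerCase() === 'true';
}

// Hypothetical selection step: returns the package name that would be loaded.
function backendPackage(env = process.env) {
  return gpuRequested(env)
    ? '@tensorflow/tfjs-node-gpu'
    : '@tensorflow/tfjs-node'; // assumption: CPU-only native binding
}

console.log(backendPackage({ RECOGNIZE_GPU: 'true' })); // @tensorflow/tfjs-node-gpu
console.log(backendPackage({}));                        // @tensorflow/tfjs-node
```

Even with such a flag set, loading would still fail with MODULE_NOT_FOUND if the GPU package is not shipped in the app bundle.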

I was hoping that, given your earlier engagement, you would know how to get Recognize to use the GPU.

I also have a large number of photos to process and... well, I hoped I could leverage the GPU, which is otherwise wasted.

I run most apps as containers these days, just for simplicity, dependency handling, and ease of portability between systems. Happy to share knowledge on the side if you are interested.

Re Nvidia and AMD GPUs: TensorFlow can be run both natively and in a container, as demonstrated for Nvidia.

Here are a few explanations covering AMD:
https://community.amd.com/t5/hsa/tensorflow-with-amd-gpu/td-p/199925
https://medium.com/analytics-vidhya/install-tensorflow-2-for-amd-gpus-87e8d7aeb812
https://www.amd.com/en/technologies/infinity-hub/tensorflow
https://tealfeed.com/install-tensorflow-gpu-amd-gpus-vbs7s

There was also another implementation, DirectML, though according to what I found online it targets Windows and WSL, so it would not be usable on standard Linux (I am not sure about the latter, though).

If we could get started with Nvidia, which is more popular among people who would use it on Linux (not so much for gaming ;) ), it would be great, especially as TensorFlow is already available.

I can't help much with AMD as I don't have one.

bugsyb avatar Dec 02 '22 19:12 bugsyb

To be honest: I gave up on this topic and passed my GPU to a Plex VM for video transcoding. But maybe @marcelklehr could improve the general logic of Recognize?

derritter88 avatar Dec 02 '22 20:12 derritter88

I have an AMD gpu in a laptop that I use for nextcloud

Doomsdayrs avatar Dec 03 '22 22:12 Doomsdayrs

I have an AMD gpu in a laptop that I use for nextcloud

AMD GPUs probably won't work anyways.

arch-user-france1 avatar Dec 03 '22 23:12 arch-user-france1