Classifier process errored - CPU Architecture

Open LotusAxt opened this issue 3 years ago • 46 comments

I installed v1.5.8 but the process still fails without any useful error message. :(

{
  "reqId": "b6Wn01s9daui30J6ezBX",
  "level": 2,
  "time": "2021-08-24T07:45:23+00:00",
  "remoteAddr": "",
  "user": "--",
  "app": "recognize",
  "method": "",
  "url": "--",
  "message": "Classifier process errored",
  "userAgent": "--",
  "version": "22.1.0.1",
  "id": "6124ac5f5bb41"
}
{
  "reqId": "b6Wn01s9daui30J6ezBX",
  "level": 2,
  "time": "2021-08-24T07:45:23+00:00",
  "remoteAddr": "",
  "user": "--",
  "app": "recognize",
  "method": "",
  "url": "--",
  "message": "Classifier process error",
  "userAgent": "--",
  "version": "22.1.0.1",
  "id": "6124ac5f5bb0a"
}

LotusAxt avatar Aug 24 '21 08:08 LotusAxt

What's your ~~Nextcloud version and~~ system architecture? Out of curiosity: Did you install via the web UI, via occ or by manually extracting the tarball?

marcelklehr avatar Aug 24 '21 09:08 marcelklehr

My Nextcloud runs in a Debian Buster Docker container on a Synology DS918+, so it should be x64. I tried both installation methods: first manually, and when that didn't work I uninstalled the app and did a reinstall from the web UI.

LotusAxt avatar Aug 24 '21 10:08 LotusAxt

ok.

(The fact that you didn't run into https://github.com/marcelklehr/recognize/issues/52 when installing from the web UI is interesting...)

marcelklehr avatar Aug 24 '21 11:08 marcelklehr

There should be a warning-level log message before these two log messages, saying something like Classifier process output: ...

marcelklehr avatar Aug 24 '21 11:08 marcelklehr

Jep, also not very verbose :-/

{
  "reqId": "b6Wn01s9daui30J6ezBX",
  "level": 2,
  "time": "2021-08-24T07:45:23+00:00",
  "remoteAddr": "",
  "user": "--",
  "app": "recognize",
  "method": "",
  "url": "--",
  "message": "Classifier process output: ",
  "userAgent": "--",
  "version": "22.1.0.1"
}

LotusAxt avatar Aug 24 '21 12:08 LotusAxt

Speaking of verbose: Is there a verbose logging parameter for the occ command?

LotusAxt avatar Aug 24 '21 12:08 LotusAxt

Is there a verbose logging parameter for the occ command?

Not currently. I agree that it could spit out more information again. The output was reduced by a refactor, but I'll take another look.

"Classifier process output: "

That would mean that the process fails silently :/ Can you try executing the classifier manually?

$ node recognize/src/classifier_imagenet.js path/to/some/image-file.jpg

(Update: Forgot the node binary...)

Thank you for sponsoring me, btw :heart:

marcelklehr avatar Aug 24 '21 12:08 marcelklehr

Mmh. seems correct. Can you try sudo -u http bin/node-v14.17.4-linux-x64 --version?

And for good measure: lscpu

marcelklehr avatar Aug 24 '21 13:08 marcelklehr

Sorry, I deleted my former post as I realized I executed the command on the host machine and not within the Docker container. 😅

Within the container it's:

#: sudo docker exec -i -u www-data nextcloud /var/www/html/apps/recognize/bin/node-v14.17.4-linux-x64 /var/www/html/apps/recognize/src/classifier_imagenet.js /var/www/html/data/Admin/files/Photos/Frog.jpg
#: sudo docker exec -i -u www-data nextcloud /var/www/html/apps/recognize/bin/node-v14.17.4-linux-x64 --version
v14.17.4
#:
#: sudo docker exec -i -u www-data nextcloud lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               92
Model name:          Intel(R) Celeron(R) CPU J3455 @ 1.50GHz
Stepping:            9
CPU MHz:             1501.000
CPU max MHz:         1501.0000
CPU min MHz:         800.0000
BogoMIPS:            2995.24
Virtualization:      VT-x
L1d cache:           24K
L1i cache:           32K
L2 cache:            1024K
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch intel_pt ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdseed smap clflushopt sha_ni xsaveopt xsavec xgetbv1 dtherm ida arat pln pts md_clear arch_capabilities
#:

For completeness the result on the host was:

#: sudo -u http bin/node-v14.17.4-linux-x64 /volume1/Nextcloud/web/apps/recognize/src/classifier_imagenet.js /volume1/Nextcloud/data/Admin/files/Photos/Frog.jpg
Illegal instruction
#: sudo -u http bin/node-v14.17.4-linux-x64 --version
v14.17.4

LotusAxt avatar Aug 24 '21 13:08 LotusAxt

It seems we have run into https://github.com/tensorflow/tfjs/issues/2631
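
For anyone reproducing this: an empty "Classifier process output: " together with "Illegal instruction" on the command line suggests that the node process was killed by a signal (SIGILL, signal 4 on Linux) rather than exiting with an error of its own. A minimal sketch for checking the exit status by hand, assuming a POSIX shell; `describe_status` is a hypothetical helper and not part of recognize:

```shell
#!/bin/sh
# Hypothetical helper: interpret a shell exit status.
# By shell convention, a status above 128 means the process was
# terminated by signal number (status - 128).
describe_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "ok"
  elif [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
  else
    echo "exited with status $status"
  fi
}

# Usage (paths are placeholders, as in the command above):
#   node recognize/src/classifier_imagenet.js path/to/some/image-file.jpg
#   describe_status $?
```

A status of 132 would correspond to signal 4, which matches the "Illegal instruction" message printed by the shell.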

marcelklehr avatar Aug 24 '21 13:08 marcelklehr

Two paths forward from here:

  • I will probably add an option in the settings to force running the models in pure-js mode (slooow)
  • More relevant for you: Install node.js and npm and run rm -rf node_modules && npm install in apps/recognize to have npm build libtensorflow according to your hardware specs.

marcelklehr avatar Aug 24 '21 13:08 marcelklehr

So, I installed node.js via nvm and ran rm -rf node_modules && npm install. The good news: the two error messages in the Nextcloud log are gone. The bad news: it still doesn't work. :( If I run occ recognize:classify the output is:

Classifying photos of user Admin
Failed to classify images
Classifier process error

{
  "reqId": "KBbgSJd7nbEibjgv1vJ5",
  "level": 2,
  "time": "2021-08-24T15:43:10+00:00",
  "remoteAddr": "",
  "user": "--",
  "app": "recognize",
  "method": "",
  "url": "--",
  "message": "Classifier process output: ",
  "userAgent": "--",
  "version": "22.1.0.1",
  "id": "6125138e97f97"
}

If I run your test command node recognize/src/classifier_imagenet.js path/to/some/image-file.jpg:

#$ /var/www/html/apps/recognize/bin/node-v14.17.4-linux-x64 /var/www/html/apps/recognize/src/classifier_imagenet.js /var/www/html/data/Admin/files/Photos/Frog.jpg
Illegal instruction (core dumped)
#$

I already tried npm rebuild @tensorflow/tfjs-node --build-addon-from-source but that also didn't help.

Any other ideas?

LotusAxt avatar Aug 24 '21 15:08 LotusAxt

I think that's as deep as we're gonna go on this one. I'll publish a new release soon where you can select pure-js operation. That should definitely work, even though it's slower.

marcelklehr avatar Aug 24 '21 16:08 marcelklehr

Okay, Thank you!

LotusAxt avatar Aug 24 '21 16:08 LotusAxt

Now that I have been able to install (Raspbian, NC 21.0.4) following this, I get an error when starting, with nothing in the logs:

sudo -u www-data /usr/bin/php /var/www/nextcloud/occ recognize:classify
Classifying photos of user admin
Failed to classify images
Classifier process error

spicemint avatar Aug 25 '21 09:08 spicemint

@spicemint Yep, raspi is arm, which does not work, yet. I'm on it.

marcelklehr avatar Aug 25 '21 11:08 marcelklehr

I'm using a J5040 (Goldmont Plus architecture), and the same thing happens. The source of the problem is the prebuilt libraries (libtensorflow.so, libtensorflow_framework.so) in the tfjs-node library, which require specific CPU instructions that some CPUs don't have.

I could try the plain JavaScript option, but I didn't want to because it would be painfully slow. So here's what I did:

  • Follow Optional: Build optimal TensorFlow from source.
  • During ./configure, there will be a step that asks: Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified. Enter -march=goldmont.
  • Replace the deps folder in the recognize folder with what you built.
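
The last step can be sketched as follows, assuming the build produced a libtensorflow.tar.gz archive and that the deps folder lives somewhere under the recognize app's node_modules. Both the paths and the `replace_deps` helper are hypothetical, so adjust them to your setup:

```shell
#!/bin/sh
# Hypothetical helper: unpack a locally built libtensorflow tarball into a
# deps directory, keeping one backup of the previous contents.
# Both arguments are placeholders for paths on your own system.
replace_deps() {
  tarball=$1
  deps_dir=$2
  [ -f "$tarball" ] || { echo "no such tarball: $tarball" >&2; return 1; }
  if [ -d "$deps_dir" ]; then
    rm -rf "$deps_dir.bak"           # discard any older backup
    mv "$deps_dir" "$deps_dir.bak"   # keep the prebuilt libraries around
  fi
  mkdir -p "$deps_dir"
  tar -xzf "$tarball" -C "$deps_dir"
}

# Usage (assumed locations, not confirmed in this thread):
#   replace_deps ~/tensorflow/bazel-bin/tensorflow/tools/lib_package/libtensorflow.tar.gz \
#     /var/www/html/apps/recognize/node_modules/@tensorflow/tfjs-node/deps
```

Keeping the old folder as deps.bak makes it easy to roll back if the custom build does not load.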

Voila! result

HelloKS avatar Aug 29 '21 00:08 HelloKS

This may also be a nice resource for downloading pre-built binaries for some architectures: https://github.com/kaufman-lab/build_tensorflow/releases (whl files are simply zip archives)

marcelklehr avatar Aug 29 '21 03:08 marcelklehr

ok.

(The fact that you didn't run into #52 when installing from the web UI is interesting...)

Yes, some versions worked. I also installed the last working version through Nextcloud's installer, and somehow it worked, lol. I then deleted it because of a bug and couldn't reinstall it, so I installed it without the UI.

arch-user-france1 avatar Sep 17 '21 17:09 arch-user-france1

I'm using a J5040 (Goldmont Plus architecture), and the same thing happens. The source of the problem is the prebuilt libraries (libtensorflow.so, libtensorflow_framework.so) in the tfjs-node library, which require specific CPU instructions that some CPUs don't have.

I could try the plain JavaScript option, but I didn't want to because it would be painfully slow. So here's what I did:

* Follow [Optional: Build optimal TensorFlow from source](https://github.com/tensorflow/tfjs/tree/master/tfjs-node#optional-build-optimal-tensorflow-from-source).

* During `./configure`, there will be a step that asks: `Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified`. Enter `-march=goldmont`.

* Replace the deps folder in the recognize folder with what you built.

Voila! result

Does this improve performance at all? (I just installed it from the package rather than building it myself, and I don't have to use JavaScript mode.)

arch-user-france1 avatar Sep 17 '21 17:09 arch-user-france1

I'm currently setting up a repo for building various flavors of libtensorflow. It would be useful to know which kinds people need.

If you'd like your machine to be covered, run the following on your machine and post the output here:

gcc -march=native -Q --help=target | grep march
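
To post only the architecture name rather than the full grep output, the -march line can be filtered further. A minimal sketch, assuming awk is available; `extract_march` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read the output of
# `gcc -march=native -Q --help=target` on stdin and print only the
# value reported for -march (e.g. "nehalem").
extract_march() {
  # The wanted line has "-march=" as its first field and the value as
  # its second; the "Known valid arguments for -march= option:" line
  # starts with "Known" and is therefore skipped.
  awk '$1 == "-march=" { print $2; exit }'
}

# Usage:
#   gcc -march=native -Q --help=target | extract_march
```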

(cc @jakobroehrl)

marcelklehr avatar Oct 28 '21 14:10 marcelklehr

Hey. I am using the Docker Alpine Nextcloud image. Is it sufficient to run apk add libc6-compat to get it running, apart from the TensorFlow problem?

Output of gcc: skylake-avx512

Because currently when I try to recognize:classify I get:

Classifying photos of user emporea
Failed to classify images
Classifier process error

Emporea avatar Oct 29 '21 01:10 Emporea

root@server:~# gcc -march=native -Q --help=target | grep march
-march= nehalem
Known valid arguments for -march= option:
root@server:~#

jakobroehrl avatar Oct 29 '21 05:10 jakobroehrl

This may also be a nice resource for downloading pre-built binaries for some architectures: https://github.com/kaufman-lab/build_tensorflow/releases (whl files are simply zip archives)

How to use/install them? Thanks

jakobroehrl avatar Oct 29 '21 12:10 jakobroehrl

I've forked that repository and set up a better pipeline to streamline this: https://github.com/marcelklehr/build_tensorflow/actions

marcelklehr avatar Oct 31 '21 14:10 marcelklehr

I'm getting a log full of these too. Using NC version 23.0.3 and the latest Recognize installed via the Apps page within NC itself.

image

If I manually run: occ recognize:classify-images

It scrolls through many images found, finally ending with this:

  8073 => '/data/__groupfolders/1/Pets/Tucker/20210321_094827.jpg',
)
Running array (
  0 => '',
  1 => '/config/www/nextcloud/apps/recognize/src/classifier_imagenet.js',
  2 => '-',
)
sh: taskset: not found
Classifier process output: sh: exec: line 1: : Permission denied

Classifier process output: sh: exec: line 1: : Permission denied

Failed to classify images
Classifier process error

I'm using the Nextcloud Docker container, on Win 10 x64, WSL (Ubuntu containers).

YouveGotMeowxy avatar Mar 30 '22 19:03 YouveGotMeowxy

@YouveGotMeowxy As the title says, most likely your CPU architecture is not supported by the standard TensorFlow build. That means you have two options:

  1. Run in JS mode (Can be enabled in the admin settings for recognize)
  2. Use a custom build of tensorflow for your architecture, you'll have to compile from source (This repo may help, but I can't help with the details atm)

marcelklehr avatar Mar 30 '22 19:03 marcelklehr

@YouveGotMeowxy As the title says, most likely, your CPU architecture is not supported by the standard tensorflow build.

I'm just running it within an Ubuntu container on WSL (AMD Ryzen processor). Standard Tensorflow doesn't support that?

Also, are there any drawbacks to using the js mode? And if I use JS mode, will the OCC manual recognize still work?

YouveGotMeowxy avatar Mar 30 '22 19:03 YouveGotMeowxy

Side question:

Any chance of simplifying our lives just "that much more" by adding simple buttons we can click to do these things manually? (Buttons where the pink marks are.) :-p

image

Save us from the gory, ugly command line. Copying and pasting commands, etc. lol

YouveGotMeowxy avatar Mar 30 '22 20:03 YouveGotMeowxy

I'm just running it within an Ubuntu container on WSL (AMD Ryzen processor). Standard Tensorflow doesn't support that?

I have no idea whether TensorFlow supports your CPU. lscpu will tell you which instructions your CPU supports. TensorFlow needs AVX and possibly AVX-512 or something like that; I don't remember off the top of my head. A different reason for the Permission denied error could be that it's actually about permissions. Recognize makes sure that all permissions are set on the node.js binary, but you can check whether executing recognize/bin/node --version works. If it does, your CPU is to blame; if it does not, the installation failed.
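
The lscpu check can be scripted. A minimal sketch, assuming a POSIX shell with grep and tr; `has_cpu_flag` is a hypothetical helper, and which flags the prebuilt TensorFlow library actually requires is not confirmed here:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given CPU flag appears in the
# "Flags:" line of lscpu output (or the "flags" lines of /proc/cpuinfo)
# read on stdin.
has_cpu_flag() {
  flag=$1
  # Put every token on its own line, then require an exact whole-line
  # match so that "avx" does not match "avx2".
  grep -i '^flags' | tr ' \t' '\n\n' | grep -Fx -- "$flag" >/dev/null
}

# Usage:
#   lscpu | has_cpu_flag avx && echo "AVX available" || echo "no AVX"
```

For reference, the flags list in the J3455 lscpu output earlier in this thread includes sse4_2 but not avx.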

Also, are there any drawbacks to using the js mode? And if I use JS mode, will the OCC manual recognize still work?

JS mode is slower, but other than that should work fine.

marcelklehr avatar Mar 30 '22 20:03 marcelklehr