
cutechess-cli for raspberry pi (request for help).

Open vdbergh opened this issue 1 year ago • 6 comments

Occasionally there are requests for running the worker on a RPI. To make this possible we need a cutechess-cli binary for the RPI. Perhaps someone who is familiar with the RPI architecture can look into this?

Instructions for cross-compiling would be the most useful. That way the binary can be produced by developers who do not own an RPI.

vdbergh avatar Jul 30 '22 16:07 vdbergh

I once tried a native compile, but gave up because the build took too long. I can install it by running sudo apt-get install cutechess.

vondele avatar Jul 30 '22 16:07 vondele

Are Raspberry Pis even fast enough to meet fishtest's minimum nps?

Disservin avatar Jul 30 '22 16:07 Disservin

@vondele I browsed around the RPI repository. At first sight this seems to be the source code:

http://sourcearchive.raspbian.org/main/c/cutechess/

It is from 2013. This is not recent enough.

vdbergh avatar Jul 30 '22 17:07 vdbergh

yeah, it is probably pretty old:

$ cutechess-cli --version
cutechess-cli 0.4.2
Using Qt version 4.8.7
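
A check for a too-old cutechess-cli could be sketched as below. It only parses the first line of the `--version` output. The function names are hypothetical, and `ASSUMED_MIN_VERSION` is a placeholder, since the thread does not say which version fishtest actually needs.

```python
import re

# Placeholder threshold for illustration only; the thread does not state
# which cutechess-cli version fishtest actually requires.
ASSUMED_MIN_VERSION = (1, 0, 0)


def parse_cutechess_version(version_output):
    """Return the numeric version tuple from `cutechess-cli --version` output."""
    lines = version_output.strip().splitlines()
    if not lines:
        raise ValueError("empty version output")
    # Matches "cutechess-cli 0.4.2" and ignores suffixes such as "-beta2".
    match = re.match(r"cutechess-cli\s+(\d+(?:\.\d+)*)", lines[0])
    if match is None:
        raise ValueError(f"unrecognized version line: {lines[0]!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_recent_enough(version_output, minimum=ASSUMED_MIN_VERSION):
    """Compare the parsed version against the (assumed) minimum."""
    return parse_cutechess_version(version_output) >= minimum


packaged = "cutechess-cli 0.4.2\nUsing Qt version 4.8.7\n"
print(parse_cutechess_version(packaged), is_recent_enough(packaged))
```

Comparing tuples rather than strings avoids mis-ordering versions such as 0.10 and 0.4.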

vondele avatar Jul 30 '22 17:07 vondele

I tried running fishtest on my Pi 4 with a natively built cutechess-cli. It still doesn't work, as the makefile defaults to x86 instead of using aarch64 (?)

Available Makefile architecture targets:  ['x86-64-vnni512', 'x86-64-vnni256', 'x86-64-avx512', 'x86-64-avxvnni', 'x86-64-bmi2', 'x86-64-avx2', 'x86-64-sse41-popcnt', 'x86-64-modern', 'x86-64-ssse3', 'x86-64-sse3-popcnt', 'x86-64', 'x86-32-sse41-popcnt', 'x86-32-sse2', 'x86-32', 'ppc-64', 'ppc-32', 'armv7', 'armv7-neon', 'armv8', 'e2k', 'apple-silicon', 'general-64', 'general-32']
Available g++/cpu properties:  {'flags': ['neon', 'outline-atomics'], 'arch': 'generic'}
Determined the best architecture to be  x86-64
Default net: nn-ad9b42354671.nnue
Already available.

Config:
debug: 'no'
sanitize: 'none'
optimize: 'yes'
arch: 'x86_64'
bits: '64'
kernel: 'Linux'
os: 'GNU/Linux'
prefetch: 'yes'
popcnt: 'no'
pext: 'no'
sse: 'yes'
mmx: 'no'
sse2: 'yes'
ssse3: 'no'
sse41: 'no'
avx2: 'no'
avxvnni: 'no'
avx512: 'no'
vnni256: 'no'
vnni512: 'no'
neon: 'no'
arm_version: '0'

Flags:
CXX: clang++
CXXFLAGS: -DNNUE_EMBEDDING_OFF -Wall -Wcast-qual -fno-exceptions -std=c++17  -pedantic -Wextra -Wshadow -m64 -DUSE_PTHREADS -DNDEBUG -O3 -fexperimental-new-pass-manager -DIS_64BIT -msse -DUSE_SSE2 -msse2 -flto
LDFLAGS:   -latomic -m64 -lpthread -DNNUE_EMBEDDING_OFF -Wall -Wcast-qual -fno-exceptions -std=c++17  -pedantic -Wextra -Wshadow -m64 -DUSE_PTHREADS -DNDEBUG -O3 -fexperimental-new-pass-manager -DIS_64BIT -msse -DUSE_SSE2 -msse2 -flto

Testing config sanity. If this fails, try 'make help' ...


Step 1/4. Building instrumented executable ...
make ARCH=x86-64 COMP=clang clang-profile-make

nimnananuk avatar Aug 26 '22 23:08 nimnananuk

It might be helpful to post the output of

g++ -Q -march=native --help=target

and

clang++ -E - -march=native -###

vdbergh avatar Aug 30 '22 18:08 vdbergh

Hey guys, I'm new around here so nice to meet you all!

I was trying to run fishtest on my Raspberry Pi 4 and I had the same problems with cutechess-cli. After reading your comments, I built cutechess-cli on that Raspberry Pi, and apparently it worked after moving the binary to fishtest/worker/testing :)

The compressed binary is attached here if you want to try it out as well. It was built from the latest source code available, and here's the output of cutechess-cli --version:

cutechess-cli 1.3.0-beta2
Using Qt version 5.15.6
Running on Arch Linux ARM/arm

As @vdbergh suggested, here's the output of g++ -Q -march=native --help=target on that Raspberry Pi:

The following options are target specific:
  -mabi=                      		aapcs-linux
  -mabort-on-noreturn         		[disabled]
  -mandroid                   		[disabled]
  -mapcs                      		[disabled]
  -mapcs-frame                		[disabled]
  -mapcs-reentrant            		[disabled]
  -mapcs-stack-check          		[disabled]
  -march=                     		armv8-a+crc+simd
  -marm                       		[enabled]
  -masm-syntax-unified        		[disabled]
  -mbe32                      		[enabled]
  -mbe8                       		[disabled]
  -mbig-endian                		[disabled]
  -mbionic                    		[disabled]
  -mbranch-cost=              		-1
  -mcallee-super-interworking 		[disabled]
  -mcaller-super-interworking 		[disabled]
  -mcmse                      		[disabled]
  -mcpu=                      		
  -mfdpic                     		[disabled]
  -mfix-cmse-cve-2021-35465   		[disabled]
  -mfix-cortex-a57-aes-1742098 		[disabled]
  -mfix-cortex-a72-aes-1655431 		-mfix-cortex-a57-aes-1742098
  -mfix-cortex-m3-ldrd        		[disabled]
  -mflip-thumb                		[disabled]
  -mfloat-abi=                		hard
  -mfp16-format=              		none
  -mfpu=                      		neon
  -mgeneral-regs-only         		[disabled]
  -mglibc                     		[enabled]
  -mhard-float                		-mfloat-abi=hard
  -mlibarch=                  		armv8-a+crc+simd
  -mlittle-endian             		[enabled]
  -mlong-calls                		[disabled]
  -mmusl                      		[disabled]
  -mneon-for-64bits           		[disabled]
  -mpic-data-is-text-relative 		[enabled]
  -mpic-register=             		
  -mpoke-function-name        		[disabled]
  -mprint-tune-info           		[disabled]
  -mpure-code                 		[disabled]
  -mrestrict-it               		[disabled]
  -msched-prolog              		[enabled]
  -msingle-pic-base           		[disabled]
  -mslow-flash-data           		[disabled]
  -msoft-float                		-mfloat-abi=soft
  -mstack-protector-guard-offset= 	
  -mstack-protector-guard=    		global
  -mstructure-size-boundary=  		8
  -mthumb                     		[disabled]
  -mthumb-interwork           		[disabled]
  -mtls-dialect=              		gnu
  -mtp=                       		cp15
  -mtpcs-frame                		[disabled]
  -mtpcs-leaf-frame           		[disabled]
  -mtune=                     		
  -muclibc                    		[disabled]
  -munaligned-access          		[enabled]
  -mvectorize-with-neon-double 		[disabled]
  -mvectorize-with-neon-quad  		[enabled]
  -mword-relocations          		[enabled]

  Known ARM ABIs (for use with the -mabi= option):
    aapcs aapcs-linux apcs-gnu atpcs iwmmxt

  Known __fp16 formats (for use with the -mfp16-format= option):
    alternative ieee none

  Known ARM FPUs (for use with the -mfpu= option):
    auto crypto-neon-fp-armv8 fp-armv8 fpv4-sp-d16 fpv5-d16 fpv5-sp-d16 neon neon-fp-armv8 neon-fp16 neon-vfpv3 neon-vfpv4 vfp vfp3 vfpv2 vfpv3 vfpv3-d16
    vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16

  Valid arguments to -mtp=:
    auto cp15 soft

  Known floating-point ABIs (for use with the -mfloat-abi= option):
    hard soft softfp

  Valid arguments to -mstack-protector-guard=:
    global tls

  TLS dialect to use:
    gnu gnu2
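
The g++ output above can be reduced to a small dictionary with the same shape as the "Available g++/cpu properties" line in the worker log earlier in this thread. This is a minimal sketch, not the worker's actual code: the function names and the shortened `SAMPLE` text are made up for illustration, and it assumes that the flags are the options marked `[enabled]`.

```python
import subprocess


def run_gpp_target_help():
    """Run the command suggested in the thread and return its stdout."""
    result = subprocess.run(
        ["g++", "-Q", "-march=native", "--help=target"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_gpp_target_help(text):
    """Collect the -march value and every -m option marked [enabled]."""
    flags = []
    arch = None
    for line in text.splitlines():
        fields = line.split()
        # Option lines look like "-mname value"; skip headers and value lists.
        if len(fields) != 2 or not fields[0].startswith("-m"):
            continue
        option, value = fields
        if option == "-march=":
            arch = value
        elif value == "[enabled]":
            flags.append(option[2:])  # drop the leading "-m"
    return {"flags": flags, "arch": arch}


# Shortened excerpt of the output posted above.
SAMPLE = """The following options are target specific:
  -mabi=                        aapcs-linux
  -march=                       armv8-a+crc+simd
  -marm                         [enabled]
  -mbe32                        [enabled]
  -mbe8                         [disabled]
  -mcpu=
  -mlittle-endian               [enabled]
  -mthumb                       [disabled]

  TLS dialect to use:
    gnu gnu2
"""

print(parse_gpp_target_help(SAMPLE))
```

On a machine with g++ installed, `parse_gpp_target_help(run_gpp_target_help())` would parse the live output instead of the excerpt.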

And the output of clang++ -E - -march=native -###:

clang version 14.0.6
Target: armv7l-unknown-linux-gnueabihf
Thread model: posix
InstalledDir: /usr/bin
 (in-process)
 "/usr/bin/clang-14" "-cc1" "-triple" "armv8-unknown-linux-gnueabihf" "-E" "-disable-free" "-clear-ast-before-backend" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "-" "-mrelocation-model" "pic" "-pic-level" "2" "-pic-is-pie" "-mframe-pointer=all" "-fmath-errno" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-target-cpu" "generic" "-target-feature" "+vfp2" "-target-feature" "+vfp2sp" "-target-feature" "+vfp3" "-target-feature" "+vfp3d16" "-target-feature" "+vfp3d16sp" "-target-feature" "+vfp3sp" "-target-feature" "+fp16" "-target-feature" "+vfp4" "-target-feature" "+vfp4d16" "-target-feature" "+vfp4d16sp" "-target-feature" "+vfp4sp" "-target-feature" "+fp-armv8" "-target-feature" "+fp-armv8d16" "-target-feature" "+fp-armv8d16sp" "-target-feature" "+fp-armv8sp" "-target-feature" "-fullfp16" "-target-feature" "+fp64" "-target-feature" "+d32" "-target-feature" "+neon" "-target-feature" "+sha2" "-target-feature" "+aes" "-target-feature" "-fp16fml" "-target-abi" "aapcs-linux" "-mfloat-abi" "hard" "-fallow-half-arguments-and-returns" "-debugger-tuning=gdb" "-fcoverage-compilation-dir=/home/mammoth/cutechess/build" "-resource-dir" "/usr/lib/clang/14.0.6" "-internal-isystem" "/usr/lib/clang/14.0.6/include" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/bin/../lib/gcc/armv7l-unknown-linux-gnueabihf/12.1.0/../../../../armv7l-unknown-linux-gnueabihf/include" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-fdebug-compilation-dir=/home/mammoth/cutechess/build" "-ferror-limit" "19" "-stack-protector" "2" "-fno-signed-char" "-fgnuc-version=4.2.1" "-fcolor-diagnostics" "-faddrsig" "-o" "-" "-x" "c" "-"

Hope it helps! :)

ocaio avatar Oct 27 '22 13:10 ocaio

Now I'm facing a different problem on my Raspberry Pi 4, the same one that @nimnananuk reported.

Apparently games.py is not selecting the proper architecture, so the build fails later in the run. It selects x86-32 by default. Here's partial output after starting the worker (using the cutechess-cli binary provided above):

Available Makefile architecture targets:  ['x86-64-vnni512', 'x86-64-vnni256', 'x86-64-avx512', 'x86-64-avxvnni', 'x86-64-bmi2', 'x86-64-avx2', 'x86-64-sse41-popcnt', 'x86-64-modern', 'x86-64-ssse3', 'x86-64-sse3-popcnt', 'x86-64', 'x86-32-sse41-popcnt', 'x86-32-sse2', 'x86-32', 'ppc-64', 'ppc-32', 'armv7', 'armv7-neon', 'armv8', 'e2k', 'apple-silicon', 'general-64', 'general-32', 'riscv64']
Available g++/cpu properties:  {'flags': ['arm', 'be32', 'glibc', 'little-endian', 'pic-data-is-text-relative', 'sched-prolog', 'unaligned-access', 'vectorize-with-neon-quad', 'word-relocations'], 'arch': 'armv8-a+crc+simd'}
Determined the best architecture to be  x86-32
Default net: nn-ad9b42354671.nnue
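
One way such a fallback could be avoided is sketched below. This is an illustrative heuristic only, not the logic in games.py: the function name `pick_arm_target` is hypothetical, the `machine` argument is assumed to come from something like `platform.machine()`, and `armv8` is assumed to be the 64-bit Makefile target.

```python
def pick_arm_target(props, machine, targets):
    """Pick an ARM Makefile target from compiler properties (illustrative only).

    props   -- dict like the "Available g++/cpu properties" log line
    machine -- machine string, e.g. from platform.machine() (assumption)
    targets -- list like the "Available Makefile architecture targets" log line
    """
    arch = props.get("arch") or ""
    flags = props.get("flags", [])
    is_arm = arch.startswith("arm") or machine.startswith(("arm", "aarch64"))
    if not is_arm:
        return None  # leave non-ARM hosts to the existing detection

    # Decide by the userland, not the CPU: a 32-bit OS on an armv8 CPU
    # still needs a 32-bit build.
    if machine in ("aarch64", "arm64"):
        candidates = ["armv8", "general-64"]
    else:
        has_neon = any("neon" in flag for flag in flags)
        candidates = (["armv7-neon"] if has_neon else []) + ["armv7", "general-32"]

    for candidate in candidates:
        if candidate in targets:
            return candidate
    return None


TARGETS = ["x86-64", "x86-32", "armv7", "armv7-neon", "armv8",
           "general-64", "general-32"]
PROPS = {
    "flags": ["arm", "be32", "glibc", "little-endian",
              "vectorize-with-neon-quad", "word-relocations"],
    "arch": "armv8-a+crc+simd",
}
# "armv7l" matches the clang target triple reported earlier in the thread.
print(pick_arm_target(PROPS, "armv7l", TARGETS))
```

The heuristic keys on the machine string because the reported `-march` value names the CPU rather than the 32-bit or 64-bit mode the system runs in.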

I'll try to get it working and post updates here soon. Meanwhile, any tips or advice you might have are more than welcome!

ocaio avatar Oct 27 '22 13:10 ocaio

I was able to run the worker using the armv7-neon architecture. The build succeeded (though it was quite slow), but apparently the Raspberry Pi 4 is not powerful enough for fishtest :cry:

Exception running games:
This machine is too slow (189264.0 nps / thread) to run fishtest effectively - sorry!
Informing the server
Heartbeat stopped
Post request https://tests.stockfishchess.org:443/api/failed_task handled in 988.10ms (server: 2.36ms)
Task exited
Waiting for the heartbeat thread to finish...
Deleting lock file /home/mammoth/fishtest/worker/worker.lock

Edit: after overclocking the Raspberry Pi 4 to 2GHz (up from the default 1.5GHz), performance improved, but it was still not enough...

Exception running games:
This machine is too slow (238685.0 nps / thread) to run fishtest effectively - sorry!
Informing the server
Heartbeat stopped
Post request https://tests.stockfishchess.org:443/api/failed_task handled in 995.33ms (server: 2.00ms)
Task exited
Waiting for the heartbeat thread to finish...
Deleting lock file /home/mammoth/fishtest/worker/worker.lock

ocaio avatar Oct 27 '22 15:10 ocaio

Hey guys, sorry for over-posting here, but one final thought occurred to me: running the Raspberry Pi's Broadcom BCM2711 (Cortex-A72) processor in 64-bit mode (armv8) instead of the default 32-bit mode (armv7), which some people report improves performance.

Here's the locally compiled version of cutechess-cli for armv8:

cutechess-cli 1.3.0-beta2
Using Qt version 5.15.6
Running on Arch Linux ARM/arm64

But it is still too slow for Fishtest :cry: Here's the partial output of worker.py when running at the regular clock speed (1.5GHz):

Exception running games:
This machine is too slow (219427.0 nps / thread) to run fishtest effectively - sorry!
Informing the server
Heartbeat stopped
Post request https://tests.stockfishchess.org:443/api/failed_task handled in 1016.36ms (server: 13.88ms)
Task exited
Waiting for the heartbeat thread to finish...
Deleting lock file /home/mammoth/fishtest/worker/worker.lock

And here's the same output when running overclocked (2GHz):

Exception running games:
This machine is too slow (249853.0 nps / thread) to run fishtest effectively - sorry!
Informing the server
Post request https://tests.stockfishchess.org:443/api/failed_task handled in 880.70ms (server: 3.37ms)
Task exited
Waiting for the heartbeat thread to finish...
Heartbeat stopped
Deleting lock file /home/mammoth/fishtest/worker/worker.lock

Once again, hope it helps somehow!

Edit: one curious thing: there was a performance improvement when running with the --concurrency 1 -m MAX parameters (in previous runs, concurrency was set to 3)! Still, it was not enough.

Exception running games:
This machine is too slow (370488.0 nps / thread) to run fishtest effectively - sorry!
Informing the server
Heartbeat stopped
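
To compare the runs, the nps figures can be pulled out of the worker messages with a short script. This is a rough sketch: the run labels are informal summaries of the posts above, the helper names are made up, and the worker's actual minimum nps is not quoted in this thread, so the script only compares the runs against each other.

```python
import re

# Matches the figure in "This machine is too slow (189264.0 nps / thread) ...".
NPS_PATTERN = re.compile(r"too slow \((\d+(?:\.\d+)?) nps / thread\)")


def extract_nps(log_text):
    """Return the nps value from a worker 'too slow' message, or None."""
    match = NPS_PATTERN.search(log_text)
    return float(match.group(1)) if match else None


def speedups(runs):
    """Return (label, nps, ratio to the first run) for each (label, log) pair."""
    parsed = [(label, extract_nps(log)) for label, log in runs]
    baseline = parsed[0][1]
    return [(label, nps, nps / baseline) for label, nps in parsed]


MESSAGE = "This machine is too slow ({} nps / thread) to run fishtest effectively - sorry!"
RUNS = [
    ("armv7-neon, 1.5GHz", MESSAGE.format("189264.0")),
    ("armv7-neon, 2GHz", MESSAGE.format("238685.0")),
    ("armv8, 1.5GHz", MESSAGE.format("219427.0")),
    ("armv8, 2GHz", MESSAGE.format("249853.0")),
    ("armv8, --concurrency 1", MESSAGE.format("370488.0")),
]

for label, nps, ratio in speedups(RUNS):
    print(f"{label:<24} {nps:>10.0f} nps  x{ratio:.2f}")
```

In these figures the drop from concurrency 3 to 1 makes a larger difference than either the overclock or the switch to 64-bit mode.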

ocaio avatar Nov 02 '22 15:11 ocaio

Hi @ocaio, thanks for your posts and for posting the cutechess-cli binary! It is a bit sad that the RPI seems to be too slow for Fishtest (although it comes close). I don't have an RPI, but I tried your binary in qemu-arm. It did not work, as it depends on an ARM version of Qt. Perhaps you can post this as well?

vdbergh avatar Nov 02 '22 16:11 vdbergh

Sure, @vdbergh! I've installed the Qt5 dependencies (qt5-base and qt5-svg) from the official Arch Linux ARM package repository, and they are attached here as well (for aarch64).

Do you need the 32-bit version too?

ocaio avatar Nov 02 '22 18:11 ocaio