leela-chess
Core dump when running leela-chess on Google Colab
Hi folks,
When I downloaded the 118 weights and ran the following commands:
%%bash
apt install libboost-all-dev libopenblas-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev
apt install clinfo && clinfo
apt install cmake
git clone https://github.com/glinscott/leela-chess.git
cd leela-chess
git submodule update --init --recursive
mkdir -p build && cd build
cmake ..
make
./tests
wget -O weights.txt.gz http://lczero.org/get_network?sha=1d1b1a4d9d708ef04d7714b604bddea29122ec2027369e111197f7b9537b1bf8
gunzip weights.txt.gz
cp ../scripts/train.sh .
./train.sh
I got the following error message:
Using 1 thread(s).
Generated 1924 moves
Detecting residual layers...v1...64 channels...Using 1 thread(s).
Generated 1924 moves
Detecting residual layers...v1...64 channels...6 blocks.
6 blocks.
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
what(): clGetPlatformIDs
./train.sh: line 13: 2659 Aborted (core dumped) ./lczero --weights=weights.txt --randomize -n -t1 --start="train 1" > training.out
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
what(): clGetPlatformIDs
./train.sh: line 13: 2660 Aborted (core dumped) ./lczero --weights=weights.txt --randomize -n -t1 --start="train 2" > training2.out
Have you tried the tensorflow version? (lc0 folder)
The advantage of using Google Colab is that everyone with a Google account gets free access to 4 GPUs, so if this is actually an issue with leela-chess (and not just my mistake), it is worth correcting: you would instantly gain 4 more GPUs for every person who runs leela-chess.
mkdir -p run && cd run
cp ~/leela-chess/build/lczero .
wget -O client_linux https://github.com/glinscott/leela-chess/releases/download/v0.4/client_linux
chmod +x client_linux && ./client_linux --user djinnome --password XXXX --gpu 1
results in the same OpenCL failure when attempting to get the platform ID:
Args: [/content/run/lczero --weights=networks/94c816e13232334d6b69353c23ee3185afbc3dd3ab104125131bb93aa1c26e8f -t1 --randomize -n -v1600 -l/content/run/logs-2619/20180411034925.log --start=train 2619-0 1 --gpu=0]
Logging to /content/run/logs-2619/20180411034925.log.
Using 1 thread(s).
Generated 1924 moves
Detecting residual layers...v1...64 channels...6 blocks.
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
what(): clGetPlatformIDs
2018/04/11 03:49:27 signal: aborted (core dumped)
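A likely cause (an assumption based on the clGetPlatformIDs failure): the OpenCL ICD loader cannot find any registered platform. With the ocl-icd loader installed above, platforms are registered via *.icd files, so a quick first check is:

```shell
# The ocl-icd loader discovers OpenCL platforms through *.icd files
# in /etc/OpenCL/vendors; if none exist, clGetPlatformIDs fails and
# lczero aborts exactly as in the log above.
if ls /etc/OpenCL/vendors/*.icd >/dev/null 2>&1; then
  echo "OpenCL ICDs registered"
else
  echo "no OpenCL ICDs registered"
fi
```

On a stock Colab VM this typically reports nothing registered until an NVIDIA OpenCL ICD is installed.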
Just to prove that I really do have a GPU:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
results in:
Found GPU at: /device:GPU:0
Hey @Akababa, thanks for the suggestion.
It appears that leela-chess/lc0/build.sh needs a few extra packages:
apt install meson ninja-build clang
Now when I try to run ./build.sh, I get the following error:
The Meson build system
Version: 0.42.1
Source dir: /content/leela-chess/lc0
Build dir: /content/leela-chess/lc0/build
Build type: native build
Project name: lc0
Meson encountered an error in file meson.build, line 1, column 0:
Value "c++17" for combo option "cpp_std" is not one of the choices. Possible choices are: "none", "c++03", "c++11", "c++14", "c++1z", "gnu++11", "gnu++14", "gnu++1z".
ninja: error: loading 'build.ninja': No such file or directory
Any suggestions?
This is the entirety of build.sh:
#!/usr/bin/bash
rm -fr build
CC=clang CXX=clang++ meson build --buildtype release
# CC=clang CXX=clang++ meson build --buildtype debugoptimized
cd build
ninja
Looking at leela-chess/lc0/meson.build, I see that I need to install TensorFlow from source, because my /usr/local does not contain the files that are expected below. I am also wondering what I need to upgrade/install so that c++17 is an acceptable value for cpp_std, as per the error above.
project('lc0', 'cpp',
default_options : ['c_std=c17', 'cpp_std=c++17'])
# add_global_arguments('-Wno-macro-redefined', language : 'cpp')
cc = meson.get_compiler('cpp')
# Installed from https://github.com/FloopCZ/tensorflow_cc
tensorflow_cc = declare_dependency(
include_directories: include_directories(
'/usr/local/include/tensorflow',
'/usr/local/include/tensorflow/bazel-genfiles',
'/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads',
'/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads/eigen',
'/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads/gemmlowp',
'/usr/local/include/tensorflow/tensorflow/contrib/makefile/downloads/nsync/public',
'/usr/local/include/tensorflow/tensorflow/contrib/makefile/gen/protobuf-host/include',
),
dependencies: [
cc.find_library('libtensorflow_cc', dirs: '/usr/local/lib/tensorflow_cc/'),
cc.find_library('dl'),
cc.find_library('pthread'),
cc.find_library('libprotobuf', dirs: '/usr/local/lib/tensorflow_cc/'),
],
)
deps = []
deps += tensorflow_cc
deps += cc.find_library('stdc++fs')
# deps += dependency('libprofiler')
Meson v45 seems not to know about c++17 yet. Can be worked around with:
project('lc0', 'cpp')
add_global_arguments('-std=c++17', language : 'cpp')
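Whether the workaround is needed can be checked from the installed Meson version; the assumption here is that "c++17" became a valid cpp_std choice in Meson 0.43 or later, so this sketch only handles the 0.x series:

```shell
# Decide which meson.build form to use, based on the installed Meson
# version (0.x series only). Falls back to the add_global_arguments
# workaround when meson is missing or older than 0.43 (assumed cutoff).
ver=$(meson --version 2>/dev/null || echo "0.42.1")
minor=$(echo "$ver" | cut -d. -f2)
if [ "$minor" -ge 43 ] 2>/dev/null; then
  echo "cpp_std=c++17 should be accepted"
else
  echo "use the add_global_arguments workaround"
fi
```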
I'll change that in the config.
FYI, if you want to build the TensorFlow version of lc0, it is here in my fork with some fixes: https://github.com/mooskagh/leela-chess. I still suspect there's something wrong with it (it often blunders in won positions), but you may want to try it.
It looks like the OpenCL drivers are not working. What does clinfo give?
Hi @glinscott,
You are right. clinfo returns the following:
Number of platforms 0
However, tensorflow can find the GPU:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
results in:
Found GPU at: /device:GPU:0
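This mismatch is not contradictory: TensorFlow reaches the GPU through CUDA's driver library (libcuda) directly, while lczero goes through the OpenCL ICD loader (libOpenCL), and the two paths are independent. A quick, Linux-only sketch to see which entry points the dynamic linker knows about:

```shell
# TensorFlow needs libcuda.so; lczero's OpenCL backend needs
# libOpenCL.so plus a registered ICD. Check both via the linker cache.
for lib in libcuda.so libOpenCL.so; do
  if { ldconfig -p 2>/dev/null || /sbin/ldconfig -p 2>/dev/null; } | grep -q "$lib"; then
    echo "$lib: present"
  else
    echo "$lib: missing"
  fi
done
```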
OK, I ran apt install nvidia-cuda-toolkit and now clinfo returns the following:
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 9.0.282
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
Platform Extensions function suffix NV
Platform Name NVIDIA CUDA
Number of devices 1
Device Name Tesla K80
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 384.111
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Profile FULL_PROFILE
Device Topology (NV) PCI-E, 00:00.4
Max compute units 13
Max clock frequency 823MHz
Compute Capability (NV) 3.7
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 11995578368 (11.17GiB)
Error Correction support Yes
Max memory allocation 2998894592 (2.793GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 212992
Global Memory cache line 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 4096x4096x4096 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max constant buffer size 65536 (64KiB)
Max number of constant args 9
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) No
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) NVIDIA CUDA
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [NV]
clCreateContext(NULL, ...) [default] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
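The "Number of platforms" line is the key figure: it reflects the same enumeration that lczero's clGetPlatformIDs call performs at startup. A small sketch (assuming clinfo is installed) to extract just that count:

```shell
# Pull the platform count out of clinfo's report; default to 0 when
# clinfo is missing or produces no matching line.
platforms=$(clinfo 2>/dev/null | awk '/Number of platforms/ {print $NF; exit}')
echo "OpenCL platforms: ${platforms:-0}"
```

Anything greater than zero means the earlier cl::Error at clGetPlatformIDs should no longer occur.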
The full successful workflow is as follows:
apt install cmake nvidia-cuda-toolkit git-all libboost-all-dev libopenblas-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev
apt install clinfo && clinfo
followed by:
git clone https://github.com/glinscott/leela-chess.git
cd leela-chess
git submodule update --init --recursive
mkdir -p build && cd build
cmake ..
make
followed by
cp leela-chess/build/lczero .
wget -c https://github.com/glinscott/leela-chess/releases/download/v0.6/client_linux
chmod +x client_linux && ./client_linux --user <your username> --password XXX --debug
Thanks @blin00 for putting up the wiki page.
You can also copy the notebook I got working, which includes saving to Google Drive and the tuning step from @blin00.
One of my colab notebooks is throwing an error trying to make leela-chess (other 2 work fine):
!cd leela-chess && cd build && make
In file included from /content/leela-chess/src/OpenCL.h:27:0,
from /content/leela-chess/src/OpenCLScheduler.h:26,
from /content/leela-chess/src/Network.cpp:49:
/content/leela-chess/src/CL/cl2.hpp:5857:63: warning: ignoring attributes on template argument ‘cl_int {aka int}’ [-Wignored-attributes]
typename std::enable_if<!std::is_pointer<T>::value, cl_int>::type
^
/content/leela-chess/src/CL/cl2.hpp:6157:22: warning: ignoring attributes on template argument ‘cl_int {aka int}’ [-Wignored-attributes]
vector<cl_int>* binaryStatus = NULL,
^
/content/leela-chess/src/Network.cpp: In static member function ‘static void Network::initialize()’:
/content/leela-chess/src/Network.cpp:507:33: error: ‘openblas_get_corename’ was not declared in this scope
myprintf("BLAS Core: %s\n", openblas_get_corename());
^~~~~~~~~~~~~~~~~~~~~
/content/leela-chess/src/Network.cpp:507:33: note: suggested alternative: ‘openblas_set_num_threads’
myprintf("BLAS Core: %s\n", openblas_get_corename());
^~~~~~~~~~~~~~~~~~~~~
openblas_set_num_threads
CMakeFiles/objs.dir/build.make:134: recipe for target 'CMakeFiles/objs.dir/src/Network.cpp.o' failed
make[2]: *** [CMakeFiles/objs.dir/src/Network.cpp.o] Error 1
CMakeFiles/Makefile2:104: recipe for target 'CMakeFiles/objs.dir/all' failed
make[1]: *** [CMakeFiles/objs.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
Any ideas? Restarting the runtime does not help, and creating a new notebook from scratch results in the same error. The other notebooks work just fine, and this one also used to before a disconnect.
Just a guess, but maybe try !make clean before rerunning make? If not, then:
!rm -rf leela-chess
!git clone https://github.com/glinscott/leela-chess.git
!mkdir -p leela-chess/build && cd leela-chess/build && cmake ..
followed by:
!cd leela-chess/build && make
Inserting !rm -rf leela-chess after the apt-install block and before the git-clone block, and !cd leela-chess && cd build && make clean before the cmake block, did the trick. 1000+ nps rolling smoothly.
(c)make must have been corrupted/confused somehow; maybe the runtime did not restart cleanly or something along those lines. Thanks.
Sir,
I cannot connect to Google Colab any more, no matter how hard I try; I get the message "failed to assign a backend" each time. Apart from this, when I was connected, Google Colab kept disconnecting, sometimes after only 4 minutes; the longest uninterrupted connection was around 2 hours. Can you or someone else fix these two issues, please?
The Go version needed an Ubuntu 18.xxx script, and I am not sure whether the same applies for chess; see the middle of this issue about a major change in the script needed for it to work: https://github.com/gcp/leela-zero/issues/1923