lz
tensor-accum-0.17/dev+/uniform questions/discussions
Hi, I tried to build your fastexit-tensor-accum+ branch on Ubuntu 16.04, following the steps in the readme (quoted below), but the build fails with the errors shown here. Any idea how to fix this?
```
cmake --build .
[  3%] Built target gtest
[  7%] Built target gtest_main
[  9%] Building CXX object CMakeFiles/objs.dir/src/UCTSearch.cpp.o
lz/src/UCTSearch.cpp:268:45: warning: unused parameter ‘thread_num’ [-Wunused-parameter]
     int thread_num) {
                 ^
lz/src/UCTSearch.cpp: In member function ‘int UCTSearch::think(int, UCTSearch::passflag_t)’:
lz/src/UCTSearch.cpp:860:18: error: converting to ‘std::queue<std::unique_ptr<BackupData> >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr<BackupData>; _Sequence = std::deque<std::unique_ptr<BackupData>, std::allocator<std::unique_ptr<BackupData> > >]’
     backup_queue = {};
                  ^
lz/src/UCTSearch.cpp: In member function ‘void UCTSearch::ponder()’:
lz/src/UCTSearch.cpp:944:18: error: converting to ‘std::queue<std::unique_ptr<BackupData> >’ from initializer list would use explicit constructor ‘std::queue<_Tp, _Sequence>::queue(_Sequence&&) [with _Tp = std::unique_ptr<BackupData>; _Sequence = std::deque<std::unique_ptr<BackupData>, std::allocator<std::unique_ptr<BackupData> > >]’
     backup_queue = {};
                  ^
At global scope:
cc1plus: warning: unrecognized command line option ‘-Wno-mismatched-tags’
cc1plus: warning: unrecognized command line option ‘-Wno-ignored-attributes’
CMakeFiles/objs.dir/build.make:254: recipe for target 'CMakeFiles/objs.dir/src/UCTSearch.cpp.o' failed
make[2]: *** [CMakeFiles/objs.dir/src/UCTSearch.cpp.o] Error 1
CMakeFiles/Makefile2:143: recipe for target 'CMakeFiles/objs.dir/all' failed
make[1]: *** [CMakeFiles/objs.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2
```
Build instructions from the readme:
```
sudo apt install clinfo && clinfo

git clone https://github.com/gcp/leela-zero
cd leela-zero
git submodule update --init --recursive

sudo apt install libboost-dev libboost-program-options-dev libboost-filesystem-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev

mkdir build && cd build
cmake ..
cmake --build .
./tests
curl -O https://zero.sjeng.org/best-network
./leelaz --weights best-network
```
Yeah, some compilers can't deal with this. I suggest changing `backup_queue = {};` to `while (!backup_queue.empty()) { backup_queue.pop(); }` in both `think()` and `ponder()`. Not sure if this is less efficient or not.
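For reference, here is a minimal standalone sketch of the workaround. BackupData is just a stand-in struct here, not the project's real definition:

```cpp
// Minimal sketch, not the actual UCTSearch code: two ways to clear a
// std::queue<std::unique_ptr<T>>. Assigning {} trips the explicit deque
// constructor on some older standard libraries; draining with pop() is portable.
#include <memory>
#include <queue>

struct BackupData {};  // stand-in for the real struct

int main() {
    std::queue<std::unique_ptr<BackupData>> backup_queue;
    backup_queue.push(std::make_unique<BackupData>());

    // Portable workaround: drain the queue element by element.
    while (!backup_queue.empty()) {
        backup_queue.pop();
    }

    // Another option that avoids `backup_queue = {};`: swap with a fresh queue.
    std::queue<std::unique_ptr<BackupData>>().swap(backup_queue);
    return 0;
}
```

Either form should be cheap; both simply destroy the queued elements.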
May I ask what compiler you are using? I have now tried gcc 5.4 and clang 3.8. Even after changing `backup_queue = {};` as suggested, there is a new error with both compilers:
```
In file included from lz/src/OpenCL.cpp:36:
lz/src/OpenCL.h:77:22: error: implicit instantiation of undefined template 'std::atomic'
    std::atomic m_occupied{0};
                ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:126:12: note: template is declared here
    struct atomic;
           ^
In file included from lz/src/OpenCL.cpp:36:
lz/src/OpenCL.h:78:22: error: implicit instantiation of undefined template 'std::atomic'
    std::atomic idle_count{0};
                ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:126:12: note: template is declared here
    struct atomic;
           ^
```
I think the error indicates you need `#include <atomic>` in OpenCL.h. People have compiled successfully on Ubuntu before; gcc 8.1.0 seems to be working.
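A minimal sketch of why the include matters (this is not the actual OpenCL.h; the member names are copied from the error output above, and the `int` template argument is an assumption):

```cpp
// Minimal sketch, not the real OpenCL.h. Without <atomic>, only a forward
// declaration of std::atomic may be visible via indirect includes, so
// instantiating the members below fails as in the error above.
#include <atomic>

class ThreadData {  // hypothetical class name
public:
    std::atomic<int> m_occupied{0};
    std::atomic<int> idle_count{0};
};

int main() {
    ThreadData data;
    data.m_occupied.store(1);
    return data.idle_count.load();
}
```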
Thank you, it works with the suggested #include.
Thank you for testing! There definitely remains work to be done. Can you tell me what GPUs you have, which other branches (gcp/next, ihavnoid/batch-full, ihavnoid/tensorcore, or others?) you are comparing my branch with, and what parameters (--batchsize, -t) you are using in each case?
You may now try https://github.com/alreadydone/lz/tree/tensor-accum-dev+.
Tested on Google Cloud:
15270 pos/s with 4xV100, 256x19 net, and command
./leelaz -w ../../990.gz --batchsize 12 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --benchmark -v 200000 --worker 4
38865 n/s, 27054 pos/s with 8xV100, 256x19 net, and command
./leelaz --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32 --benchmark -v 200000 -w ../../990.gz
(both with 24vCPUs)
You can specify --batchsize and --worker separately for each GPU, e.g. for two GPUs (--gpu 0 --gpu 1) you can add --batchsize 12 --batchsize 16 --worker 3 --worker 2, etc. The -t parameter has no effect with this branch; the number of threads is simply the sum of worker threads over all GPUs.
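For illustration, here is a minimal sketch of how repeated flags can be collected into per-GPU settings with boost::program_options (which leela-zero's command line is built on and which appears in the readme's dependency list). This is not the branch's actual option-handling code, and the pairing and default behavior described in the comments are assumptions:

```cpp
// Illustrative sketch only -- not the branch's real option parsing.
#include <algorithm>
#include <boost/program_options.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

namespace po = boost::program_options;

int main(int argc, char* argv[]) {
    std::vector<int> gpus, batchsizes, workers;

    po::options_description desc("Options");
    desc.add_options()
        ("gpu", po::value<std::vector<int>>(&gpus), "GPU id (repeatable)")
        ("batchsize", po::value<std::vector<int>>(&batchsizes), "batch size per GPU (repeatable)")
        ("worker", po::value<std::vector<int>>(&workers), "worker threads per GPU (repeatable)");

    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);

    // Assumed pairing: the i-th --batchsize / --worker goes with the i-th --gpu;
    // if fewer values are given, the last one is reused (1 is a placeholder default).
    for (std::size_t i = 0; i < gpus.size(); ++i) {
        const int bs = batchsizes.empty() ? 1 : batchsizes[std::min(i, batchsizes.size() - 1)];
        const int wk = workers.empty() ? 1 : workers[std::min(i, workers.size() - 1)];
        std::cout << "GPU " << gpus[i] << ": batchsize " << bs << ", workers " << wk << "\n";
    }
    return 0;
}
```

Run with e.g. `--gpu 0 --gpu 1 --batchsize 12 --batchsize 16 --worker 3 --worker 2`, it prints one line per GPU with its batch size and worker count.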
Looks very promising! I will look into it during the weekend.
By the way, with so many readouts, is there a way to increase exploration?
A bug has been fixed in the tensor-accum-dev+ approach.
An experimental branch that gradually pushes the policy towards uniform as visits increase, to widen the search and help find blind spots, is https://github.com/alreadydone/lz/tree/tensor-accum-uniform (based on tensor-accum-dev+).
Two parameters are added: when a position's visit count reaches the value of --uniform-visits (default 1,000,000), all moves are considered equal in terms of policy. Below that value, the policy gradually drifts towards uniform as visits accrue. The parameter --exponent (default 1) controls how fast the policy drifts. Exponent 0 means the policy doesn't drift at all and is uniform from the start.
To recover original behavior, set --uniform-visits to a very large number, and leave --exponent untouched.
This is inspired by some recent discussions, e.g. at https://github.com/LeelaChessZero/lc0/issues/743
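For intuition, here is one plausible way such a blend could be implemented. This is an illustration only: the actual formula in the branch may differ, and the function and parameter names below are made up.

```cpp
// Illustrative sketch only -- not the code in the tensor-accum-uniform branch.
// Blend a move's policy prior towards uniform as the position's visit count
// grows, becoming fully uniform once visits reach `uniform_visits`.
#include <algorithm>
#include <cmath>
#include <cstdio>

double blended_prior(double raw_prior, int num_moves, double visits,
                     double uniform_visits, double exponent) {
    const double uniform = 1.0 / num_moves;
    // Weight of the uniform component: 0 at zero visits, 1 at uniform_visits
    // and beyond. exponent = 1 gives a linear ramp; exponent = 0 makes the
    // policy uniform immediately.
    const double w = std::pow(std::min(visits / uniform_visits, 1.0), exponent);
    return (1.0 - w) * raw_prior + w * uniform;
}

int main() {
    // Example: a 10% prior among 362 legal moves, halfway to --uniform-visits.
    std::printf("%.4f\n", blended_prior(0.10, 362, 500000, 1000000, 1.0));
    return 0;
}
```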
@alreadydone That's really nice! Progressive squashing is even better than any of my fix formulas... Just this morning I pushed 100k playouts on an empty board with the LZ200 net, on my old PC (CPU only), only to find, after a long while, that only 4-4 and 3-4 had gotten any visits. Your fix will definitely help. Thank you. I will learn how to compile so that I can play with it.
So I tried tensor-accum-uniform. There is no need for the #include fix anymore.
For benchmarking I start leelaz and send "genmove B". I tried two different sets of parameters, without using --uniform-visits and --exponent:

A) ./tau_leelaz -w best-network.gz -t 64 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 8 --batchsize 64
B) ./tau_leelaz -w best-network.gz -t 64 --gpu 0 --gpu 1 --gpu 2 --gpu 3 --gpu 4 --gpu 5 --gpu 6 --gpu 7 --worker 3 --batchsize 32
The first game with A) started with Black playing tengen (K10). Quite interesting, to say the least. The second game with A) also started with Black playing tengen (K10), and White liked to play 5-4 first and then enclose the corner with 3-4.
The first and second games with B) looked normal, with the same opening the current nets like to play: all 4-4 points and a 6-3 approach, later a double approach.
With leela #207 (40x256) I get about 25000-27000 n/s for the first genmove B with A), and about 21000-24000 n/s with B).
What confuses me a little bit is the GPU utilization. During the first genmove B, "nvidia-smi -l" shows the following utilization: 0%/14%/43%/0%/28%/13%/0%/44% (just an example, but I tested this a couple of times and only some GPUs are utilized while others stay at 0%; maybe because of bad timing at the beginning by nvidia-smi). After issuing the following commands and waiting until they finished: genmove B, genmove W, genmove B, genmove W, genmove B, the utilization of all GPUs jumps to 99% and stays there, even without issuing any further commands. Could it be because of pondering?
Sometimes when exiting leelaz with "exit" it throws a segmentation fault (core dumped).
All in all it looks very promising (a 1.4x improvement), but tengen makes me a bit skeptical ;)
- The uniform branch defaults --uniform-visits to 1,000,000. If you want to recover the original search behavior, use for example --uniform-visits 10000000000000.
- I haven't observed tengen being played on the first move, and it's definitely strange to see #207 play it. It's probably caused by a large number of concurrent threads making the search very wide at every level of the tree, combined with the uniformization of the policy. With --uniform-visits as above, maybe the engine won't play tengen even with 64x8 threads.
- Look at pos/s to benchmark performance instead of n/s: pos/s is the number of positions actually processed by the GPUs, while n/s also includes positions found in the cache and obtained from symmetry.
- It's not recommended to use the empty board to test performance. The n/s value will be boosted because the 8-fold symmetry of the empty board gives 700% free playouts (up to seven extra playouts per evaluated position). The pos/s value, on the other hand, will be dragged down because the search can't find enough unevaluated positions to feed the GPUs, and there is probably contention when accessing the NNCache, which is mutex protected. That's probably why GPU utilization is low at the first move but full after four moves; however, some GPUs' utilization staying at 0% still surprises me. Instead, use --benchmark, which uses an asymmetric position three moves into the game, or load an SGF into the midgame and genmove from there. In general, higher batchsize and worker counts lead to higher pos/s, but once you are able to saturate the GPUs or reach maximum pos/s in such normal positions, it's not recommended to increase --batchsize and --worker further. Of your A) and B), --worker 3 --batchsize 32 is the more reasonable one, though I think the batchsize can be decreased further.
> After issuing the following commands ... the utilization for all GPUs jumps to 99% and stays there, even without issuing any further commands. Could it be because of pondering?

- Yeah, these branches will keep pondering if --noponder is not set, unless you issue the command stop or name (just hitting a key won't stop it). However, some people told me that --noponder doesn't work, and I have yet to confirm this bug.
- All threads should be joined when exiting, so a segmentation fault is unexpected.
- Thanks for testing; I'll keep an eye on the identified issues when I test.
Added notice 4/29/2019: The "fractional backup" feature causes the displayed visits to be lower than the playouts (much lower if the batchsize or the number of GPUs is large); it can be disabled with --disable-frac-backup.
I experimented a little more. It seems that the uniform branch really does find some moves that normal leela (0.16) does not find. But it still takes quite some time before the optimal move is really considered and investigated further. I do not know the specifics, but the recent discussion about LCB makes me wonder whether LCB + uniform would improve performance even more. Could LCB be easily combined with uniform? Or maybe you already did...?
Just pushed https://github.com/alreadydone/lz/tree/tensor-accum-uniform-0.17; https://github.com/alreadydone/lz/tree/tensor-accum-0.17 was pushed a few days ago. These have the official 0.17 release merged in, including LCB.
Thank you for the update. The new version with 0.17 seems to have some problem: GPU utilization is now always only around 30-40%, whereas before it was 80-99%. I used --worker 3 --batchsize 32 and also tried lower batchsizes, but GPU utilization never goes higher than ~30%. Do I have to adjust the parameters for 0.17?
@Umsturz Thanks for the report. The problem is now fixed. In the earlier version, the engine didn't read the batchsize from the command line and always set it to 1, due to a glitch in merging.