7-men generation
New thread for discussing 7-men generation.
The generation looks fine so far. As for discussion, I would like to follow up on my earlier memory bandwidth problem: I have observed via "perf top" that these are the hot-spots during the iteration process.
The slowdowns go away when I use fewer threads, but it makes me wonder how people pulled this off a few years earlier: they would have had to aggregate physical memory from multiple machines over an interconnect an order of magnitude slower than UPI in order to hold the intermediate table, and even with a space-optimized indexing scheme that still cannot make up for the performance gap between UPI and such an interconnect.
Any insights?
What slows down your system the most must be the inter-node accesses. An improved algorithm might "pin" KK slices to nodes in a clever way and somehow streamline the accesses across KK slices, in particular those across nodes.
It might already help a lot to control what nodes are working on what parts of the table. Then inter-node accesses can be limited to looking up king moves. With 8 nodes, there only need to be inter-node accesses for moves of the white king. As it is now, moves of essentially all pieces can trigger inter-node accesses because a thread will usually be working on non-local memory.
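Something along these lines, as a minimal sketch of the pinning idea (NUM_NODES, node_of_wk() and is_local() are just illustrative names, not the generator's actual indexing): map the white king's square to an owning node, let each node's threads work only on its own KK slices, and then only white-king moves can leave the local node.

    /* Minimal sketch; NUM_NODES and the square-to-node mapping are assumptions. */
    #define NUM_NODES 8

    /* Hypothetical helper: white-king square 0..63 -> owning NUMA node. */
    static inline int node_of_wk(int wk_square)
    {
        return (wk_square >> 3) % NUM_NODES;   /* one board rank per node with 8 nodes */
    }

    /* A thread pinned to my_node treats a position as local iff the white king
       maps to its node; only white-king moves then cross node boundaries. */
    static inline int is_local(int wk_square, int my_node)
    {
        return node_of_wk(wk_square) == my_node;
    }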
The Lomonosov generator runs on a cluster with very many nodes and 8 cores per node. I suppose each node holds at least one KK slice and inter-node accesses are somehow queued and performed in batches. Some links: https://plus.google.com/100454521496393505718/posts/R2pxYGaW4JA https://plus.google.com/100454521496393505718/posts/WfM5ARPec7f https://plus.google.com/100454521496393505718/posts/DvmgwVHdLqD https://plus.google.com/100454521496393505718/posts/7cAF7dweRmn
I think the Lomonosov generator uses 2 bytes per position during generation. That avoids the need for reduction steps (which are more painful when generating DTM tables).
Earlier approaches used less RAM and a lot more time.
I think it makes sense to make the generator "NUMA aware" during generation, so I will start working on it.
How much time does a permutation step take on your machine? I think it is theoretically possible to speed this phase up by looping in the right way, but it seems very tricky to implement that in a general way (I haven't given it much thought yet).
The compression algorithm does many passes over the data, but it is very sequential so it should scale quite well to a big machine. Making it NUMA aware should still help, but it might not be crucial.
The other thing of course is to take into account same-piece symmetry.
Permutation step of DTZ usually takes about 5 minutes on a 4v3 table.
Time usage example:
Generating KQQBvKQQ
Found 531 tablebases.
number of threads = 64
Initialising broken positions.
time taken = 0:22.907   <- this can go very long if I raise number of threads
Calculating white captures.
time taken = 0:30.608
time taken = 1:02.598
time taken = 2:30.105   <- this can go very long if I raise number of threads
Calculating black captures.
time taken = 0:14.264
time taken = 0:47.799
time taken = 1:04.080
time taken = 0:58.022   <- this can go very long if I raise number of threads
Calculating mate positions.
time taken = 1:33.258   <- this can go very long if I raise number of threads
Iteration 1 ...
The most time-consuming steps are the first few iterations (Iteration 1~3) and the first save for reduction if it reaches "Iteration 62 ...", which can take some 30 minutes; the following saves are usually faster. The compression iteration steps are indeed pretty quick.
I am using interleaved memory allocation and not binding threads (letting the OS re-balance automatically), which turns out to be the fastest way. Excessive threads just made everything significantly slower, so the symptom looked as if the OS migration was the killer, which it wasn't.
I have now created a numa branch. Only rtbgen works for the moment. This might allow you to run the generator with many more threads without slowing down. It supports 2, 4 and 8 nodes.
I think you will have to disable "node interleaving" in the bios. (If I understand you correctly, you have now enabled it.)
I will test it in a moment.
In the BIOS there is no option that controls interleaving memory across NUMA nodes (only per-socket interleaving, if I read correctly). Manual Pg. 77.
This behavior was controlled on the OS side by invoking the generator via "numactl --interleave=all"; with "numactl -H" I can see the interleaved allocation across nodes, which eventually gets migrated to the busy nodes by the OS.
So I will run it without memory interleaving and/or with the OS NUMA re-balancer disabled and see how it goes.
Ah I see, then no need for a reboot. I think some mainboards have an option to just interleave all memory over all nodes and to hide the NUMAness from the OS.
It might still help to run with numactl --interleave=all, for example if the numa-aware rtbgen now does worse in the permutation and compression steps.
Oh I forgot to say that you need to add the "-n" option to enable NUMA. And to compile, you will need to have libnuma and its headers installed.
I have figured them out. And good news, it does not cause any slowdowns.
Some timings:
Generating KQQBvKRB
Found 536 tablebases.
Number of threads = 192
Number of NUMA nodes = 8
Initialising broken positions.
time taken = 3:38.691
Calculating white captures.
time taken = 0:04.877
time taken = 0:19.015
time taken = 0:18.662
Calculating black captures.
time taken = 0:03.883
time taken = 0:10.985
time taken = 0:14.552
time taken = 0:21.434
Calculating mate positions.
time taken = 0:19.627
Iteration 1 ... done.
Iteration 2 ...
One exception is the first broken-init step, during which the OS is busy migrating memory pages to the appropriate nodes, upgrading them to huge pages and/or doing parallel disk access, but I think it is not worth optimizing. "numactl -H" shows a roughly even memory distribution across the nodes, so "numactl --interleave=all" should not be necessary anymore.
Also, I have measured memory bandwidth and latency under load: local 106 GB/s (87 ns), remote 16 GB/s (160 ns ~ 220 ns).
So this brings about a very big performance improvement. Cheers!
Blazing fast confirmed, even with HT on.
Generating KQQBvKRB
Found 536 tablebases.
Number of threads = 384
Number of NUMA nodes = 8
Initialising broken positions.
time taken = 0:07.155
Calculating white captures.
time taken = 0:03.686
time taken = 0:15.008
time taken = 0:14.650
Calculating black captures.
time taken = 0:03.026
time taken = 0:07.907
time taken = 0:11.374
time taken = 0:18.956
Calculating mate positions.
time taken = 0:19.196
Iteration 1 ...
I solved the slow first step problem by "cat *.rtbw > /dev/null".
Excellent! So now I will have a look at rtbgenp.
The slow initialisation reminds me of a problem that Cfish had on NUMA machines when it used numa_alloc_interleaved() for allocating the TT. I changed that to letting each node fault in a part of the TT. It seems that numa_tonode_memory() also makes the initial allocation slow. I could try the same as I did for Cfish, but I wonder if the kernel might then start migrating pages between nodes.
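Roughly, the idea looks like this (a sketch under assumptions, not the actual Cfish or rtbgen code; first_touch_part and the even split are made-up names): one thread per node faults in its own part of the big mapping, so the first-touch policy allocates the pages locally.

    #include <numa.h>
    #include <string.h>
    #include <stddef.h>

    static void first_touch_part(char *map, size_t map_size, int node, int num_nodes)
    {
        size_t part  = map_size / num_nodes;
        char  *start = map + (size_t)node * part;
        size_t len   = (node == num_nodes - 1) ? map_size - (size_t)node * part
                                               : part;

        numa_run_on_node(node);   /* pin the calling thread to this node    */
        memset(start, 0, len);    /* first touch: pages end up on this node */
    }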
Btw, please check that the generated tables are correct, just to make sure I made no mistake. You could test this with a 6-piece table, for example.
That 7-second init was after a fresh reboot with the *.rtbw files already back in the system cache, so I guess it is not a problem now.
Tested a 6-piece table with numa branch, generator stats are correct, compressed files are correct, but compression iterations became very slow.
I overlooked the 7s initialisation of the second run. I guess the first time your system needed time to free up memory. Maybe it would have been sufficient to do echo 1 >/proc/sys/vm/drop_caches.
For speeding up the capture calculations it might help to precache the relevant *.rtbw files. It would also be possible to force them into memory when they are first mmap()ed by adding MAP_POPULATE to the flags in map_file() (util.c). This might be more important for 7-piece pawnful generation.
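A sketch of what that could look like (map_whole_file is an illustrative name, not a diff against the actual map_file() in util.c):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    static void *map_whole_file(const char *name, size_t *size)
    {
        int fd = open(name, O_RDONLY);
        if (fd < 0) { perror("open"); return NULL; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }
        *size = st.st_size;

        /* MAP_POPULATE pre-faults the pages, so later probes do not block on disk. */
        void *p = mmap(NULL, *size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }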
Did compression get slower while using the -d option? With -d, compression takes place in the memory originally allocated for generating the table. This memory is now pinned to nodes in a non-random way. It is probably bad if all threads are working on memory located at the same node.
This could probably be improved by letting the threads do the compression passes in a sort of random order instead of linearly from start to end.
Even better might be to rebind the memory to the nodes and let each thread work on local memory. But it might be too expensive to migrate all those pages.
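The random-order approach could look something like this sketch (NUM_CHUNKS, do_chunk and the chunking itself are illustrative, not how the generator currently splits the work): hand out chunk indices from a shuffled list, so the threads are spread over memory on different nodes at any given moment.

    #include <stdatomic.h>
    #include <stdlib.h>

    #define NUM_CHUNKS 4096

    static int order[NUM_CHUNKS];
    static atomic_int next_chunk;

    static void shuffle_order(void)
    {
        for (int i = 0; i < NUM_CHUNKS; i++)
            order[i] = i;
        for (int i = NUM_CHUNKS - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
    }

    static void compression_worker(void (*do_chunk)(int))
    {
        int i;
        /* Each thread grabs the next shuffled chunk instead of walking the
           table linearly from start to end. */
        while ((i = atomic_fetch_add(&next_chunk, 1)) < NUM_CHUNKS)
            do_chunk(order[i]);
    }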
The last checking run was not using the -d option; with it there is the same slowdown (perf says it is in replace_pairs_u8). It wasn't slow before because I had never gotten this far, to the compression step, with this many threads. Maybe it is because a 6-piece table's memory is too small to split among so many threads? I can try compressing a 7-piece table and see what happens.
The situation on the other machine that is building pawnful tables is rather different: it has zero slowdowns without NUMA partitioning when running with all cores (4 nodes), and I can't figure out why.
How many threads are you using on the other machine?
88 threads, on an E5-4669 v4. Actually, running with 176 threads works just fine too, whereas the newer machine suffers beyond 64 threads no matter how I fiddle with it, unless it is running the NUMA branch. P.S. I just checked the prices: that v4 is more expensive than the Scalable part, which might explain it.
And following up on the earlier compression slowdown: it is the same with a 7-piece table, even though that memory is allocated anew.
I'm wrapping "run_threaded" with a thread-limit parameter and skipping the while loop in "worker" for threads beyond the limit; it only applies to the compression step. The results look good and the tables built are correct.
Generation is now about 10x faster, permutation 5x (5 min -> 1 min), and compression remains the same because I limited its threads.
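Roughly like this (a simplified sketch of the idea, not the actual patch; thread_limit, next_item, total_items and the work loop are illustrative):

    #include <stddef.h>
    #include <stdatomic.h>

    static int thread_limit;        /* 0 means no limit; set per phase */
    static atomic_long next_item;
    static long total_items;

    static void *worker(void *arg)
    {
        int id = *(int *)arg;

        if (thread_limit && id >= thread_limit)
            return NULL;            /* threads beyond the limit skip the work loop */

        long i;
        while ((i = atomic_fetch_add(&next_item, 1)) < total_items) {
            /* ... process work item i as before ... */
        }
        return NULL;
    }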
10x faster is very good! That permutation becomes so fast (because of more threads, I guess) is surprising. It moves around a lot of data. Compression will hopefully improve with changes I am working on now.
Improving pawnful generation will be a bit more tricky. I think I will first NUMA-ify generation with 1 pawn and then use the same technique as in the ppppp branch for multiple pawns.
I may also have a look at the reduction step.
I don't know if the v4 is supposed to have higher bandwidth than the newer generation, but it makes sense that being NUMA aware is more important on 8 nodes than on 4 nodes.
Stats verification failed for KQQNvKQB after reconstruction; I am comparing with master to find out why. It was probably my thread-limit patch that caused it. I have it fixed now and no tables were affected.
Strange, the way you described it, it sounded OK.
See my reference PR; the first version could skip the other NUMA queues entirely. I'm regenerating with the third version to be sure.
I made some changes, which I have tried to explain in the commit message. I have not taken into account your PR yet (except for fix_closs, I think).
When running without -d, the generator now makes the permutation step NUMA aware. This also allows the compression step to be NUMA aware.
The reason the compression step is slow with many threads might be that big arrays are being accessed that are in remote memory. It shouldn't be too difficult to allocate those arrays per node, but I have not worked on that yet. I will study your PR later to see if this could be the explanation.
For now, you could re-implement the changes in your PR by changing HIGH to LOW where necessary.
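Per-node allocation could look roughly like this sketch (the array, its element type, its size and MAX_NODES are made up; numa_alloc_onnode() and numa_free() are the real libnuma calls):

    #include <numa.h>
    #include <stdint.h>
    #include <stddef.h>

    #define MAX_NODES 8

    static uint16_t *node_table[MAX_NODES];   /* one local copy per NUMA node */

    static int alloc_per_node(size_t entries, int num_nodes)
    {
        for (int n = 0; n < num_nodes; n++) {
            node_table[n] = numa_alloc_onnode(entries * sizeof(uint16_t), n);
            if (!node_table[n])
                return -1;                    /* caller can fall back to one shared copy */
        }
        return 0;
    }

    static void free_per_node(size_t entries, int num_nodes)
    {
        for (int n = 0; n < num_nodes; n++)
            numa_free(node_table[n], entries * sizeof(uint16_t));
    }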
Compression with many threads works better than before, with less congestion, but there is still a slowdown, and the same goes for the transform step.
It should be problem-free. Tables KQQNvKQR and KQQNvKQB were built successfully.
OK, then I will move those big arrays to local memory for each node. Whether it makes a big difference or not, it cannot hurt.
I can set them to LOW threads, and LOW can use a bit more than 64 threads without slowdowns; that's good enough since those steps don't take long anyway. This is just a poorly balanced hardware configuration that backfires badly; when it doesn't backfire, the OS is perfectly capable of migrating threads and/or memory across nodes to reduce NUMA latency, as it should.
KQRRvKRN still has some slowdowns with 384 threads in the black WDL permutation, but it takes only 2 minutes if I use 64 threads.