KataGo
The ranking of weight files
The ranking of "weight category" winners (https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=261700#p261700):
"bantamweight" (<= 2^23 B, < 12 MiB): g170e-b10c128-s1141046784-d204142634.bin (6th);
"featherweight" (2^24 B, 12-24 MiB): none;
"lightweight" (2^25 B, 24-48 MiB): g170e-b15c192-s1672170752-d466197061.bin (5th);
"welterweight" (2^26 B, 48-96 MiB): g170e-b20c256x2-s5303129600-d1228401921.bin (4th);
"middleweight" (2^27 B, 96-192 MiB): g170-b40c256x2-s5095420928-d1229425124.bin (2nd);
"light heavyweight" (2^28 B, 192-384 MiB): g170-b30c320x2-s4824661760-d1229536699.bin (1st);
"heavyweight" (2^29 B, 384-768 MiB): g170e-b40c384x2-s2348692992-d1229892979.bin (3rd);
"super heavyweight" (>= 2^30 B, > 768 MiB): none.
From early testing, I recall that g170-b40c256x2-s5095420928-d1229425124.bin should be stronger than g170-b30c320x2-s4824661760-d1229536699.bin even at equal playouts. It's faster, smaller, and stronger, so there's no reason to use the 30-block network.
If you found the 30-block network to be stronger, that might be due to playing too few games to measure accurately.
Maybe it's because the cache was bigger (6.2 and 6.4 GB of memory used during play)? Anyway, they are about the same in memory usage, speed, and strength (see http://rugo.ru/read.php?2,68490,8034238,page=21#msg-8034233)...
KataGo v1.7.0 isn't stronger than v1.6.1 (https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=261980#p261980).
You are right, because v1.7.0 does not add any new features that affect strength; it mostly adds things like CUDA 11 support and interface changes that make analysis tools much more flexible.
However, if I understand your link correctly, you are right only by accident, because the number of games you've played is far too few to reliably determine that one version isn't stronger than another. Please be wary of drawing conclusions from such tiny numbers of games. For example, in the past I have absolutely had genuinely stronger versions of a bot lose more than they won even after 50 or 100 games, simply due to unlucky statistical noise, before hundreds and even thousands more games finally showed (almost certainly) that the true winning chance was greater than 50% and that the version losing at first was truly the stronger one.
Which of course means that even smaller numbers of games like 8 games, or 4 games, are definitely too few to make any reliable determinations of strength, beyond establishing that neither side is completely outright buggy and disastrously weaker (if the result wasn't entirely lopsided).
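To make that concrete, here is a quick sketch (my own numbers, not from any actual test): suppose the genuinely stronger version truly wins 55% of games. Even then, short matches quite often fail to show it:

```python
# Rough illustration of how statistical noise can hide a real strength difference.
# Assumes a hypothetical true win rate of 55% for the genuinely stronger version.
from math import comb

def prob_at_most(wins, games, p):
    """Probability of winning at most `wins` out of `games`, with win prob `p` per game."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k) for k in range(wins + 1))

for games in (50, 100):
    print(games, prob_at_most(games // 2, games, 0.55))
# Roughly 28% of 50-game matches and 18% of 100-game matches end with the
# stronger version scoring 50% or below, purely by chance.
```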
Can a Zero-like engine become stronger in any way other than training the neural net and optimizing the computation in general?
KataGo beats LeelaZero in all categories (https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=263775#p263775).
KataGo has been stronger than LZ for a while now across the board, so that seems consistent, but you should be aware that the series of tests that that user has been posting are usually too few games to make confident statistical judgments, and their results have been noisy or had minor misrankings in the past as a result. In general please disregard those tests and pay attention to other sources instead, unless something has changed recently in their testing methodology.
OK. And what do you think about tests from other sources with a large number of games but very few visits per move?
Yes, those tests are often more reliable. You don't want visits to be too small either, but usually if you have to pick, you should err on the side of having more games. For example, if you have to pick 800 visits per move in order to be able to afford 400 games, usually that's better than to use 8000 visits per move and only test 40 games.
The reason is that once you are above some minimum basic level of visits where threading and discretization effects matter a lot (so, ideally you have at least several hundred visits per move), it is uncommon for even a 10x visits difference to change the relative ranking of different networks or engines. It does happen, and certainly the magnitude of Elo gaps will change, but not often and not by a lot. For bots that would otherwise be close, I'd be surprised if relative Elos commonly changed even by as much as 50 points.
Whereas, for example, at 40 games, your 95% confidence intervals will be very wide, larger than 100 points. With so few games, you can't even tell apart those differences with reliability anyways!
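As a rough check (a sketch of my own, using the standard logistic Elo model), here is how wide the 95% Elo interval is after a 40-game match that happens to end in an even score:

```python
# How wide the Elo error bars are after a 40-game match with an observed 50% score.
# Uses the standard logistic Elo model; the 20-20 result is just an example.
from math import sqrt, log10

def elo_from_winrate(p):
    """Elo difference implied by a win rate p under the logistic Elo model."""
    return 400 * log10(p / (1 - p))

games = 40
observed = 0.5                                 # e.g. a 20-20 match result
se = sqrt(observed * (1 - observed) / games)   # binomial standard error
lo, hi = observed - 1.96 * se, observed + 1.96 * se
print(elo_from_winrate(lo), elo_from_winrate(hi))
# Roughly -111 to +111 Elo: far wider than a 100-point difference.
```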
Lastly, as a minor nitpick: added playouts per move is probably a fairer thing to hold constant than visits per move, because visits per move penalizes bots that are configured to search more sharply and therefore get more tree reuse, in a way that doesn't match computation time.
Anyways, for people who don't have enough compute power to run things like 800 playouts per move * 400 games, or who prefer to run smaller numbers of games just for fun, it's still fine to test things! Just be clear when you advertise your results that they are only anecdotal/suggestive/uncertain, unless you got a really extreme result, like winning 17:3 or something like that. And really, unless you're trying to compile a ranking list or trying to present your results in a way that people would mistake as official, anyone should feel free to play with things and have fun! I really enjoy some of the games that have been posted here https://lifein19x19.com/viewtopic.php?f=18&t=17944&start=60, and that Maharani has posted here: https://forums.online-go.com/t/katago-7-komi-self-play-games/25142/88.
OK. If an end user plays only a few dozen games within a certain range of time*performance values, how important to him is a difference in strength between compared engines (networks) that can only be determined after hundreds of games at a minimum (and in a different range of settings)? I think that if this difference does not show up in a few games, it has no practical interest and is just for fun for such a user (though not for a developer). Am I right?
Well, all of this is just for fun. It's interesting to see how far we can converge to optimal, and what new moves and amazing tesuji we can discover along the way. Is that practical? I don't know. It's certainly fun! And even though individual strength improvements require a lot of games to measure, if you can gradually accumulate many of them together, you can build up to very large differences - e.g. KataGo winning 90+% of games versus ELF is the accumulation of a lot of individual small improvements over time, each of which might have been harder to measure on its own. So it matters, added up over time.
The other thing is that there is a BIG difference between:
A. The number of games you need for a given strength difference to have a good chance to make a difference and be of practical interest.
B. The number of games you need for a given strength difference so that the stronger side confidently or almost surely wins the match.
C. The number of games you need to accurately measure the strength difference with high confidence.
At each step, the number of games multiplies by a lot.
For example, suppose X wins against Y with odds of 3:2. That's an important difference, right? Even if you just play 5 or 10 games, X has a practical advantage that matters, even though Y will still win some of the time. So A is achieved with just a few games.
How many games do you need in a match so that 95% of the time, X will win more than half the games? It turns out you need 67 games. And actually 95% is not extremely confident. A 5% chance definitely still happens, and a 5% chance of simply being wrong is not ideal if you or others are making long-term decisions based on it. Ideally, to be more sure, you'd probably want at least 100 games to achieve B and be confident in the result, instead of the weaker side getting lucky.
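As a sanity check on that 67-game figure, here is a minimal sketch, assuming independent games with X winning each one with probability 0.6:

```python
# Probability that the 3:2 favorite (p = 0.6 per game) wins strictly more than
# half of a 67-game match, under a simple independent-games binomial model.
from math import comb

def prob_majority(n, p):
    """Probability the favorite wins strictly more than half of n games."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(prob_majority(67, 0.6))  # comes out to roughly 0.95, matching the figure above
```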
How many games do you need to know that the ratio is 3:2, and distinguish it from 2.5:2 or 3.5:2 or other values? Well, even if you run enough games to be 99% confident that X wins, there might be only a 70% chance that the observed ratio falls between 2.5:2 and 3.5:2. So even if X almost surely wins, due to luck the amount by which X wins can be very different from the true 3:2. So to accurately measure and satisfy C, you need even more games.
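To put rough numbers on that (again a sketch of my own; the 150-game match length is a hypothetical choice that gives about 99% confidence that X wins the match at p = 0.6):

```python
# Even with ~99% confidence that X wins the match, the observed ratio often
# misses the 2.5:2 to 3.5:2 window. Hypothetical 150-game match, p = 0.6.
from math import comb

def prob_wins_between(n, p, lo_frac, hi_frac):
    """Probability the favorite's win count lands between lo_frac*n and hi_frac*n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if lo_frac * n <= k <= hi_frac * n)

# Ratios 2.5:2 and 3.5:2 correspond to win fractions 2.5/4.5 and 3.5/5.5.
print(prob_wins_between(150, 0.6, 2.5 / 4.5, 3.5 / 5.5))
# Roughly 0.7: the measured ratio is often noticeably off from the true 3:2.
```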
Obviously you will also need more games if the true strength difference is smaller than 3 : 2, which is commonly the case in testing.
If you make a confident claim that "X is stronger than Y" based on some test games, that requires that you ran enough games so that the weaker side could not have won by chance, so you need to achieve at least B above, even though the number of games where it practically matters - A - could be a lot less.
Yes, I fully agree with you: those statistics over hundreds of games are important for developers. But for us (end users) only the A statistics matter. All human world championship matches (not only in Go) have had few games (a few dozen at most). By this logic, for the general public, matches between "superhuman" engines (neural net weights) should not have a larger number of games either, should use the same time controls as in the human case to match capabilities, and should be run on PCs (modern high-end ones, say) for a sense of accessibility.
On the other hand, in the topic http://rugo.ru/read.php?2,68490 it was written that if both sides score at least 2, the engines are practically equivalent; at 3:1 the first is most likely not weaker than the second; and at 4:0 the first is most likely stronger than the second...
Despite "A lot of internal changes with hopefully all the critical changes needed to support public contributions for the distributed run opening shortly, as well as many bugfixes, and stronger search logic. New subtree value bias correction method has been added to the search, which should be worth somewhere between 20 and 50 Elo for mid-thousands of playouts. Fixed a bug in LCB move selection that prevented LCB from acting on the top-policy move. The fix is worth perhaps around 10 Elo. Root symmetry sampling now samples without replacement instead of with replacement, and is capped at 8, the total number of possible symmetries, instead of 16." v.1.8.0-v.1.7.0: 18-18 (https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=263844#p263844)...
Maybe in the case of AI the only way to strengthen play is by upgrading the neural net?
No, v1.8.0 should be better than v1.7.0, a little.
Think of it this way: any time you run an "A"-statistics match, unless the difference in strength is very large, it is like rolling a die. If you roll a 3, 4, 5, or 6, you get a noisy version of the truth, and if you roll a 1 or a 2, you get unlucky and the match lies to you and gives the WRONG result: the WEAKER side wins the match due to luck, or it's a draw despite there being a true, important strength gap.
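To quantify the die analogy with a sketch (my own made-up but plausible numbers: suppose the new version truly wins 55% of games):

```python
# Chance that a 36-game match (like the 18-18 test above) fails to show a real
# but modest gap. Assumes a hypothetical true win rate of 55% for the stronger side.
from math import comb

def prob_at_most(wins, games, p):
    """Probability of winning at most `wins` out of `games` with per-game win prob `p`."""
    return sum(comb(games, k) * p**k * (1 - p)**(games - k) for k in range(wins + 1))

print(prob_at_most(18, 36, 0.55))
# Roughly 1 in 3: much like rolling a 1 or a 2 on the die.
```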
Let me emphasize again, even if end users only care about "A" matches, you should NOT use "A" matches to make strong conclusions. Therefore, things like this:
Maybe in the case of AI the only way to strengthen play is by upgrading the neural net?
Are not things you can conclude. From an 18-18 result, the only thing you can conclude is that v1.8.0 is not enormously weaker than v1.7.0. But in truth, it is stronger under most conditions, even though the difference is small, and if we make ten more changes that each gain the same amount, the difference will add up to be very, very noticeable.
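To see how small gains stack up, here is an illustrative sketch (my own numbers: suppose each individual change is worth about 10 Elo, far too small to see in a short match):

```python
# Ten changes of ~10 Elo each add up to ~100 Elo, which is very noticeable.
# Standard logistic Elo model; the 10-Elo-per-change figure is illustrative.
def winrate_from_elo(d):
    """Expected win rate at an Elo advantage of d points."""
    return 1 / (1 + 10 ** (-d / 400))

print(winrate_from_elo(10))    # ~0.514: invisible in a handful of games
print(winrate_from_elo(100))   # ~0.640: ten such changes stacked are hard to miss
```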
It is NOT a strong conclusion in the developer's sense. It's a practical conclusion from the end user's point of view: by downloading the new version, the user doesn't get any noticeable difference in strength. So the user may download the new version without fear that side effects of the changes have enormously weakened its play, and gets the benefit of the bug fixes. But, on the other hand, the user doesn't get any strengthening of play noticeable to him by downloading the new version...
For example, with a score of 15:15 (https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=267839#p267839), KataGo v1.9.1 isn't stronger than v1.8.0 from the end user's point of view...
I think such a small difference in score (19-17, see: https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=269755#p269755) doesn't prove (for the end user) that the new version is stronger...
New weight files are stronger from the end user's point of view (unlike new engine versions): https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=270899#p270899. If a year ago the "light heavyweight" category was the strongest (as it is with LeelaZero or SAI), at the end of 2021 it lost the lead, because only "middleweight" and "heavyweight" files were trained that year.
I'm glad you're enthusiastic, but I still don't understand why you insist on using such a tiny number of games (only 4 per network!!) and justifying it on the basis of wanting to serve "end users".
If that is the only computation power you can afford, sure. I absolutely respect and appreciate doing the best one can with limited resources. No problem! :)
But if instead it's a deliberate choice to use fewer games to better match what end users would experience, then it's silly. Rather than deliberately using an error-prone measurement because you think most users will not notice, there's no harm in using an accurate measurement (more games) and reporting the accurate difference. Then each user can decide for themselves if the accurately-reported difference is big enough to care about.
Four games per test is especially few. Consider a bot A that beats B 60% of the time. I would guess most people would consider that not a huge difference, but still a respectable one. However, with only 4 games, the chance that B beats A 3-1 or 4-0 is about 18%! So there is an 18% chance you'd come up with the entirely backwards conclusion.
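That 18% figure is easy to verify with a quick sketch, using 60% as A's per-game win probability:

```python
# Probability that the weaker bot B (40% per game) wins a 4-game match 3-1 or 4-0.
from math import comb

p_b = 0.4  # B's per-game win probability against A
print(comb(4, 3) * p_b**3 * (1 - p_b) + p_b**4)  # 0.1792, i.e. about 18%
```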
You've argued many times in the past that "end users" will only use the bot for few games themselves, therefore the way to make the best recommendation is to test using only a few games because it better matches the usage, rather than tests with a large number of games. We can see by the following example that such logic isn't very good:
- Suppose we did do a 4 game test and we did get a 3-1 result in favor of B (getting a result that was only 18% likely is very possible!).
- Suppose we also did a 1000 game test and this time, the result was that A won 613 games and B won 387 games.
Consider a user who plans to use either bot A or bot B in a tournament where it will play 4 games, and they want the bot with the best chance of doing well. Based on the above two tests, which bot should we recommend to them? Should we trust the 4-game test and recommend B because the tournament will also be 4 games, therefore a 4-game test is the most reliable? Or should we trust the 1000-game test and recommend A because the 1000-game test is overall a more accurate measurement?
Obviously we should recommend bot A to them!
We can see here a clear demonstration that the principle "if end users will only notice larger differences and will only be using the bot for a very few games, then the best way to make a good recommendation is to also run tests using only a very few games" is a bad principle. The way to make a good recommendation to an end user who will run few games is to use many times more games than they will use.
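For completeness, here is what the hypothetical 1000-game result from the example above actually tells us (a sketch using a simple binomial confidence interval):

```python
# 95% confidence interval for A's true win rate from the hypothetical 613-387 result.
from math import sqrt

wins, games = 613, 1000
p_hat = wins / games
se = sqrt(p_hat * (1 - p_hat) / games)
print(p_hat - 1.96 * se, p_hat + 1.96 * se)
# Roughly [0.58, 0.64]: A is almost certainly the better pick, even for a 4-game event.
```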
Because in the end user's view (unlike the developer's), if the stronger of two engines (or weight files) can only be determined after dozens of games, then these engines are of almost equal strength (an end user doesn't play dozens of games in a row). For developers, of course, this difference is significant, because it shows the right direction for strengthening play. For the end user the main thing is that the tests are fair, that is, that the resources in them are allocated strictly equally and that their usage corresponds to real games (unlike the developers' tests with their smaller time settings). Most users cannot evaluate the accuracy of the developers' tests and thus decide for themselves whether the difference is big enough to care about. For them it is enough to simply conclude the following: 4:4, almost equivalent; 3:1, the first is most likely not weaker; 4:0, the first is most likely stronger. 60% is significant for developers, but for end users ~90% is significant. Can you evaluate the chance that the weaker side beats the stronger one 4:0 in this case?
"End user" is a user, that plays with engines with classical time sets to increase his own level of play and learn skill in playing Go in general (not that uses engines like bots in tournaments). No one argues that more games means more accurate results. But the question in testing practice is the next: "How to get the maximum accuracy for a given time?". In this case "more games" means "miner time sets" that can lead to the entirely backwards conclusion, when the value of "time set multiply by performance" is significantly different from that uses end user.
KataGo v1.10.0 vs v1.11.0 with weight files of the strongest half of the "weight categories": 11-5 (see "details" at https://lifein19x19.com/viewtopic.php?f=18&t=13322&p=272356#p272356)...
After 2 years of training, the "heavyweight" weight file became a bit stronger (12-8) than the "light heavyweight" one (details).
The score of KataGo v1.12.4 vs v1.11.0 (eigen) with the stronger half of the weight files: 6-8 (see "details").
The new weight file (b18c384nbt-uec.bin) is the strongest (details).
I received a message from here in my e-mail:
Have you ever compared the Elo rating of the b18c384 weight and that of the weight which defeated ELF OpenGo noted in Tony J. Wu’s KataGo paper? It might be more informative to show how far this new weight is stronger than that old superhuman weight.
I don't know which version of ELF was used in the mentioned paper, but the ELF version converted for LeelaZero was last the strongest in 2018. After that, LeelaZero's own weights of the same "category" became stronger than the ELF one (details). And the KataGo weights are much stronger than the LeelaZero ones (details)...
Version 1.13.0 is stronger than the previous one. But all its victorious sparring matches were with b(15-20)... weight files only...
Weight file b18c384nbt-optimisticv13-s5971M.bin is significantly (13-7) stronger than b18c384nbt-uec.bin (details).
KataGo weight files of the "heavyweight" category didn't become stronger in 2023 (details).