Stockfish
Missing Unit Tests
Describe the issue
Am I missing something, or are unit tests missing?
I was considering doing more pull requests, but without unit tests I am afraid of breaking something. Furthermore, how are we supposed to notice regressions without unit tests?
Expected behavior
I would have expected a folder of unit tests using a modern framework like Google Test, together with a compile option to build the project without them.
Steps to reproduce
Open the project, go to the test folder and notice that there are no unit tests.
Anything else?
No response
Operating system
All
Stockfish version
9fe9ff00823a0d1e989fcfc45410396e94345987
Testing is mostly done via fishtest (https://tests.stockfishchess.org/tests), our GitHub CI, and your local compiler (lols).
Additionally, you could compile in debug mode to see if you hit any asserts.
This seems to be integration testing. Integration tests and unit tests are not the same: an integration test cannot replace unit tests, and unit tests cannot replace integration tests.
And manual debugging is very error prone. One change can affect other parts in ways you didn't expect, and without unit tests refactoring becomes really hard, which may result in code not being refactored at all. That is, technical debt increases.
Yeah, I was just referring to testing in general.
@StefanoD Can you give a concrete example of a unit test that would be useful in the context of a chess engine?
IMHO the difficulty is that it is not clear what constitutes a bug in a chess engine. For example, in normal software, access to the hash table (a shared resource) should be protected by locks. In a chess engine, however, race conditions are tolerated while accessing the hash table, as testing has shown that the overall effect on Elo is positive.
@vdbergh E.g. finding best moves, finding legal moves. In the context of the hash table: testing for low collision rates, testing for performance. All of this is relevant when you change code.
Or just simple things like parsing FEN strings, or even printing information to the terminal when you refactor the code. For example, I don't like the fact that some temporary buffers are created, written with sprintf, and then streamed into a stringstream. First, creating temporary buffers costs unnecessary performance; second, it is error prone because of possible memory access violations. This is why I created this pull request. I managed to test this manually with the godbolt online compiler, without running the whole program (I actually don't know how to use the program yet, because I'm new to it), but it gets much more complicated with other, similar code that I was considering refactoring too.
In short: the idea of unit tests is to test (almost) every function, so that regressions can be found instantly and without much pain.
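To make the FEN-parsing idea concrete, here is a minimal sketch of such a unit test in Google Test. The Position::set()/Position::fen() calls, their signatures, and the need for the usual engine initialization are assumptions about the current API, not code taken from the repository.

#include <string>
#include <gtest/gtest.h>
#include "position.h"

// Hypothetical round-trip check: parsing a FEN and printing it back should
// reproduce the original string. Assumes Bitboards::init()/Position::init()
// have run in the test main(), and that set() has this signature in the
// version at hand.
TEST(Fen, RoundTrip) {
    const std::string fen =
        "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3";
    StateInfo st;
    Position  pos;
    pos.set(fen, /*isChess960=*/false, &st);
    EXPECT_EQ(pos.fen(), fen);
}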
@StefanoD Can you give a concrete example of a unit test that would be useful in the context of a chess engine?
IMHO the difficulty is that it is not clear what constitutes a bug in a chess engine. For example, in normal software, access to the hash table (a shared resource) should be protected by locks. In a chess engine, however, race conditions are tolerated while accessing the hash table, as testing has shown that the overall effect on Elo is positive.
Verifying move legality on a number of different test positions could be a unit test, or repetition testing?
In the context of the hash table: testing for low collision rates, testing for performance
these are things I would say are good candidates to test on fishtest.
For example, I don't like the fact that some temporary buffers are created, written with sprintf, and then streamed into a stringstream. First, creating temporary buffers costs unnecessary performance; second, it is error prone because of possible memory access violations.
The mentioned sprintf is, first of all, completely irrelevant for performance: the eval command is for debugging purposes and is not called hundreds of times while SF is searching. Also, declaring and defining local variables inside a function scope always places the variables on the stack, not the heap (indeed, no allocator-related functions are called).
The patch you've submitted is actually heavier and slower if not properly optimized by the compiler, but let's not discuss that here.
Next, sprintf never causes a memory violation once the buffer exists in memory. I don't know what reasons you're thinking of, but most printf-related security issues come from %n plus format string bugs (which were forbidden in the early 2000s), and from %s with user input that is not NULL-terminated (read overflow, memory leak); neither case is present in SF code.
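As an aside, here is a tiny standalone illustration (plain C++, not Stockfish code) of the property both sides are arguing about: sprintf trusts the caller to have sized the destination correctly, while snprintf bounds the write to the buffer it is given.

#include <cstdio>

int main() {
    char buf[4];                                     // room for 3 characters + '\0'
    // std::sprintf(buf, "%d", 123456);              // would write 7 bytes: undefined behaviour
    std::snprintf(buf, sizeof(buf), "%d", 123456);   // bounded: stores "123" plus '\0'
    std::puts(buf);                                  // prints "123"
    return 0;
}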
@StefanoD Legal move generation is typically tested with perft. Your other examples in the first paragraph are not convincing. One doesn't care about best moves in individual positions in a chess engine (such a thing isn't even properly defined), only about the overall Elo (which is determined by testing on Fishtest). And the number of collisions in SF's hash table is quite high, as testing has again shown that the overall Elo effect is positive.
I am not against unit testing but I think that in SF's context it might not be as useful as you think it is.
Something that would be suitable for unit testing is the tablebase code. But what would be really useful is documentation of the file format.
@vdbergh I think you tagged the wrong person; did you mean me?
I was actually talking more about Position::legal()...
@vdbergh
Next, sprintf never causes a memory violation once the buffer exists in memory.
No, if the buffer is too small or a string is not null-terminated, you'll get an access violation.
If I really have to convince you that not having unit tests is bad, then I guess you'll have to deal with more regressions and fewer contributors.
@Disservin No, @StefanoD also gave legal move generation as an example. I don't quite know what Position::legal() does. If it is used in legal move generation then it will be checked by perft.
I guess a unit test could consist of running perft on a number of positions with known perft values (the starting position might not be sufficient for obscure corner cases).
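A sketch of what that could look like, using published node counts from the starting position. The perft() helper below is written against Stockfish's MoveList<LEGAL>/do_move interface as an assumption; the engine's own perft lives elsewhere and its exact API may differ between versions.

#include <cstdint>
#include <gtest/gtest.h>
#include "movegen.h"
#include "position.h"

// Count leaf nodes of the legal move tree to a fixed depth (assumed API).
static uint64_t perft(Position& pos, int depth) {
    if (depth == 0)
        return 1;
    uint64_t nodes = 0;
    for (const auto& m : MoveList<LEGAL>(pos)) {
        StateInfo st;
        pos.do_move(m, st);
        nodes += perft(pos, depth - 1);
        pos.undo_move(m);
    }
    return nodes;
}

TEST(Perft, StartPosition) {
    StateInfo st;
    Position  pos;
    pos.set("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", false, &st);
    EXPECT_EQ(perft(pos, 1), 20u);
    EXPECT_EQ(perft(pos, 2), 400u);
    EXPECT_EQ(perft(pos, 3), 8902u);   // well-known reference values
}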
Position::legal() is only run in our main search function; it's used to verify that the ttmove is legal in the position. Perft doesn't use it.
No, if the buffer is too small or a string is not null-terminated, you'll get an access violation.
char buffer[3][8];
std::memset(buffer, '\0', sizeof(buffer));
Have you checked the previous lines?
I agree with introducing unit tests btw, but you should provide a detailed list of the scopes we need to test. As Disservin mentioned, SF is already being tested on a pretty decent testing framework (fishtest).
@Disservin Ok. Perhaps Position::legal() only verifies that the move is pseudo legal?
@Disservin Ok. Perhaps Position::legal() only verifies that the move is pseudo legal?
Nope, there are Position::pseudo_legal() and Position::legal(); the two connected together should be true for any legal move and false for any other, non-legal move. You could test this by trying all 2^16 combinations (moves are 16 bits large) on a given position. That would at least shrink the realm of errors that could happen; however, since Stockfish has not crashed playing an illegal move, we can assume they work as intended.
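A sketch of that exhaustive check, assuming Move can be constructed from a raw 16-bit value and that pseudo_legal(), legal() and MoveList<LEGAL> keep the roles described above. The test position is the well-known "Kiwipete" FEN, chosen only because it exercises castling and pinned pieces.

#include <cstddef>
#include <gtest/gtest.h>
#include "movegen.h"
#include "position.h"

TEST(MoveLegality, AllEncodingsAgreeWithGenerator) {
    StateInfo st;
    Position  pos;
    pos.set("r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1",
            false, &st);

    // Every raw 16-bit encoding should be accepted by pseudo_legal() && legal()
    // exactly as often as the legal move generator produces a move.
    std::size_t accepted = 0;
    for (unsigned data = 0; data < (1u << 16); ++data) {
        Move m = Move(data);                  // assumed constructible from a raw value
        if (pos.pseudo_legal(m) && pos.legal(m))
            ++accepted;
    }
    EXPECT_EQ(accepted, MoveList<LEGAL>(pos).size());

    // And every generated legal move must itself pass both checks.
    for (const auto& m : MoveList<LEGAL>(pos)) {
        EXPECT_TRUE(pos.pseudo_legal(m));
        EXPECT_TRUE(pos.legal(m));
    }
}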
No, if the buffer is too small or a string is not null-terminated, you'll get an access violation.
char buffer[3][8]; std::memset(buffer, '\0', sizeof(buffer));
Have you checked the previous lines?
Yes, I have, and the function format_cp_aligned_dot() that was called relied on the buffer being correctly constructed by the caller. But the function can be called from other places too, and then the logic has to be reimplemented there. And what if you change the formatting? Then you might need a different buffer size and have to adjust the buffer at every call site, which is error prone.
As I said: unit tests are also for catching regressions when refactoring.
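For what it's worth, one way to remove that coupling is to let the formatting helper own its output. A hypothetical sketch follows; the name echoes the function under discussion, but the body is illustrative and not Stockfish's actual code.

#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical variant: return a std::string instead of writing into a
// caller-supplied char buffer, so callers no longer have to agree on a
// buffer size whenever the formatting changes.
std::string format_cp_sketch(double pawns) {
    std::ostringstream ss;
    ss << std::fixed << std::setw(7) << std::setprecision(2) << pawns;
    return ss.str();
}

A unit test can then pin the expected output of this one function without reconstructing any caller's buffer.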
I agree with introducing unit tests btw, but you should provide a detailed list of the scopes we need to test. As Disservin mentioned, SF is already being tested on a pretty decent testing framework (fishtest).
As you know, I am absolutely not an expert in this project. Thus, I thought you could name more scopes than I can. But if a project doesn't have unit tests, then, for me, something is not optimal. In my opinion, every project must absolutely have unit tests.
Of course, it's good that you have integration tests!
I tend to agree with @vdbergh and others.
Not every project should have unit tests per se; that is just way too dogmatic. It depends on the case. But anyway, if some more good examples are found of where unit tests add value, you might convince me :)
As I said: unit tests are also for catching regressions when refactoring.
Fishtest tests both for non-regression of Elo and for gain of Elo. In fact I'm tempted to say that your pull request, which I'm generally in favor of, should be put through a non-regression test on fishtest just to be sure that it is sane.
How long do the fishtest and perft tests take to run, and can newcomers run them locally after making an experimental code change?
Stockfish/src (master)$ time ../tests/perft.sh
perft testing started
perft testing OK
real 0m2.313s
user 0m0.016s
sys 0m0.020s
Fishtest is a distributed testing framework; testing locally isn't its purpose. The perft tests finish in under 1 minute. Anyone can create a fishtest account, btw.
perft.sh is only a small test. This could also be accomplished with a unit test framework, but a unit test framework invites developers to write other small tests, for example to test a single function after refactoring it. That is as easy as it can get.
@TheDarkKnightRise @Disservin I know, but I thought that it was sufficient to check that the hash move is pseudo legal.
Since Stefano has a very clear idea of the kind of unit tests he/she would like, but has not proposed any unit test code in the last six months, I'll close the issue as not urgent :-)
You can add unit tests for known bugs, like #4877. It's a good practice to add a failing unit test for every defect, and then make sure the test passes after the bugfix. The point is that the developers can demonstrate that they isolated what is wrong with the code, proved that it is wrong, fixed it, proved that the fix is correct. source.
You can add unit tests for known bugs, like #4877. It's a good practice to add a failing unit test for every defect, and then make sure the test passes after the bugfix. The point is that the developers can demonstrate that they isolated what is wrong with the code, proved that it is wrong, fixed it, proved that the fix is correct. source.
Well, with some bugs, like the one you mentioned, it is basically impossible to unit test. It depends a lot on what is currently done in search and what not. In theory these things should be caught in the CI, and writing a unit test for undefined behavior is not really possible anyway… code can have undefined behavior but still happen to give the right result when you run the unit test.
~I'm not familiar with C++, so I don't know how it can be achieved in this language. But this case seems pretty standard. @Disservin, are you saying that it's impossible to mock dependencies, including std::clamp (e.g. use the normal implementation, but throw an exception if max value < min value), and then call the search function with parameters that simulate the specific scenario in which std::clamp would be called with a max value smaller than the min value? With this setup the unit test will fail in 100% of runs. Then apply the fix proposed in #4877 and the test will pass. Such tests would act as a safety net against regressions.~
upd: ok, ChatGPT proposes wild solutions for mocking std functions, so let me agree that it's not that straightforward to unit test a bug that happens because of undefined behavior in the C++ standard library.
tl;dr: unit testing search is not as straightforward as one might think
But this case seems pretty standard.
Well "unit testing" undefined behavior is not pretty standard to my knowledge. The closest thing which you can do is run the program with -fsanitize=undefined
, but that is not really related to any unit and rather testing the program as a whole then.
are you saying that it's impossible to mock dependencies,
I'm not.
including std::clamp (e.g. use the normal implementation, but throw an exception if max value < min value)
std::clamp is a function from the standard, so we can't change the underlying implementation (though my libcxx has an assertion in place, __glibcxx_assert(!(__hi < __lo)), which one could enable, I guess).
and then call the search function with parameters that simulate the specific scenario in which std::clamp
This is easier said than done; search changes all the time. So each new functional patch would require finding a special state in which certain conditions are met that trigger the problematic behaviour, or everything in search would need to be split up into dedicated functions which are in turn unit testable. However, there are many parts which rely on some (global) state to trigger the exact behaviour.
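As a rough illustration of the assertion idea mentioned above, a thin wrapper can make the std::clamp precondition explicit without touching the standard library; the wrapper name is made up for this sketch.

#include <algorithm>
#include <cassert>

// Hypothetical wrapper: fail fast in debug builds when the std::clamp
// precondition (lo <= hi) is violated, instead of silently invoking
// undefined behaviour.
template<typename T>
const T& checked_clamp(const T& v, const T& lo, const T& hi) {
    assert(!(hi < lo) && "clamp called with hi < lo");
    return std::clamp(v, lo, hi);
}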