Job queue issues
@yvanoers proposed changes to the job queue in #11. I reverted these changes in release 1.0.2 after some local testing, as they resulted in longer solve times and an incomplete Pareto front. We should look into this more thoroughly to find out why.
Things to figure out:
- Why is the current implementation faster?
- Are other processes not able to help out as described by @yvanoers?
- Why do the proposed changes lead to an incomplete Pareto front?
Fascinating. In my initial testing I found a speed-up compared to the current implementation. Although there were other changes included, I would not expect those to have a profound effect on performance.
In any case, @wouterbles, could you perhaps add a test case that checks the completeness of the Pareto front? That would certainly help me with the investigation.
Interesting indeed. I have mainly been running the 3kp40/3kp50 cases, which are in the tests folder.
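A completeness check could be sketched roughly like this (the function name and the example front below are made up for illustration; the real test would compare the solver's output against the known 3kp40 front):

```python
# Hypothetical sketch of a Pareto-front completeness check; the real
# test would compare the solver's output against the known 3kp40 front.
def assert_front_complete(found, expected):
    """Fail with the missing points if any expected Pareto solution is absent."""
    missing = set(expected) - set(found)
    assert not missing, f"{len(missing)} Pareto solutions missing: {sorted(missing)}"

# Usage with a made-up expected front:
expected = {(1, 9), (3, 7), (6, 4)}
assert_front_complete(expected | {(0, 0)}, expected)  # extra points are fine
```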
I did some tests with both versions of the queueing. By the way, make sure to apply the latest commit that sets `longest_q` on top of your enqueue changes; without it I ran into issues. Results:
- With enqueue changes: Solved 1275 models for 358 unique Pareto solutions (should be 389)
- Without enqueue changes: Solved 868 models for 389 unique Pareto solutions
`pytest tests/test_3kp40.py` also fails with the enqueue changes, as it checks whether the correct solutions are returned.
@yvanoers created a fix for this issue in the latest PR; the bug was introduced by a refactor a while ago. Thanks for discovering the issue! Instead of queuing each grid point individually (as you proposed), grid points are grouped into 'rows', which means processes will take up bigger chunks of work.
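For anyone following along, a minimal sketch of the difference between the two queuing strategies (using `queue.Queue` for simplicity where the real code uses multiprocessing; the helper names here are invented):

```python
import queue

def enqueue_points(points):
    """Per-point queuing: every grid point is its own work item."""
    q = queue.Queue()
    for p in points:
        q.put(p)
    return q

def enqueue_rows(points, row_len):
    """Row-based queuing: each work item is a whole row of grid points,
    so a process takes a bigger chunk and keeps its row in order."""
    q = queue.Queue()
    for start in range(0, len(points), row_len):
        q.put(points[start:start + row_len])
    return q

grid = [(i, j) for i in range(3) for j in range(4)]
rows = enqueue_rows(grid, 4)
# a worker now dequeues a full row, e.g. [(0, 0), (0, 1), (0, 2), (0, 3)]
```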
Nice work!
I ran a test with my original changes and it found only 358 solutions.
Then I set the CPU count to 1 (thereby disabling multiprocessing) and it found 389.
Next, I re-enabled multiprocessing but disabled jumping (commented out the `continue` in the `if jump > 0` block of `SolverProcess`), and it also found 389.
So the algorithm needs to process its items in order and within the same process. When items are picked off a shared queue, a process can jump over items it is not supposed to skip.
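To illustrate what I think happens, here is a single-threaded, hypothetical simulation of two workers interleaving on a per-point queue (this is not the actual `SolverProcess` code):

```python
# Hypothetical simulation of the failure mode: each worker keeps a
# private `jump` counter and discards the next `jump` items it dequeues,
# even though, with interleaved workers, those are not the grid points
# that were meant to be skipped.
from collections import deque

work = deque(range(10))   # ten grid points, in order
jump = {0: 0, 1: 0}       # per-worker jump counters
processed = []

turn = 0
while work:
    item = work.popleft()
    if jump[turn] > 0:            # skip: this item is never solved
        jump[turn] -= 1
    else:
        processed.append(item)
        if item == 2:             # suppose solving point 2 says "jump 3"
            jump[turn] = 3
    turn = 1 - turn               # alternate which worker dequeues

# point 2's jump was meant to skip 3, 4, 5, but with two workers
# interleaving, worker 0 skips 4, 6, 8 instead, and 3 and 5 survive
```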
I don't understand the algorithm well enough to know whether your row-based approach avoids this. The `jump` variable is not reset when a new row is dequeued, so if a jump can span rows, this might still be a problem, just occurring less frequently.
That's definitely something to look into. I think it would make sense to reset the jump counter when taking a new block of work.
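Something like this, as a hypothetical worker-loop sketch (not the real `SolverProcess` loop):

```python
import queue

def process_rows(row_queue, solve):
    """Hypothetical worker loop: the jump counter is reset for every
    dequeued row, so a jump can never spill over into the next block."""
    while True:
        try:
            row = row_queue.get_nowait()
        except queue.Empty:
            return
        jump = 0                 # reset when taking a new block of work
        for point in row:
            if jump > 0:         # skip only within the current row
                jump -= 1
                continue
            jump = solve(point)  # solve may request skipping ahead
```

With this structure, a jump requested near the end of one row simply runs out when the row does, instead of eating into whatever row the process dequeues next.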