Document load under fleet
Just to keep some numbers for later reference, these are the average server-side times in ms when the fleet is present on (but not overloading) the server (50k cores), as obtained from the worker output (e.g. Post request https://tests.stockfishchess.org:443/api/request_task handled in 1719.69ms (server: 265.24ms)):
failed_task 13.6646
request_spsa 4.04172
request_task 149.143
request_version 0.241336
update_task 4.36213
upload_pgn 92.8058
The corresponding observations on the worker side (total round-trip time in ms):
failed_task 742.879
request_spsa 319.939
request_task 756.641
request_version 627.497
update_task 539.884
upload_pgn 1213.44
Server-side sampling of the stack traces, for calls not leading to a lock, looks like:
21
File "rundb.py", line 432, in flush_buffers self.runs.replace_one(
7
File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
File "rundb.py", line 738, in sync_request_task w = self.workerdb.get_worker(my_name)
File "workerdb.py", line 17, in get_worker r = self.workers.find_one(
5
File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
File "rundb.py", line 902, in sync_request_task committed_games += (
4
File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
File "rundb.py", line 870, in sync_request_task if connections >= connections_limit:
3
File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info)
File "rundb.py", line 727, in request_task return self.sync_request_task(worker_info)
File "rundb.py", line 837, in sync_request_task task_name = worker_name(task["worker_info"], short=True)
File "util.py", line 47, in worker_name name = "{}-{}cores".format(username, cores)
Essentially all stack traces go via request_task.
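For reference, this kind of sampling can be reproduced with a small standard-library helper along the following lines; this is only a sketch of the technique (the function name and intervals are made up), not the actual instrumentation used on the server:

import sys
import threading
import time
import traceback
from collections import Counter

def sample_stacks(duration=10.0, interval=0.05):
    # Periodically snapshot the stacks of all live threads and count
    # how often each (file, line, function) frame shows up.
    counts = Counter()
    end = time.time() + duration
    while time.time() < end:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == threading.get_ident():
                continue  # skip the sampling thread itself
            for fs in traceback.extract_stack(frame):
                counts[(fs.filename, fs.lineno, fs.name)] += 1
        time.sleep(interval)
    return counts

# Print the hottest frames, most frequent first.
for (filename, lineno, name), n in sample_stacks(duration=5.0).most_common(10):
    print(n, filename, lineno, name)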
And some API times of the server under load (IIRC 75k cores) at various stages of 'distress'. Medium:
# logging from [08/May/2024:18:26:11 +0000] to [08/May/2024:18:27:30 +0000] ########
# duration (seconds) : 79
# calls in total : 10000
# calls per second : 126
# calls not reaching the backend: 1
# calls handled by the backend : 9999
# route calls total average minimum maximum
/api/beat 5523 2100.429 0.380 0.001 4.014
/api/update_task 3154 723.048 0.229 0.002 4.014
/api/request_task 376 138.410 0.368 0.038 3.648
/api/request_version 371 62.196 0.168 0.000 3.232
/tests/view/ 45 45.010 1.000 0.025 3.406
/api/upload_pgn 367 19.217 0.052 0.007 0.289
/tests/tasks/ 18 6.594 0.366 0.045 2.577
/tests 9 6.145 0.683 0.029 0.936
/api/request_spsa 42 3.253 0.077 0.002 0.363
/tests/user/ 2 1.172 0.586 0.540 0.632
/api/actions 28 0.840 0.030 0.003 0.161
/api/nn/ 59 0.625 0.011 0.003 0.059
/contributors 1 0.293 0.293 0.293 0.293
5 request_task / s
high:
# logging from [08/May/2024:18:38:01 +0000] to [08/May/2024:18:39:06 +0000] ########
# duration (seconds) : 65
# calls in total : 10000
# calls per second : 153
# calls not reaching the backend: 1
# calls handled by the backend : 9999
# route calls total average minimum maximum
/api/beat 4854 12774.715 2.632 0.002 24.684
/api/update_task 2673 4104.362 1.535 0.002 24.689
/api/request_version 1711 1581.257 0.924 0.001 23.757
/api/request_task 583 1036.156 1.777 0.042 24.179
/api/get_elo/ 7 29.748 4.250 0.108 24.629
/tests 10 8.301 0.830 0.040 1.138
/api/request_spsa 31 7.610 0.245 0.005 0.739
/api/upload_pgn 57 5.803 0.102 0.008 3.694
/tests/live_elo/ 4 3.639 0.910 0.223 2.038
/tests/user/ 1 0.517 0.517 0.517 0.517
/api/actions 23 0.412 0.018 0.002 0.061
/api/nn/ 41 0.288 0.007 0.003 0.047
/tests/view/ 1 0.261 0.261 0.261 0.261
/tests/finished 1 0.121 0.121 0.121 0.121
9 request_task / s
too much:
# logging from [08/May/2024:18:42:28 +0000] to [08/May/2024:18:43:19 +0000] ########
# duration (seconds) : 51
# calls in total : 10000
# calls per second : 196
# calls not reaching the backend:
# calls handled by the backend : 10000
# route calls total average minimum maximum
/api/beat 4662 23688.011 5.081 0.002 30.029
/api/update_task 4074 20350.370 4.995 0.003 30.031
/api/request_version 767 4114.809 5.365 0.003 29.990
/api/request_task 341 1665.680 4.885 0.071 29.987
/api/request_spsa 66 70.501 1.068 0.010 12.693
/api/failed_task 9 48.770 5.419 0.012 29.984
/tests/view/ 4 7.470 1.867 0.175 6.781
/tests 9 6.444 0.716 0.022 1.260
/tests/run 2 5.032 2.516 1.115 3.917
/api/upload_pgn 32 1.834 0.057 0.008 0.272
/tests/user/ 2 1.174 0.587 0.498 0.676
/api/actions 22 0.470 0.021 0.007 0.136
7 request_task / s
If we go with the above timing for request_task, we can handle 1000 / 149 ~ 7 of them per second, so that's consistent with the numbers we have here (ignoring the timeouts at various rates for profiles 2 and 3).
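As a back-of-the-envelope check of that estimate (numbers taken from the measurements above):

server_ms = 149.143          # average server-side cost of request_task measured above
print(1000 / server_ms)      # ~6.7 request_task calls per second if they are effectively serialized
# The three access-log profiles above show 5, 9 and 7 request_task calls/s,
# i.e. the "high" and "too much" profiles sit at or above this ceiling,
# which is where the multi-second latencies show up.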
@vondele Can you explain why the internal throttling in Fishtest does nothing (at most five concurrent accesses to request_task)? I had PR #1987 to set the number to 1, but I closed it since I did not understand what was going on.
No, I don't know why that happens. Admittedly, I don't quite know how the frameworks around our Python code do things; if they only allowed one concurrent access to the same endpoint, it would be obvious.
https://github.com/official-stockfish/fishtest/blob/b204b37ec99e049481a1f3e955aaff23364c1775/server/fishtest/rundb.py#L681-L683 https://github.com/official-stockfish/fishtest/blob/b204b37ec99e049481a1f3e955aaff23364c1775/server/fishtest/rundb.py#L723-L732
It should queue only 4 requests and drop the others until a slot in the semaphore is released...
The entire thing is basically single-threaded from the start because of how Python does things. I think Pasquale has worked around that by configuring nginx so that certain endpoints are forwarded to different Python processes (I think we have 2 running)? So all requests to the same endpoint are handled sequentially without any concurrency, IIRC.
https://github.com/official-stockfish/fishtest/blob/b204b37ec99e049481a1f3e955aaff23364c1775/server/production.ini#L20-L36
Each of the 3 instances should have 4 threads, so the semaphore should never be exceeded.
PROD is running with 10 threads per instance. I will check the log for rejected requests.
If Disservin is correct then the # of threads is irrelevant because of the GIL.
EDIT: that's not quite right. Threads can switch between the execution of bytecodes.
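A minimal illustration of that point (nothing Fishtest-specific): with the GIL only one thread executes Python bytecode at a time, but the interpreter hands the GIL over every few milliseconds (sys.getswitchinterval(), 5 ms by default), so waitress worker threads do interleave rather than run strictly one after another:

import sys
import threading

print(sys.getswitchinterval())  # 0.005 s by default: how often the GIL is handed over

order = []

def spin(tag, rounds=5):
    # Pure-Python CPU work; the eval loop periodically forces a thread switch.
    for _ in range(rounds):
        x = 0
        for _ in range(100_000):
            x += 1
        order.append(tag)

threads = [threading.Thread(target=spin, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The markers are normally interleaved (e.g. [0, 1, 0, 2, ...]), showing
# that the threads took turns even though none of them ran in parallel.
print(order)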
let me add some workers to check this, I need to pause the progression tests for that.
import threading
import time

# Standalone reproduction of the request_task guard: a 4-slot semaphore
# acquired non-blockingly in front of a lock-protected critical section.
class Test:
    def __init__(self):
        self.task_lock = threading.Lock()
        self.task_semaphore = threading.Semaphore(4)

    def access_resource(self, thread_id):
        if self.task_semaphore.acquire(False):
            try:
                with self.task_lock:
                    print(f"Thread {thread_id} is accessing the resource.")
                    time.sleep(0.001)
            finally:
                self.task_semaphore.release()
                print(f"Thread {thread_id} has released the resource.")
        else:
            print(f"Thread {thread_id} could not acquire the semaphore, resource is too busy.")

test = Test()
threads = []
for i in range(1000):
    thread = threading.Thread(target=test.access_resource, args=(i,))
    threads.append(thread)
    thread.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
May 13 21:23:56 tests.stockfishchess.org pserve[26408]: request_task too busy
Just FYI, while the workers are ramping up, the initial phase has the following server timings:
request_spsa 11.578
request_task 39.5495
request_version 0.349775
update_task 6.072
2-3min later:
failed_task 32.0517
request_spsa 9.76163
request_task 55.4572
request_version 0.594289
update_task 3.42941
about 9min later
failed_task 36.864
request_spsa 7.39204
request_task 61.0517
request_version 0.543516
update_task 3.66163
upload_pgn 15.84
22min later:
failed_task 37.8486
request_spsa 9.56024
request_task 72.8793
request_version 0.499205
update_task 10.0572
upload_pgn 23.6664
34min later:
failed_task 38.7616
request_spsa 8.77399
request_task 80.093
request_version 0.464707
update_task 9.20768
upload_pgn 27.9647
It is clear that the request_task timing grows quickly with the server load. Typical worker output:
Post request https://tests.stockfishchess.org:443/api/request_task handled in 148.79ms (server: 14.41ms)
Post request https://tests.stockfishchess.org:443/api/request_task handled in 356.46ms (server: 197.66ms)
or
Post request https://tests.stockfishchess.org:443/api/request_task handled in 171.36ms (server: 13.02ms)
Post request https://tests.stockfishchess.org:443/api/request_task handled in 351.89ms (server: 184.89ms)
So in the ramp-up phase, with few workers active, we see about 14 ms (server-side) per call; later this is closer to 190 ms per call.
Well, I only mentioned the bits I've collected; it's also the first time I'm seeing the prod ini with the threads, so take everything from me with a grain of salt.
It's the first proof that waitress threads are for real :)
as we've done the check, I'll take the workers down.
As shown in https://github.com/official-stockfish/fishtest/issues/2003#issuecomment-2108825497, it looks clear that the request_task API gets increasingly expensive under load, going from roughly 15 to 190 ms per call between the first and the second call of a worker.
I think the number of connections in waitress should be equal to the number of threads (so that waitress does not queue connections) and then the value of the request_task semaphore should be adjusted so that Fishtest does not become overloaded.
For years we ran the server with connection_limit 500 and 16 threads. I didn't measure any improvement with 2000 machines / 17000 cores in PROD, so I restored the waitress defaults of connection_limit 100 and 4 threads in this commit: https://github.com/official-stockfish/fishtest/commit/b960e2b383112f365212cd2547ca170c527dbfb3 (see https://docs.pylonsproject.org/projects/waitress/en/stable/arguments.html).
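For context, these are the two waitress knobs being discussed; a minimal sketch of how they map onto a programmatic waitress.serve() call (the stand-in WSGI app and the listen address are illustrative, the values are the defaults mentioned above, and this is not the production wiring):

from waitress import serve

def app(environ, start_response):
    # Trivial WSGI app standing in for the Fishtest Pyramid application.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]

serve(
    app,
    listen="127.0.0.1:6543",
    threads=4,             # worker threads that actually execute application code
    connection_limit=100,  # connections waitress will accept; anything beyond the
                           # free threads just waits in waitress's internal queue
)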
In nginx we keep 64 connections to the upstream alive to avoid the cost of closing and opening connections (not sure the two configs are talking about the same "connection"):
upstream backend_6543 {
    server 127.0.0.1:6543;
    keepalive 64;
}
The thing is that 100 is too big.
Waitress will simply queue connections without giving Fishtest a chance to reject them. That's why we are seeing 30s for api calls. They are just queued in waitress.
EDIT: I mean too big compared to the number of worker threads.
We set 30 seconds in nginx:
location / {
    rewrite ^/$ /tests permanent;
    rewrite ^/tests/$ /tests permanent;
    proxy_set_header Connection "";
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Host $host:$server_port;
    proxy_set_header X-Forwarded-Port $server_port;
    proxy_set_header X-Country-Code $region;
    client_max_body_size 120m;
    client_body_buffer_size 128k;
    proxy_connect_timeout 5s;
    proxy_send_timeout 30s;
    proxy_read_timeout 30s; # timeout
    proxy_max_temp_file_size 0;
    proxy_redirect off;
    proxy_http_version 1.1;
    proxy_cache cache_zone;
    gunzip on;
    proxy_pass http://$backends;
    location /api/serialize_run/ {
        deny all;
    }
}
Yes nginx caps the connection time. But either waitress or Fishtest should reject the connections. It should not be left to nginx.
It's really the job of nginx. We can set connection limits and rates by host, uri, or binary_remote_addr, globally or even for any single path.
limit_conn_zone $host zone=perhost:10m;
limit_conn perhost 10;
limit_conn_status 429;
Well then nginx should limit the number of connections to waitress to the number of threads in Fishtest. Otherwise waitress will just queue the connections.
The point is we need to give Fishtest a chance to reject the request_task connections using the semaphore.
Note that Fishtest will reject the connection immediately. There will be no 30s timeout.
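For illustration, the kind of immediate rejection being argued for looks roughly like this; the names, the 4-slot limit and the error payload are placeholders mirroring the discussion above, not the actual Fishtest endpoint code:

import threading

task_semaphore = threading.Semaphore(4)

def request_task_view(worker_info):
    # Non-blocking acquire: if all slots are taken, answer right away
    # (e.g. mapped to HTTP 429) instead of letting the request sit in a
    # queue until nginx closes it after 30 s.
    if not task_semaphore.acquire(blocking=False):
        return {"error": "request_task too busy, retry later"}
    try:
        return do_request_task(worker_info)  # the expensive work guarded by the semaphore
    finally:
        task_semaphore.release()

def do_request_task(worker_info):
    # Stand-in for rundb.request_task(); returns a dummy task for the sketch.
    return {"task": {"worker": worker_info}}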
set 10 connections in nginx (like the 10 waitress threads), tested by opening 50 connections concurrently:
2024/05/14 00:52:57 [error] 13428#13428: *790 limiting connections by zone "per_host_conn", client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"
2024/05/14 00:52:57 [alert] 13429#13429: ignore long locked inactive cache entry 522ba7397b573b51acc828305a6d6a06, count:7
2024/05/14 00:53:59 [error] 13428#13428: *1075 limiting connections by zone "per_host_conn", client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"
2024/05/14 00:53:59 [error] 13428#13428: *1073 challenging request, client: ***.***.***.***, server: tests.stockfishchess.org, request: "GET /api/active_runs HTTP/2.0", host: "tests.stockfishchess.org"
set 30 connections in nginx and 10 connections in waitress (like the 10 threads). IMO the web works with queues to deal with spikes, and we don't want the workers making requests too fast when the server is overloaded...
Commented out the nginx limit.
That is of course true. But one can overdo it, which leads to bufferbloat (https://en.wikipedia.org/wiki/Bufferbloat). Currently we have an nginx queue, a waitress queue and a queue in Fishtest (of threads waiting for locks to clear), and it becomes hard to understand how they interact.
I think the waitress queue should be eliminated, since nginx is already queueing, so that Fishtest can take control.
Of course I may be wrong, but it seems natural to me that if Fishtest can handle, say, 7 request_task calls per second, it should be given the opportunity to reject the excess. Right now it does not have that opportunity, since it seems to me that most connections are queued by waitress until they are closed by nginx after 30 seconds.
I really think we should understand and fix why Fishtest can only handle so few request_task calls.
If we go by the profile in https://github.com/official-stockfish/fishtest/issues/2003#issue-2293498603, we see that it spends most of its time in:
- accessing the workerdb to get the worker info (might depend on the workload coming from the workers); might be hard to avoid?
- counting committed games by iterating over all tasks (grows with the number of workers) to avoid assigning too many
- counting connections, to see if the limit is reached, by iterating over active_tasks (grows with the number of workers)
- checking if a worker name is already connected by iterating over active_tasks (grows with the number of workers).
So, somehow 3 of the 4 hotspots are checks; can they be written or implemented differently / are they even needed at this point in the code?
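One possible direction for the last two hotspots, sketched here with made-up names (ConnectionTracker is not existing Fishtest code): keep per-host counters that are updated when a task is assigned or finished, so that request_task does an O(1) lookup instead of scanning all active tasks on every call:

import threading
from collections import Counter

class ConnectionTracker:
    # Maintains per-host connection counts incrementally, so the
    # "connections >= connections_limit" check becomes a dict lookup
    # instead of an iteration over all active tasks.
    def __init__(self):
        self._lock = threading.Lock()
        self._by_host = Counter()

    def task_assigned(self, remote_addr):
        with self._lock:
            self._by_host[remote_addr] += 1

    def task_finished(self, remote_addr):
        with self._lock:
            self._by_host[remote_addr] -= 1
            if self._by_host[remote_addr] <= 0:
                del self._by_host[remote_addr]

    def connections(self, remote_addr):
        with self._lock:
            return self._by_host[remote_addr]

# Inside request_task the check would then be a constant-time lookup:
#   if tracker.connections(remote_addr) >= connections_limit:
#       reject the request
# The same idea (an incrementally maintained counter) could also cover the
# committed-games count and the "worker name already connected" check.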
If we can't fix request_task, we should probably rate-limit it at the nginx level, while at the same time modifying the worker to catch that condition: not really as an error (i.e. no other messages sent to the server), but just as a signal for the worker to go to sleep and retry a while later.
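On the worker side, that handling could look something like the sketch below. The exact overload signal is an assumption (an HTTP 429 from nginx, or an "error" field echoing the "request_task too busy" message), and the function is not the real worker code, which has its own request helpers:

import random
import time

import requests

def request_task_with_backoff(url, payload, max_sleep=600):
    # Ask the server for a task; if it signals overload, sleep and retry
    # quietly instead of treating the response as an error.
    sleep = 10
    while True:
        response = requests.post(url, json=payload, timeout=60)
        busy = response.status_code == 429  # hypothetical rate-limit answer from nginx
        if not busy:
            response.raise_for_status()
            result = response.json()
            busy = result.get("error") == "request_task too busy"  # hypothetical API answer
        if not busy:
            return result
        # Back off with jitter so the whole fleet does not retry in lockstep.
        time.sleep(sleep + random.uniform(0, sleep))
        sleep = min(2 * sleep, max_sleep)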