Need to limit DB connections at the persistence layer, especially during reload/replay
I have a DB with a few thousand active graphs in it.
During replay at a shard restart, I'm seeing the following:
time="2017-11-08T15:57:39Z" level=info msg="Created graph actor supervisor-0/0246f0d7-da44-4de1-97e9-30afd2d6d998" graph_id=0246f0d7-da44-4de1-97e9-30afd2d6d998 logger=supervisor-0
time="2017-11-08T15:57:39Z" level=info msg="Created graph actor supervisor-0/35203cfc-7c0e-4216-847e-7bcc1869fa63" graph_id=35203cfc-7c0e-4216-847e-7bcc1869fa63 logger=supervisor-0
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/7a9f6777-ce90-4dc8-89c3-3846e83b0fac logger=persistence
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/71c3eba2-43f0-41e6-a9a5-1ef836b74513 logger=persistence
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/f4602309-139e-4fc5-b226-6f33a7528346 logger=persistence
Error 1040 indicates too many concurrent connections to the DB. It looks like we're opening these connections in parallel with no cap at the persistence layer.
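For reference, Go's database/sql pool can cap this at the persistence layer. A minimal sketch follows; the package name, function name, DSN handling and pool sizes are illustrative, not what flow currently does:

// Package persistence (illustrative name): bounding the DB connection pool.
package persistence

import (
	"database/sql"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver, registers the "mysql" driver name
)

// OpenBoundedDB opens the DB with a capped pool so that replay bursts queue
// on the pool instead of tripping MySQL's Error 1040.
func OpenBoundedDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	// Keep the pool well under the server's max_connections (500 in this deployment).
	db.SetMaxOpenConns(100)                // hard cap on concurrent connections
	db.SetMaxIdleConns(10)                 // keep a few warm for steady state
	db.SetConnMaxLifetime(5 * time.Minute) // recycle stale connections
	return db, nil
}

With this in place, extra callers block waiting for a pooled connection rather than failing the actor's recovery outright.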
That was followed (immediately after the initial replay) by:
2017/11/08 15:57:39 [MAILBOX] [ACTOR] Recovering actor="nonhost/supervisor-0/7a9f6777-ce90-4dc8-89c3-3846e83b0fac" reason="Error 1040: Too many connections" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110"
time="2017-11-08T15:57:39Z" level=warning msg="Stopping failed graph actor due to error: Error 1040: Too many connections" logger=graphmanager_actor
2017/11/08 15:57:39 [ACTOR] [SUPERVISION] time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/2a3dada2-acf3-40cb-94dc-736d02bfb624 error="Error 1040: Too many connections" logger=persistence
actor="nonhost/supervisor-0/7a9f6777-ce90-4dc8-89c3-3846e83b0fac" time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/a91835ca-51fe-4142-b519-7d3ae3f341d3 error="Error 1040: Too many connections" logger=persistence
directive="StopDirective" reason="Error 1040: Too many connections"2017/11/08 15:57:39 [MAILBOX] [ACTOR] Recovering
actor="nonhost/supervisor-0/2a3dada2-acf3-40cb-94dc-736d02bfb624" reason="Error 1040: Too many connections" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110"
time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/088b4b09-6f1b-4bb0-a543-1a52c67920ca error="Error 1040: Too many connections" logger=persistence
2017/11/08 15:57:39 [MAILBOX] [ACTOR] Recovering time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/71c3eba2-43f0-41e6-a9a5-1ef836b74513 error="Error 1040: Too many connections" logger=persistence
actor="nonhost/supervisor-0/a91835ca-51fe-4142-b519-7d3ae3f341d3" reason="Error 1040: Too many connections" time="2017-11-08T15:57:39Z" level=warning msg="Stopping failed graph actor due to error: Error 1040: Too many connections" logger=graphmanager_actor
2017/11/08 15:57:39 [ACTOR] [SUPERVISION] stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110" 2017/11/08 15:57:39 [MAILBOX]
After startup the flow service stabilises and normal operation is possible.
Here is the complete trace for one graph. (The actor is never materialised in this case.)
time="2017-11-08T15:57:39Z" level=info msg="Created graph actor supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0
time="2017-11-08T15:57:39Z" level=error msg="Error getting snapshot value from DB Error 1040: Too many connections" actor_name=supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0 logger=persistence
2017/11/08 15:57:39 [ACTOR] [SUPERVISION]reason="Error 1040: Too many connections" time="2017-11-08T15:57:39Z" level=error msg="Error getting events value from DB " actor_name=supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0 error="Error 1040: Too many connections" logger=persistence
actor="nonhost/supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" reason="Error 1040: Too many connections" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110"
2017/11/08 15:57:39 [ACTOR] [SUPERVISION] reason="Error 1040: Too many connections" actor="nonhost/supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).GetEvents:110" directive="StopDirective"
time="2017-11-08T15:57:39Z" level=info msg="Graph actor terminated" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0
time="2017-11-08T15:57:39Z" level=warning msg="Graph actor crashed" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0
As you might expect, curling for that graph's details after startup pulls it in later on:
curl http://10.0.20.4:8081/graph/8985ff69-d5b8-4a47-a288-127e07847ef0
{"stages":{"0":{"type":"terminationHook","status":"pending","dependencies":["_termination"]},"1":{"type":"completedValue","status":"successful"}},"function_id":"t/test","graph_id":"8985ff69-d5b8-4a47-a288-127e07847ef0"}
(Note: this graph was only partially constructed and not committed earlier.)
The kubectl logs from the flow service:
time="2017-11-08T16:35:30Z" level=debug msg="Got shard 0 for graph 8985ff69-d5b8-4a47-a288-127e07847ef0" logger=sharding
time="2017-11-08T16:35:30Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=cluster
time="2017-11-08T16:35:30Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=cluster
time="2017-11-08T16:35:30Z" level=info msg="Processing request locally" logger=cluster
time="2017-11-08T16:35:30Z" level=debug msg="Getting graph stage" logger=graphmanager_actor
time="2017-11-08T16:35:30Z" level=debug msg="Got shard 0 for graph 8985ff69-d5b8-4a47-a288-127e07847ef0" logger=sharding
time="2017-11-08T16:35:30Z" level=info msg="Created graph actor supervisor-0/8985ff69-d5b8-4a47-a288-127e07847ef0" graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=supervisor-0
time="2017-11-08T16:35:30Z" level=info msg="Adding termination hook" logger=graph stage_id=0
time="2017-11-08T16:35:30Z" level=info msg="Adding stage to graph" function_id=t/test graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=graph op=completedValue stage=1
time="2017-11-08T16:35:30Z" level=info msg="Completing node" function_id=t/test graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=graph stage_id=1 success=true
time="2017-11-08T16:35:30Z" level=info msg="Retrieved stages from storage" function_id=t/test graph_id=8985ff69-d5b8-4a47-a288-127e07847ef0 logger=graph
[GIN] 2017/11/08 - 16:35:30 | 200 | 543.781069ms | 10.99.24.0 | GET /graph/8985ff69-d5b8-4a47-a288-127e07847ef0
Here's a graph that was created during a flow load-test run. Again, the persistence layer is operating in parallel and we're bumping up against the DB connection limit (set to 500 in this deployment). A sketch of one way to bound the replay fan-out follows the trace below.
time="2017-11-08T16:39:50Z" level=info msg="Generated new graph ID 64df9229-de84-4416-b60e-808a36141c0c" logger=server
time="2017-11-08T16:39:50Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
time="2017-11-08T16:39:50Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:39:50Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:39:50Z" level=debug msg="Creating graph" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=graphmanager_actor
time="2017-11-08T16:39:50Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
time="2017-11-08T16:39:51Z" level=info msg="Created graph actor supervisor-0/64df9229-de84-4416-b60e-808a36141c0c" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=supervisor-0
time="2017-11-08T16:40:30Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
time="2017-11-08T16:40:30Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:40:30Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:40:30Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
[GIN] 2017/11/08 - 16:40:34 | 200 | 3.647872425s | 10.99.24.0 | POST /graph/64df9229-de84-4416-b60e-808a36141c0c/terminationHook
time="2017-11-08T16:40:34Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
time="2017-11-08T16:40:34Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:40:34Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:40:34Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
time="2017-11-08T16:40:40Z" level=info msg="Adding stage to graph" function_id=t/test graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=graph op=completedValue stage=1
time="2017-11-08T16:40:40Z" level=info msg="Completing node" function_id=t/test graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=graph stage_id=1 success=true
[GIN] 2017/11/08 - 16:40:40 | 200 | 5.511962778s | 10.99.24.0 | POST /graph/64df9229-de84-4416-b60e-808a36141c0c/completedValue
time="2017-11-08T16:40:40Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
time="2017-11-08T16:40:40Z" level=info msg="Resolved shard" cluster_shard=0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:40:40Z" level=info msg="Resolved node" cluster_node=flow-service-0 graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=cluster
time="2017-11-08T16:40:40Z" level=debug msg="Got shard 0 for graph 64df9229-de84-4416-b60e-808a36141c0c" logger=sharding
2017/11/08 16:40:40 [ACTOR] [SUPERVISION] 2017/11/08 16:40:40 [MAILBOX] [ACTOR] Recovering actor="nonhost/supervisor-0/64df9229-de84-4416-b60e-808a36141c0c" reason="Error 1040: Too many connections" actor="nonhost/supervisor-0/27b6e098-aa7d-43b8-882a-8768af8461c0" stacktrace="github.com/fnproject/flow/persistence.(*sqlProvider).PersistEvent:145"
2017/11/08 16:40:40 [ACTOR] [SUPERVISION] actor="nonhost/supervisor-0/64df9229-de84-4416-b60e-808a36141c0c" directive="StopDirective" reason="Error 1040: Too many connections"
time="2017-11-08T16:40:40Z" level=info msg="Graph actor terminated" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=supervisor-0
time="2017-11-08T16:40:40Z" level=warning msg="Graph actor crashed" graph_id=64df9229-de84-4416-b60e-808a36141c0c logger=supervisor-0
[GIN] 2017/11/08 - 16:41:40 | 500 | 1m0.002378519s | 10.99.24.0 | POST /graph/64df9229-de84-4416-b60e-808a36141c0c/invokeFunction?functionId=.%2Fsleep
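As mentioned above, besides capping the pool, the replay fan-out itself could be bounded. A minimal sketch using a counting semaphore; replayGraph, ReplayAll and the package name are hypothetical stand-ins, since the real recovery path lives in flow's persistence provider:

// Package replay (illustrative name): bounding replay concurrency on shard restart.
package replay

import (
	"log"
	"sync"
)

// replayGraph stands in for whatever recovers one graph (snapshot + events from the DB).
func replayGraph(graphID string) error {
	// ... read snapshot and events for graphID ...
	return nil
}

// ReplayAll recovers graphs with at most maxInFlight hitting the DB at once,
// so a shard restart with a few thousand graphs cannot exhaust max_connections.
func ReplayAll(graphIDs []string, maxInFlight int) {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var wg sync.WaitGroup
	for _, id := range graphIDs {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot before touching the DB
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := replayGraph(id); err != nil {
				log.Printf("replay of graph %s failed: %v", id, err)
			}
		}(id)
	}
	wg.Wait()
}

Something along these lines (or an equivalent limit inside the sqlProvider) would keep both the restart replay and load-test bursts under the DB's connection ceiling.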