noobaa-core
S3 and WebUI not responsive after heavy load by NoobaaFunctions
Environment info
- NooBaa Version: master+30ee3535b5, built in-house with additional dependencies
- Platform: OKD 4.5
Actual behavior
The WebUI stops working after a heavy load produced by NooBaa Function calls. The limit seems to be somewhere around 5k events handled. Until this limit is hit, the WebUI is responsive and "Function Statistics" are displayed correctly. After hitting this "limit", the WebUI does not display any information.
During this situation, no S3 connections to NooBaa are possible.
After some time, around 10 minutes, the WebUI is reachable again, but "Function Statistics" does not display anything.
Expected behavior
There should be no limit on the number of events for which "Function Statistics" are displayed.
Steps to reproduce
- Write a minimal handler function and call it more than 5k times within the displayed time window (60 min).
More information - Screenshots / Logs / Other output
We have a static endpoint count of five, with resource limits sized for the load caused by the function calls (the limit is 4 cores / 8Gi per endpoint).
noobaa-core is running with limits of 1 core / 4Gi.
Load scenario during the upload of files: (screenshot)
WebUI with the last update of "Function Statistics" just before it crashes: (screenshot)
The WebUI is no longer updated dynamically, and after reloading the page we get: (screenshot)
There is no heavy load on the pods during this absence of the WebUI. This is the load while the WebUI is not working and no load is on the endpoints: (screenshot)
This situation persists for at least 10 minutes before the WebUI works properly again, but "Function Statistics" is still not displayed.
It would be great if you could point us at something to look for when we reach this situation.
We are hitting this issue more and more often and it is becoming a serious problem. It would be awesome to get some hints on where to keep digging for the root cause.
We just noticed some strange logs on the core node:
An error in `read_usage_gridfs`:
Nov-20 19:30:00.332 [HostedAgents/32] [ERROR] core.agent.block_store_services.block_store_mongo:: read_usage_gridfs had error: TypeError: Cannot read property 'stats' of undefined
at /root/node_modules/noobaa-core/src/agent/block_store_services/block_store_mongo.js:72:78
at tryCatcher (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:547:31)
at Promise._settlePromise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:604:18)
at Promise._settlePromiseCtx (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:641:10)
at _drainQueueStep (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/async.js:97:12)
at _drainQueue (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/async.js:86:9)
at Async._drainQueues (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/async.js:102:5)
at Immediate.Async.drainQueues [as _onImmediate] (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/async.js:15:14)
at processImmediate (internal/timers.js:456:21)
at process.topLevelDomainCallback (domain.js:137:15)
And errors during fetching of metrics:
Nov-20 19:29:47.874 [WebServer/40] [ERROR] CONSOLE:: ERROR: { status: 404, message:"We dug the earth, but couldn't find /metrics" } ::ffff:100.122.42.11 - - [20/Nov/2020:19:29:47 +0000] "GET /metrics HTTP/1.1" 500 - "-" "Prometheus/2.19.2"
Hey @QuintinBecker
The error on block_store_mongo read_usage_gridfs can be ignored. In order to filter errors from the logs I would start by grouping the errors by source module like this:
cat log | grep ERROR | cut -d' ' -f 5 | sort | uniq -c | sort -n
It will require some iterations to get meaningful data: we might need to run this on the endpoint logs and perhaps the core logs, we might want to switch between ERROR and WARN, and we might cut different fields from the log lines to group by.
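For instance, on a synthetic sample log the grouping step looks like this (the sample lines and file path are made up for illustration; in this log format the 5th space-separated field holds the source module):

```shell
# Build a tiny sample log in the format shown above (illustrative data only).
cat > /tmp/noobaa_sample.log <<'EOF'
Nov-20 19:30:00.332 [HostedAgents/32] [ERROR] core.agent.block_store_services.block_store_mongo:: read_usage_gridfs had error
Nov-20 19:30:01.100 [WebServer/40] [ERROR] core.rpc.rpc:: RPC._request: response ERROR
Nov-20 19:30:02.500 [WebServer/40] [ERROR] core.rpc.rpc:: RPC._request: response ERROR
EOF

# Group ERROR lines by the 5th field (the source module), most frequent last.
grep ERROR /tmp/noobaa_sample.log | cut -d' ' -f5 | sort | uniq -c | sort -n
```

Swapping ERROR for WARN, or changing `-f5` to another field, re-groups the same data along a different axis; that is the iteration loop described above.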
Let me know if this yields anything. Thanks
Update: after some more investigation work by @QuintinBecker we saw that this issue was not reproducible when using the DB pod that the operator deploys. He found that the external DB he uses is MongoDB 4.2, while our default DB image uses MongoDB 3.6. This version difference could very well be the root cause, but not necessarily. Thanks Quintin!
We have been working on this issue over the past few days and have built test environments in the cluster with both internal and external DBs running.
In general we moved towards an external DB to make the setup more stable and highly available, so we hope to get the setup running with the external DB (if this really is the root cause of the issue).
Yesterday we had an issue on a deployment with an internal DB causing a fatal error within the noobaa-core pod. The noobaa-db-0 pod had one restart; unfortunately we did not save the logs in time. At that point NooBaa had been running for about 20 minutes, it was a fresh deployment, and all pods had plenty of resources available. Load was created by uploading a test set of 5k files of various sizes below 100kb each and one minimal NooBaa function.
Nov-30 14:57:57.716 [BGWorkers/5883] [L0] core.server.bg_services.cluster_master:: no local cluster info or server is not part of a cluster. therefore will be cluster master
Nov-30 14:57:57.718 [WebServer/5884] [L0] core.server.system_services.redirector:: publish_to_cluster: server_inter_process update_master_change { is_master: true } [ 'ws://[::ffff:127.0.0.1]:51580/(49c6p9s)', 'ws://[::ffff:127.0.0.1]:51584/(49khg1s)', 'fcall://fcall(4att1rg)', 'ws://[::ffff:1.2.3.4]:35784/(4fy9uno)', 'ws://[::ffff:1.2.3.4]:40612/(4g7p7ed)', 'ws://[::ffff:1.2.3.4]:42262/(4gap10e)', 'ws://[::ffff:1.2.3.4]:34416/(4imteuc)', 'ws://[::ffff:1.2.3.4]:57846/(4nndwwv)', 'ws://[::ffff:1.2.3.4]:52436/(5iwp04g)' ]
Nov-30 14:57:59.681 [BGWorkers/5883] [L0] core.server.bg_services.scrubber:: SCRUBBER: BEGIN
Nov-30 14:57:59.683 [BGWorkers/5883] [ERROR] core.server.bg_services.scrubber:: SCRUBBER: ERROR MongoError: Topology was destroyed
at initializeCursor (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:596:25)
at nextFunction (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:456:12)
at Cursor.next (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:766:3)
at Cursor._next (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:216:36)
at fetchDocs (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:217:12)
at toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:247:3)
at /root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:433:24
at Promise._execute (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/debuggability.js:384:9)
at Promise._resolveFromExecutor (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:518:18)
at new Promise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:103:10)
at executeOperation (/root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:428:10)
at Cursor.toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:829:10)
at MongoCollection.find (/root/node_modules/noobaa-core/src/util/mongo_client.js:59:89)
at MDStore.iterate_all_chunks (/root/node_modules/noobaa-core/src/server/object_services/md_store.js:1305:29)
at Object.background_worker [as run_batch] (/root/node_modules/noobaa-core/src/server/bg_services/scrubber.js:36:64)
at /root/node_modules/noobaa-core/src/util/background_scheduler.js:41:39 {
[Symbol(mongoErrorContextSymbol)]: {}
} MongoError: Topology was destroyed
at initializeCursor (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:596:25)
at nextFunction (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:456:12)
at Cursor.next (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:766:3)
at Cursor._next (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:216:36)
at fetchDocs (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:217:12)
at toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:247:3)
at /root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:433:24
at Promise._execute (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/debuggability.js:384:9)
at Promise._resolveFromExecutor (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:518:18)
at new Promise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:103:10)
at executeOperation (/root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:428:10)
at Cursor.toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:829:10)
at MongoCollection.find (/root/node_modules/noobaa-core/src/util/mongo_client.js:59:89)
at MDStore.iterate_all_chunks (/root/node_modules/noobaa-core/src/server/object_services/md_store.js:1305:29)
at Object.background_worker [as run_batch] (/root/node_modules/noobaa-core/src/server/bg_services/scrubber.js:36:64)
at /root/node_modules/noobaa-core/src/util/background_scheduler.js:41:39
Nov-30 14:58:00.084 [WebServer/5884] [ERROR] CONSOLE:: RPC._on_request: ERROR srv node_api.aggregate_nodes reqid 281@wss://127.0.0.1:8443/(5bab8wx) connid ws://[::ffff:127.0.0.1]:51612/(5bc4s0z) MongoError: Topology was destroyed
at initializeCursor (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:596:25)
at nextFunction (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:456:12)
at AggregationCursor.Cursor.next (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:766:3)
at AggregationCursor.Cursor._next (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:216:36)
at fetchDocs (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:217:12)
at toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:247:3)
at /root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:433:24
at Promise._execute (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/debuggability.js:384:9)
at Promise._resolveFromExecutor (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:518:18)
at new Promise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:103:10)
at executeOperation (/root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:428:10)
at AggregationCursor.Cursor.toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:829:10)
at MongoCollection.groupBy (/root/node_modules/noobaa-core/src/util/mongo_client.js:69:114)
at IoStatsStore._get_stats_for_resource_type (/root/node_modules/noobaa-core/src/server/analytic_services/io_stats_store.js:98:31)
at IoStatsStore.get_all_nodes_stats (/root/node_modules/noobaa-core/src/server/analytic_services/io_stats_store.js:77:21)
at NodesMonitor.get_nodes_stats (/root/node_modules/noobaa-core/src/server/node_services/nodes_monitor.js:2794:59)
Nov-30 14:58:00.085 [WebServer/5884] [ERROR] core.rpc.rpc:: RPC._request: response ERROR srv node_api.aggregate_nodes params { query: { pools: [ 'noobaa-default-backing-store', [length]: 1 ], skip_cloud_nodes: undefined, skip_mongo_nodes: undefined }, group_by: 'pool' } reqid 281@wss://127.0.0.1:8443/(5bab8wx) took [0.9+1.0=1.9] Error: Topology was destroyed
at RpcRequest._set_response (/root/node_modules/noobaa-core/src/rpc/rpc_request.js:130:26)
at RPC._on_response (/root/node_modules/noobaa-core/src/rpc/rpc.js:423:32)
at RPC._on_message (/root/node_modules/noobaa-core/src/rpc/rpc.js:759:22)
at RpcWsConnection.<anonymous> (/root/node_modules/noobaa-core/src/rpc/rpc.js:594:48)
at RpcWsConnection.emit (events.js:315:20)
at RpcWsConnection.EventEmitter.emit (domain.js:482:12)
at RpcWsConnection.<anonymous> (/root/node_modules/noobaa-core/src/rpc/rpc_base_conn.js:75:22)
at RpcWsConnection.emit (events.js:315:20)
at RpcWsConnection.EventEmitter.emit (domain.js:482:12)
at WebSocket.<anonymous> (/root/node_modules/noobaa-core/src/rpc/rpc_ws.js:45:53)
at WebSocket.emit (events.js:315:20)
at WebSocket.EventEmitter.emit (domain.js:482:12)
at Receiver.receiverOnMessage (/root/node_modules/noobaa-core/node_modules/ws/lib/websocket.js:800:20)
at Receiver.emit (events.js:315:20)
at Receiver.EventEmitter.emit (domain.js:482:12)
at Receiver.dataMessage (/root/node_modules/noobaa-core/node_modules/ws/lib/receiver.js:414:14)
::ffff:1.2.3.4 - - [30/Nov/2020:14:58:00 +0000] "GET /metrics/bg_workers HTTP/1.1" 200 - "-" "Prometheus/2.19.2"
Nov-30 14:58:00.690 [BGWorkers/5883] [L0] core.server.bg_services.agent_blocks_reclaimer:: AGENT_BLOCKS_RECLAIMER: BEGIN
Nov-30 14:58:00.691 [BGWorkers/5883] [ERROR] core.server.bg_services.agent_blocks_reclaimer:: AGENT_BLOCKS_RECLAIMER: ERROR MongoError: Topology was destroyed
at initializeCursor (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:596:25)
at nextFunction (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:456:12)
at Cursor.next (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:766:3)
at Cursor._next (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:216:36)
at fetchDocs (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:217:12)
at toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:247:3)
at /root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:433:24
at Promise._execute (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/debuggability.js:384:9)
at Promise._resolveFromExecutor (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:518:18)
at new Promise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:103:10)
at executeOperation (/root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:428:10)
at Cursor.toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:829:10)
at MongoCollection.find (/root/node_modules/noobaa-core/src/util/mongo_client.js:59:89)
at MDStore.iterate_all_blocks (/root/node_modules/noobaa-core/src/server/object_services/md_store.js:1532:29)
at AgentBlocksReclaimer.iterate_all_blocks (/root/node_modules/noobaa-core/src/server/bg_services/agent_blocks_reclaimer.js:95:35)
at AgentBlocksReclaimer.run_agent_blocks_reclaimer (/root/node_modules/noobaa-core/src/server/bg_services/agent_blocks_reclaimer.js:39:39) {
[Symbol(mongoErrorContextSymbol)]: {}
} MongoError: Topology was destroyed
at initializeCursor (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:596:25)
at nextFunction (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:456:12)
at Cursor.next (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:766:3)
at Cursor._next (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:216:36)
at fetchDocs (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:217:12)
at toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:247:3)
at /root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:433:24
at Promise._execute (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/debuggability.js:384:9)
at Promise._resolveFromExecutor (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:518:18)
at new Promise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:103:10)
at executeOperation (/root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:428:10)
at Cursor.toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:829:10)
at MongoCollection.find (/root/node_modules/noobaa-core/src/util/mongo_client.js:59:89)
at MDStore.iterate_all_blocks (/root/node_modules/noobaa-core/src/server/object_services/md_store.js:1532:29)
at AgentBlocksReclaimer.iterate_all_blocks (/root/node_modules/noobaa-core/src/server/bg_services/agent_blocks_reclaimer.js:95:35)
at AgentBlocksReclaimer.run_agent_blocks_reclaimer (/root/node_modules/noobaa-core/src/server/bg_services/agent_blocks_reclaimer.js:39:39)
Nov-30 14:58:01.253 [WebServer/5884] [ERROR] CONSOLE:: RPC._on_request: ERROR srv system_api.read_system reqid wss://noobaa-mgmt.noobaa.svc.cluster.local:443/rpc/-169 connid ws://[::ffff:1.2.3.4]:52436/(5iwp04g) MongoError: Topology was destroyed
at initializeCursor (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:596:25)
at nextFunction (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:456:12)
at AggregationCursor.Cursor.next (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:766:3)
at AggregationCursor.Cursor._next (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:216:36)
at fetchDocs (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:217:12)
at toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:247:3)
at /root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:433:24
at Promise._execute (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/debuggability.js:384:9)
at Promise._resolveFromExecutor (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:518:18)
at new Promise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:103:10)
at executeOperation (/root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:428:10)
at AggregationCursor.Cursor.toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:829:10)
at MongoCollection.groupBy (/root/node_modules/noobaa-core/src/util/mongo_client.js:69:114)
at BucketStatsStore.get_all_buckets_stats (/root/node_modules/noobaa-core/src/server/analytic_services/bucket_stats_store.js:56:35)
at Object.read_system (/root/node_modules/noobaa-core/src/server/system_services/system_server.js:499:52)
at Object.server_func (/root/node_modules/noobaa-core/src/rpc/rpc.js:103:48)
Nov-30 14:58:01.258 [WebServer/5884] [L0] core.server.node_services.nodes_monitor:: could not find node for root path, taking the first in the list. drives = [object Object]
Nov-30 14:58:01.260 [WebServer/5884] [ERROR] CONSOLE:: RPC._on_request: ERROR srv func_api.list_funcs reqid 116@fcall://fcall(4att1rg) connid fcall://fcall(4att1rg) MongoError: Topology was destroyed
at initializeCursor (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:596:25)
at nextFunction (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:456:12)
at Cursor.next (/root/node_modules/noobaa-core/node_modules/mongodb-core/lib/cursor.js:766:3)
at Cursor._next (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:216:36)
at fetchDocs (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:217:12)
at toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/operations/cursor_ops.js:247:3)
at /root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:433:24
at Promise._execute (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/debuggability.js:384:9)
at Promise._resolveFromExecutor (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:518:18)
at new Promise (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:103:10)
at executeOperation (/root/node_modules/noobaa-core/node_modules/mongodb/lib/utils.js:428:10)
at Cursor.toArray (/root/node_modules/noobaa-core/node_modules/mongodb/lib/cursor.js:829:10)
at MongoCollection.find (/root/node_modules/noobaa-core/src/util/mongo_client.js:59:89)
at /root/node_modules/noobaa-core/src/server/func_services/func_store.js:101:57
at tryCatcher (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/root/node_modules/noobaa-core/node_modules/bluebird/js/release/promise.js:547:31)
@QuintinBecker Thanks! Did you observe mongo seg faults as seen here - https://github.com/noobaa/noobaa-core/issues/5971 ?
We suspected it was because we had to stay on mongo 3.6 to keep using the AGPL license (instead of the new SSPL license). We haven't tried it, but it should be easy to test with noobaa install --db-image mongo:3.6.21
which uses a newer container image provided by MongoDB under the SSPL license. Unless your business is selling "mongo as a service", SSPL is effectively the same as AGPL. I'm not sure whether we will need to customize the command or envs to make it work, though.
@guymguym Unfortunately the MongoDB logs have vanished. I will try to reproduce them and look for a seg fault.
Nevertheless, the updated image does not work and only outputs:
bash: /opt/rh/rh-mongodb36/root/usr/bin/mongod: No such file or directory
Do you have another image we could try?
One image we tried for debugging this was created by @jackyalbo here - https://hub.docker.com/r/jalbo/mongodbg3.6.3 I don't recall exactly what debugging we enabled there - maybe Jacky can say.
Another option is to update the noobaa-db sts to run the container with the proper command for the image and our configs.
You can see the command deployed by the operator here - which assumes this is the centos7-mongo36 image: https://github.com/noobaa/noobaa-operator/blob/be557303aaa4ef66ef00dd9061cd4083277ded61/deploy/internal/statefulset-db.yaml#L43-L51
#--------------------#
# DATABASE CONTAINER #
#--------------------#
- name: db
image: NOOBAA_DB_IMAGE
command:
- bash
- -c
- /opt/rh/rh-mongodb36/root/usr/bin/mongod --port 27017 --bind_ip_all --dbpath /data/mongo/cluster/shard1
But for the official mongo images we probably need to use kubectl edit sts noobaa-db
to switch the container from command to args (untested) -
- name: db
image: mongo:
args:
- "--port"
- "27017"
- "--bind_ip_all"
- "--dbpath"
- "/data/mongo/cluster/shard1"
resources:
...
Actually I think you can do a simpler thing and just remove the full path to mongod from the command in the noobaa-db sts (/opt/rh/rh-mongodb36/root/usr/bin/) because it's already on the shell PATH.
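A minimal sketch of that path-stripping change, shown locally with sed (illustrative only; on the cluster the same edit would be made to the command field via kubectl edit sts noobaa-db):

```shell
# The command currently deployed by the operator (see the sts yaml above).
cmd='/opt/rh/rh-mongodb36/root/usr/bin/mongod --port 27017 --bind_ip_all --dbpath /data/mongo/cluster/shard1'

# Drop the SCL-specific directory prefix; with mongod on the shell PATH,
# the bare binary name works in both the centos7-mongo36 and official images.
echo "$cmd" | sed 's|/opt/rh/rh-mongodb36/root/usr/bin/||'
# -> mongod --port 27017 --bind_ip_all --dbpath /data/mongo/cluster/shard1
```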
@guymguym I just reproduced it with the normal MongoDB image and caught the logs. It is a segmentation fault which seems to be linked to "func_stats", see here:
2020-12-07T07:25:57.126+0000 I NETWORK [conn35] received client metadata from 100.122.41.231:49246 conn: { driver: { name: "nodejs", version: "3.2.7" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "5.6.19-300.fc32.x86_64" }, platform: "Node.js v12.18.2, LE, mongodb-core: 3.2.7" }
2020-12-07T08:17:52.932+0000 I COMMAND [conn32] command nbcore.func_stats command: mapReduce { mapreduce: "func_stats", map: "function map() {
const key = Math.floor(this.time.valueOf() / step) * step;
const res = this.error ? {
invoked: 1,
...", reduce: "function reduce(key, values) {
const reduced = values.reduce((bin, other) => {
bin.invoked += other.invoked;
bin.fulfi...", finalize: "function finalize(key, bin) {
const response_times = bin.completed_response_times
.sort((a, b) => a - b);
return {
...", query: { system: ObjectId('5fca19f0ab1a5d002871d9a9'), func: ObjectId('5fca1a83ab1a5d002871d9d6'), time: { $gte: new Date(1607325600000), $lt: new Date(1607329200000) } }, scope: { max_samples: 10000, percentiles: [ 0.5, 0.9, 0.99 ], step: 300000 }, out: { inline: 1 }, lsid: { id: UUID("537f4f37-678f-414b-8612-9db772744e9d") }, $db: "nbcore" } planSummary: IXSCAN { system: 1, id: 1, latency_ms: 1 } keysExamined:783 docsExamined:783 numYields:6 reslen:2385 locks:{ Global: { acquireCount: { r: 34 } }, Database: { acquireCount: { r: 2, R: 15 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 102ms
2020-12-07T08:18:34.008+0000 I COMMAND [conn33] command nbcore.func_stats command: mapReduce { mapreduce: "func_stats", map: "function map() {
const key = Math.floor(this.time.valueOf() / step) * step;
const res = this.error ? {
invoked: 1,
...", reduce: "function reduce(key, values) {
const reduced = values.reduce((bin, other) => {
bin.invoked += other.invoked;
bin.fulfi...", finalize: "function finalize(key, bin) {
const response_times = bin.completed_response_times
.sort((a, b) => a - b);
return {
...", query: { system: ObjectId('5fca19f0ab1a5d002871d9a9'), func: ObjectId('5fca1a83ab1a5d002871d9d6'), time: { $gte: new Date(1607325600000), $lt: new Date(1607329200000) } }, scope: { max_samples: 10000, percentiles: [ 0.5, 0.9, 0.99 ], step: 300000 }, out: { inline: 1 }, lsid: { id: UUID("537f4f37-678f-414b-8612-9db772744e9d") }, $db: "nbcore" } planSummary: IXSCAN { system: 1, id: 1, latency_ms: 1 } keysExamined:939 docsExamined:939 numYields:7 reslen:2385 locks:{ Global: { acquireCount: { r: 40 } }, Database: { acquireCount: { r: 2, R: 18 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 103ms
..........
2020-12-07T08:30:31.066+0000 F - [conn32] Invalid access at address: 0
2020-12-07T08:30:31.067+0000 F - [conn32] Got signal: 11 (Segmentation fault).
0x5579227f7f1a 0x5579227f727e 0x5579227f78cc 0x557921930726 0x7f0ad967a5d0 0x557921e042b0 0x557921def856 0x557921e0995b 0x557921e0a02d 0x557921e0ac70 0x557921beb416 0x557921df9cfa 0x557921c29c75 0x557921c2a9cd 0x557921b3e619 0x557921b41532 0x557921b49cdd 0x557921b4b14d 0x557921b5b819 0x3ad7ed2fa26d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"557920CE5000","o":"1B12F1A"},{"b":"557920CE5000","o":"1B1227E"},{"b":"557920CE5000","o":"1B128CC"},{"b":"557920CE5000","o":"C4B726"},{"b":"7F0AD966B000","o":"F5D0"},{"b":"557920CE5000","o":"111F2B0"},{"b":"557920CE5000","o":"110A856"},{"b":"557920CE5000","o":"112495B"},{"b":"557920CE5000","o":"112502D"},{"b":"557920CE5000","o":"1125C70"},{"b":"557920CE5000","o":"F06416"},{"b":"557920CE5000","o":"1114CFA"},{"b":"557920CE5000","o":"F44C75"},{"b":"557920CE5000","o":"F459CD"},{"b":"557920CE5000","o":"E59619"},{"b":"557920CE5000","o":"E5C532"},{"b":"557920CE5000","o":"E64CDD"},{"b":"557920CE5000","o":"E6614D"},{"b":"557920CE5000","o":"E76819"},{"b":"0","o":"3AD7ED2FA26D"}],"processInfo":{ "mongodbVersion" : "3.6.3", "gitVersion" : "nogitversion", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "5.6.19-300.fc32.x86_64", "version" : "#1 SMP Wed Jun 17 16:10:48 UTC 2020", "machine" : "x86_64" }, "somap" : [ { "b" : "557920CE5000", "elfType" : 3, "buildId" : "994546E1D7E0DB84D44F8FFCE44321A3F942C8B6" }, { "b" : "7FFEB8347000", "elfType" : 3, "buildId" : "FD19942F7D3FCF42FE74D85D4100187FF27113D7" }, { "b" : "7F0ADC546000", "path" : "/opt/rh/rh-mongodb36/root/usr/lib64/libstemmer.so.rh-mongodb36-0", "elfType" : 3, "buildId" : "95E1975085E3E88948957C518EC54CF19A180AF8" }, { "b" : "7F0ADC330000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "B9D5F73428BD6AD68C96986B57BEA3B7CEDB9745" }, { "b" : "7F0ADC12A000", "path" : "/lib64/libsnappy.so.1", "elfType" : 3, "buildId" : "3CEB901120465B031FA92FA449A9AD981CEC1659" }, { "b" : "7F0ADBEA4000", "path" : "/opt/rh/rh-mongodb36/root/usr/lib64/libyaml-cpp.so.rh-mongodb36-0.5", "elfType" : 3, "buildId" : "6F5B95F01DE486B0DD06C8281E3BD6443A4006EC" }, { "b" : "7F0ADBC42000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "9CA3D11F018BEEB719CDB34BE800BF1641350D0A" }, { "b" : "7F0ADBA39000", "path" : "/lib64/libpcrecpp.so.0", "elfType" : 3, "buildId" : 
"8BBAA8A5638DCB0C4B523FCA4B54613AE543BDF1" }, { "b" : "7F0ADB7C6000", "path" : "/opt/rh/rh-mongodb36/root/usr/lib64/libboost_program_options.so.rh-mongodb36-1.60.0", "elfType" : 3, "buildId" : "FEEEC1BDA6987D52F313877C6C4F216E4967C751" }, { "b" : "7F0ADB5AF000", "path" : "/opt/rh/rh-mongodb36/root/usr/lib64/libboost_filesystem.so.rh-mongodb36-1.60.0", "elfType" : 3, "buildId" : "42F19F782EF63CE1C7FEEEA0FF8A34129359AE14" }, { "b" : "7F0ADB3AB000", "path" : "/opt/rh/rh-mongodb36/root/usr/lib64/libboost_system.so.rh-mongodb36-1.60.0", "elfType" : 3, "buildId" : "57C98F0E18CF8811AEED5ABF8514691E3998D539" }, { "b" : "7F0ADB193000", "path" : "/opt/rh/rh-mongodb36/root/usr/lib64/libboost_iostreams.so.rh-mongodb36-1.60.0", "elfType" : 3, "buildId" : "269307C83AA2DA328D8FA46D78B010D9C333904B" }, { "b" : "7F0ADAD9E000", "path" : "/opt/rh/rh-mongodb36/root/usr/lib64/libtcmalloc.so.rh-mongodb36-4", "elfType" : 3, "buildId" : "3988EC196424AFCB90C4470704FB84B0DA4AE5E6" }, { "b" : "7F0ADAB85000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "4C488F6E7044BB966162C1F7081ABBA6EBB2B485" }, { "b" : "7F0ADA913000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "AEF5E6F2240B55F90E9DF76CFBB8B9D9F5286583" }, { "b" : "7F0ADA4B2000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "8BD89856B64DD5189BF075EF574EDF203F93D44A" }, { "b" : "7F0ADA2AA000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "EFDE2029C9A4A20BE5B8D8AE7E6551FF9B5755D2" }, { "b" : "7F0ADA0A6000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "67AD3498AC7DE3EAB952A243094DF5C12A21CD7D" }, { "b" : "7F0AD9D9F000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "A357BA16E3E88D692E5D0397B975693A4A382BE1" }, { "b" : "7F0AD9A9D000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "918D3696BF321AA8D32950AB2AB8D0F1B21AC907" }, { "b" : "7F0AD9887000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : 
"6B4F3D896CD0F06FCB3DEF0245F204ECE3220D7E" }, { "b" : "7F0AD966B000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "3D9441083D079DC2977F1BD50C8068D11767232D" }, { "b" : "7F0AD929E000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "3C61131D1DAC9DA79B73188E7702BEF786C2AD54" }, { "b" : "7F0AD908E000", "path" : "/lib64/libbz2.so.1", "elfType" : 3, "buildId" : "0C85C0386F0CF41EA39969CF7F58A558D1AD3235" }, { "b" : "7F0ADC799000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5DA2D47925497B2F5875A7D8D1799A1227E2FDE4" }, { "b" : "7F0AD8E41000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "B5C83BDE7ED7026835B779FA0F957FCCCD599F40" }, { "b" : "7F0AD8B58000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "8B63976509135BA73A12153D6FDF7B3B9E5D2A54" }, { "b" : "7F0AD8954000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "B4BE1023D9606A88169DF411BF94AF417D7BA1A0" }, { "b" : "7F0AD8739000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "6183129B5F29CA14580E517DF94EF317761FA6C9" }, { "b" : "7F0AD852A000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "98F619035053EF68358099CE7CF1AA528B3B229D" }, { "b" : "7F0AD8326000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, { "b" : "7F0AD80FF000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "903A0BD0BFB4FEE8C284F41BEB9773DED94CBC52" } ] }}
mongod(+0x1B12F1A) [0x5579227f7f1a]
mongod(+0x1B1227E) [0x5579227f727e]
mongod(+0x1B128CC) [0x5579227f78cc]
mongod(+0xC4B726) [0x557921930726]
libpthread.so.0(+0xF5D0) [0x7f0ad967a5d0]
mongod(+0x111F2B0) [0x557921e042b0]
mongod(+0x110A856) [0x557921def856]
mongod(+0x112495B) [0x557921e0995b]
mongod(+0x112502D) [0x557921e0a02d]
mongod(+0x1125C70) [0x557921e0ac70]
mongod(+0xF06416) [0x557921beb416]
mongod(+0x1114CFA) [0x557921df9cfa]
mongod(+0xF44C75) [0x557921c29c75]
mongod(+0xF459CD) [0x557921c2a9cd]
mongod(+0xE59619) [0x557921b3e619]
mongod(+0xE5C532) [0x557921b41532]
mongod(+0xE64CDD) [0x557921b49cdd]
mongod(+0xE6614D) [0x557921b4b14d]
mongod(+0xE76819) [0x557921b5b819]
??? [0x3ad7ed2fa26d]
----- END BACKTRACE -----
I will try to reproduce it with mongo:3.6.21 and the changed command.
This is a major issue for us and we need a fix or workaround this week.
Because it looks like it is connected to the function stats, could you point me to the lines in the noobaa-core code where the stats are written? I'd like to comment them out and check whether the error still occurs. That would be a sufficient workaround for us for now; rather no stats and a working NooBaa instance than stats and an unstable one.
With mongo:3.6.21 it looks way better: 30k events so far and no seg fault! Thanks @guymguym!
After further testing we hit a barrier at 60k events. Everything works fine until you try to visualize the events via the UI. That renders the noobaa-db pod unusable: it constantly hits its 2-core limit trying to mapReduce all of the func_stats, which causes other calls to time out because MongoDB does not return anything. I have now disabled func_stats by commenting out this: https://github.com/noobaa/noobaa-core/blob/0afe88a9d0ccdab9e992c889bd80762ab37de81c/src/server/func_services/func_stats_store.js#L40-L48
NooBaa is now very responsive even when all of our 10 endpoints constantly hit their maximum of 8 cores. Please consider fixing the way the stats are calculated; we really like the feature, but as long as it creates performance issues we will keep it disabled.
@QuintinBecker @lallinger-arbeit
I've noticed that the mapReduce isn't scanning an index.
Please try adding the following index to the collection and re-run your mapReduce queries.
db.func_stats.createIndex({ system: 1, func: 1, time: 1 });
Let me know if that helped.
Thank you.
@jeniawhite I just tested it with a fresh NooBaa instance, and unfortunately it does not seem to improve performance; the same problems as before persist.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.