
IO upload fails when worker node where Noobaa core pod runs is shutdown

Open · rkomandu opened this issue 2 years ago · 20 comments

Environment info

  • NooBaa Version: 5.9.2 (see the noobaa status output below)
  • Platform: OpenShift 4.9.5 / ODF 4.9.2 (see the oc version and oc get csv output below)

oc version
Client Version: 4.9.5
Server Version: 4.9.5
Kubernetes Version: v1.22.0-rc.0+a44d0f0

oc get csv
NAME                  DISPLAY                       VERSION   REPLACES              PHASE
mcg-operator.v4.9.2   NooBaa Operator               4.9.2     mcg-operator.v4.9.1   Succeeded
ocs-operator.v4.9.2   OpenShift Container Storage   4.9.2     ocs-operator.v4.9.1   Succeeded
odf-operator.v4.9.2   OpenShift Data Foundation     4.9.2     odf-operator.v4.9.1   Succeeded

ODF:4.9.2-9 build

noobaa status
INFO[0000] CLI version: 5.9.2
INFO[0000] noobaa-image: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:5507f2c1074bfb023415f0fef16ec42fbe6e90c540fc45f1111c8c929e477910
INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:b314ad9f15a10025bade5c86857a7152c438b405fdba26f64826679a5c5bff1b
INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:623bdaa1c6ae047db7f62d82526220fac099837afd8770ccc6acfac4c7cff100
INFO[0000] Namespace: openshift-storage

Actual behavior

  1. IO was started by 3 concurrent users (50G/40G/30G files) in the background to their individual buckets, and then the worker1 node was shut down. It was running the noobaa-core pod at the time of shutdown; the pod moved to the worker2 node, but the IO failed as shown below.

grep upload /tmp/noobaa-core-worker1.down.04Feb2022.log
upload failed: ../dd_file_40G to s3://newbucket-u5300-01feb/dd_file_40G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.
upload failed: ../dd_file_30G to s3://newbucket-u5302-01feb/dd_file_30G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.
upload failed: ../dd_file_50G to s3://newbucket-u5301-01feb/dd_file_50G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.

urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.17.127.180'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
upload failed: ../dd_file_30G to s3://newbucket-u5302-01feb/dd_file_30G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.
..
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.17.127.179'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.17.127.179'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
upload failed: ../dd_file_50G to s3://newbucket-u5301-01feb/dd_file_50G An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal error. Please try again.

Steps to reproduce

  1. Run concurrent IO from 3 nodes onto their individual buckets (a sketch for creating the test files follows the commands below):

AWS_ACCESS_KEY_ID=vCvYu1lY0AfMTJZ5n9HB AWS_SECRET_ACCESS_KEY=LFHnnQsxxS0iXOS4eDkNU1K7x1IfYG8CtgrvIsin aws --endpoint https://10.17.127.178 --no-verify-ssl s3 cp /root/dd_file_40G s3://newbucket-u5300-01feb &
AWS_ACCESS_KEY_ID=mdTnAsuzireuISl5DFXO AWS_SECRET_ACCESS_KEY=y0UvsBKs+R4FFez+FtV/tqT7e+hSToizQqPApGog aws --endpoint https://10.17.127.179 --no-verify-ssl s3 cp /root/dd_file_50G s3://newbucket-u5301-01feb &
AWS_ACCESS_KEY_ID=DDZVAUjYrCCODgg7sCbZ AWS_SECRET_ACCESS_KEY=ku5QVHRa45O/XM+z2kRwHtLtIOh1J64dyPa6Ig9b aws --endpoint https://10.17.127.180 --no-verify-ssl s3 cp /root/dd_file_30G s3://newbucket-u5302-01feb &
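For completeness, the dd_file_* test objects referenced above were presumably created with dd; the exact block size and input source are not recorded in this issue, so the following is only an assumed sketch:

# Assumed sketch only: create the large test files used in the uploads above
# (block size and /dev/urandom are guesses, not taken from the original run).
dd if=/dev/urandom of=/root/dd_file_30G bs=1M count=30720
dd if=/dev/urandom of=/root/dd_file_40G bs=1M count=40960
dd if=/dev/urandom of=/root/dd_file_50G bs=1M count=51200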

  2. NooBaa pods are as follows:

NAME                                               READY   STATUS    RESTARTS         AGE     IP              NODE
noobaa-core-0                                      1/1     Running   0                15h     10.254.14.166   worker1.rkomandu-ta.cp.fyre.ibm.com
noobaa-db-pg-0                                     1/1     Running   0                2d22h   10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0                3d3h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com
noobaa-endpoint-7bdd48fccb-8cjcn                   1/1     Running   0                179m    10.254.16.88    worker0.rkomandu-ta.cp.fyre.ibm.com
noobaa-endpoint-7bdd48fccb-f7llk                   1/1     Running   0                178m    10.254.14.167   worker1.rkomandu-ta.cp.fyre.ibm.com
noobaa-endpoint-7bdd48fccb-gwtqr                   1/1     Running   0                178m    10.254.20.132   worker2.rkomandu-ta.cp.fyre.ibm.com
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running   0                2d23h   10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running   0                2d23h   10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com
ocs-operator-57d785c8c7-bqpfl                      1/1     Running   16 (6h51m ago)   2d23h   10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com
odf-console-756c9c8bc7-4jsfl                       1/1     Running   0                2d23h   10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com
odf-operator-controller-manager-89746b599-z64f6    2/2     Running   16 (9h ago)      2d23h   10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running   0                2d23h   10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com

  3. Worker1 was shut down. Node status just before the shutdown:

[[email protected] ~]# oc get nodes
NAME                                  STATUS   ROLES    AGE   VERSION
master0.rkomandu-ta.cp.fyre.ibm.com   Ready    master   56d   v1.22.0-rc.0+a44d0f0
master1.rkomandu-ta.cp.fyre.ibm.com   Ready    master   56d   v1.22.0-rc.0+a44d0f0
master2.rkomandu-ta.cp.fyre.ibm.com   Ready    master   56d   v1.22.0-rc.0+a44d0f0
worker0.rkomandu-ta.cp.fyre.ibm.com   Ready    worker   56d   v1.22.0-rc.0+a44d0f0
worker1.rkomandu-ta.cp.fyre.ibm.com   Ready    worker   56d   v1.22.0-rc.0+a44d0f0
worker2.rkomandu-ta.cp.fyre.ibm.com   Ready    worker   56d   v1.22.0-rc.0+a44d0f0

Worker1, where noobaa-core is running, was then made down.

  4. The noobaa-core pod moves to worker2:

NAME                                               READY   STATUS              RESTARTS        AGE     IP              NODE                                  NOMINATED NODE
noobaa-core-0                                      0/1     ContainerCreating   0               1s                      worker2.rkomandu-ta.cp.fyre.ibm.com   <none>
noobaa-db-pg-0                                     1/1     Running             0               2d22h   10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running             0               3d3h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>
noobaa-endpoint-7bdd48fccb-8cjcn                   1/1     Running             0               3h14m   10.254.16.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>
noobaa-endpoint-7bdd48fccb-gwtqr                   1/1     Running             0               3h13m   10.254.20.132   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>
noobaa-endpoint-7bdd48fccb-hjh9r                   0/1     Pending             0               1s      <none>
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running             0               2d23h   10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running             0               2d23h   10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>
ocs-operator-57d785c8c7-bqpfl                      1/1     Running             16 (7h6m ago)   2d23h   10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>
odf-console-756c9c8bc7-4jsfl                       1/1     Running             0               2d23h   10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>
odf-operator-controller-manager-89746b599-z64f6    2/2     Running             16 (10h ago)    2d23h   10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running             0               2d23h   10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>

  5. oc get nodes after the shutdown:

[[email protected] ~]# oc get nodes
NAME                                  STATUS     ROLES    AGE   VERSION
master0.rkomandu-ta.cp.fyre.ibm.com   Ready      master   56d   v1.22.0-rc.0+a44d0f0
master1.rkomandu-ta.cp.fyre.ibm.com   Ready      master   56d   v1.22.0-rc.0+a44d0f0
master2.rkomandu-ta.cp.fyre.ibm.com   Ready      master   56d   v1.22.0-rc.0+a44d0f0
worker0.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   56d   v1.22.0-rc.0+a44d0f0
worker1.rkomandu-ta.cp.fyre.ibm.com   NotReady   worker   56d   v1.22.0-rc.0+a44d0f0
worker2.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   56d   v1.22.0-rc.0+a44d0f0

  6. The noobaa-core pod is Running on worker2 (migrated from the worker1 node):

Every 3.0s: oc get pods -n openshift-storage -o wide api.rkomandu-ta.cp.fyre.ibm.com: Fri Feb 4 01:08:28 2022

NAME                                               READY   STATUS    RESTARTS        AGE     IP              NODE
noobaa-core-0                                      1/1     Running   0               60s     10.254.20.168   worker2.rkomandu-ta.cp.fyre.ibm.com
noobaa-db-pg-0                                     1/1     Running   0               2d22h   10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0               3d3h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com
noobaa-endpoint-7bdd48fccb-8cjcn                   1/1     Running   0               3h15m   10.254.16.88    worker0.rkomandu-ta.cp.fyre.ibm.com
noobaa-endpoint-7bdd48fccb-gwtqr                   1/1     Running   0               3h14m   10.254.20.132   worker2.rkomandu-ta.cp.fyre.ibm.com
noobaa-endpoint-7bdd48fccb-hjh9r                   0/1     Pending   0               60s
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running   0               2d23h   10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running   0               2d23h   10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com
ocs-operator-57d785c8c7-bqpfl                      1/1     Running   16 (7h7m ago)   2d23h   10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com
odf-console-756c9c8bc7-4jsfl                       1/1     Running   0               2d23h   10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com
odf-operator-controller-manager-89746b599-z64f6    2/2     Running   16 (10h ago)    2d23h   10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running   0               2d23h   10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com

  7. The upload fails as shown above.

Expected behavior

  1. The upload shouldn't fail: the IO can still be serviced via the MetalLB IPs (HA is available), and it should complete.
NooBaa endpoint log snippet from worker0 (remember: the worker1 node was made down)
----------------------------------------------------------------------------------------------------
Feb-4 9:18:06.793 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:06.794 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:06.794 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:06.993 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:07.115 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:07.276 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:07.336 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:08.107 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:08.307 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
Feb-4 9:18:08.635 [Endpoint/14] [ERROR] CONSOLE:: Error: Warning stuck surround_count item
    at Semaphore.surround_count (/root/node_modules/noobaa-core/src/util/semaphore.js:86:29)
    at async NamespaceFS.upload_multipart (/root/node_modules/noobaa-core/src/sdk/namespace_fs.js:831:30)
    at async Object.put_object_uploadId [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/ops/s3_put_object_uploadId.js:31:17)
    at async handle_request (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:149:19)
    at async Object.s3_rest [as handler] (/root/node_modules/noobaa-core/src/endpoint/s3/s3_rest.js:68:9)
2022-02-04 09:18:11.796835 [PID-14/TID-14] [L1] FS::FSWorker::Begin: Readdir _path=/nsfs/noobaa-s3res-4080029599
2022-02-04 09:18:11.796976 [PID-14/TID-24] [L1] FS::FSWorker::Execute: Readdir _path=/nsfs/noobaa-s3res-4080029599 _uid=0 _gid=0 _backend=GPFS
2022-02-04 09:18:11.797332 [PID-14/TID-24] [L1] FS::FSWorker::Execute: Readdir _path=/nsfs/noobaa-s3res-4080029599  took: 0.270146 ms
2022-02-04 09:18:11.797409 [PID-14/TID-14] [L1] FS::FSWorker::OnOK: Readdir _path=/nsfs/noobaa-s3res-4080029599
2022-02-04 09:18:11.797557 [PID-14/TID-14] [L1] FS::FSWorker::Begin: Stat _path=/nsfs/noobaa-s3res-4080029599
2022-02-04 09:18:11.797623 [PID-14/TID-23] [L1] FS::FSWorker::Execute: Stat _path=/nsfs/noobaa-s3res-4080029599 _uid=0 _gid=0 _backend=GPFS
2022-02-04 09:18:11.797679 [PID-14/TID-23] [L1] FS::FSWorker::Execute: Stat _path=/nsfs/noobaa-s3res-4080029599  took: 0.01195 ms
2022-02-04 09:18:11.797720 [PID-14/TID-14] [L1] FS::Stat::OnOK: _path=/nsfs/noobaa-s3res-4080029599 _stat_res.st_ino=3 _stat_res.st_size=262144
Feb-4 9:18:11.797 [Endpoint/14] [L0] core.server.bg_services.namespace_monitor:: update_last_monitoring: monitoring for noobaa-s3res-4080029599, 61f22b92543779002bee71a5 finished successfully..

Collected must-gather data ("oc adm must-gather") on the cluster:

1 - when the worker1 node was down: must-gather-collected-when-worker1.down.tar.gz

2 - when the worker1 node came back to the Active state: must-gather-collected-when-worker1.up.tar.gz


rkomandu commented on Feb 04 '22

Why do you think it's not expected to fail? Core is not HA in the sense of having more than a single instance. Core is needed for operations as well, since the configuration itself resides within the core. If a certain client has enough tolerance in its operations (retries etc.) to cover the time it takes the core to migrate to worker2, then the app won't fail; if not, it will.
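For reference, one way to give an S3 client that extra tolerance is to raise the AWS CLI retry and timeout settings before starting the upload; the values below are illustrative only, not validated recommendations for this environment:

# Illustrative only: give the AWS CLI more retries and longer timeouts so a
# multipart upload can ride out the window while noobaa-core is rescheduled.
export AWS_RETRY_MODE=standard
export AWS_MAX_ATTEMPTS=10
aws configure set cli_read_timeout 300
aws configure set cli_connect_timeout 60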

@rkomandu

nimrod-becker commented on Feb 07 '22

@nimrod-becker, looks like we need to document this limitation then.

@troppens @Akshat, your thoughts on this?

rkomandu commented on Feb 07 '22

Of course it makes sense to document this; at least this is how I see it.

nimrod-becker commented on Feb 07 '22

@rkomandu - I recall that Spark has a retry counter and a retry timeout. This makes it possible to control resiliency against storage issues. I therefore concur with @nimrod-becker.
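For Spark/Hadoop clients that go through the S3A connector, the retry counter and timeout mentioned above map to properties like the ones below; the values are only an example, not a tested recommendation, and my_job.py is a placeholder:

# Example only: S3A retry/timeout knobs for a Spark job (values are illustrative).
spark-submit \
  --conf spark.hadoop.fs.s3a.attempts.maximum=20 \
  --conf spark.hadoop.fs.s3a.retry.limit=20 \
  --conf spark.hadoop.fs.s3a.retry.interval=2s \
  --conf spark.hadoop.fs.s3a.connection.timeout=200000 \
  my_job.py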

troppens commented on Feb 07 '22

Hi @nimrod-becker, @troppens, I too agree that it needs to be documented.

Is there a recommendation we can provide for setting retry count and timeout values, or shall we keep it as a generic statement, since it might differ based on the DAN node config?

akmithal commented on Feb 08 '22

@baum Do we have ballpark numbers from our testing regarding the time it takes core/db to be rescheduled on a new node after their node was shut down?

nimrod-becker commented on Feb 08 '22

@nimrod-becker, for pods without PVs, like core, the reschedule happens once the API Server marks the node as NotReady, which happens 1 minute after the node failure. For pods with PVs, like db, there is additional storage-system overhead of detaching the PV from the failing node and attaching it to the new node. This overhead depends on the storage system implementation.
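One way to see the eviction timing that applies to a given pod is to inspect its tolerations; on many clusters the admission controller adds not-ready/unreachable tolerations of 300s by default, though the operator or cluster configuration may differ, so this is only a way to check, not NooBaa-specific guidance:

# Show the NoExecute tolerations that control how long a pod may stay bound to a
# NotReady/unreachable node before eviction (cluster defaults are typically 300s).
oc get pod noobaa-core-0 -n openshift-storage \
  -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.effect}{"\t"}{.tolerationSeconds}{"\n"}{end}'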

baum commented on Feb 08 '22

@baum Do we have ballpark numbers from our testing regarding the time it takes core/db to be rescheduled on a new node after their node was shut down?

@nimrod-becker, in my environment testing on Fyre (a virtual system), not under too heavy a load (uploading 3 files of 50/40/30G to 3 different user buckets), noobaa-db-pg-0 took about 6m Xsec, as per my post in CSI defect (563).

The noobaa-core pod, per the first comment in this issue, took about 1 min in my case.
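A simple way to reproduce such ballpark numbers is to time how long the pod takes to report Ready again after the node shutdown; a minimal sketch (namespace and timeout are assumptions):

# Minimal sketch: measure how long noobaa-core-0 takes to become Ready again.
start=$(date +%s)
oc wait --for=condition=Ready pod/noobaa-core-0 -n openshift-storage --timeout=20m
echo "noobaa-core-0 Ready again after $(( $(date +%s) - start )) seconds"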

rkomandu commented on Feb 08 '22

Tried the noobaa-core pod-down scenario while running Warp from 3 concurrent users onto their individual buckets.

ODF - 4.9.5-4 (d/s builds) + HPO latest code

NAME             READY   STATUS    RESTARTS   AGE   IP             NODE                                   NOMINATED NODE   READINESS GATES
noobaa-core-0    1/1     Running   0          89m   10.254.17.37   worker0.rkomandu-513.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0   1/1     Running   0          21h   10.254.20.44   worker2.rkomandu-513.cp.fyre.ibm.com   <none>           <none>

Made worker0, the node running the noobaa-core pod, down:
oc get nodes
worker0.rkomandu-513.cp.fyre.ibm.com   NotReady   worker   15d   v1.22.3+e790d7f

NAME                               READY   STATUS              RESTARTS   AGE   IP             NODE                                   NOMINATED NODE   READINESS GATES
noobaa-core-0                      1/1     Running             0          20s   10.254.12.24   worker1.rkomandu-513.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                     1/1     Running             0          21h   10.254.20.44   worker2.rkomandu-513.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-64cc4958cd-jlbwl   0/1     ContainerCreating   0          20s   <none>         worker1.rkomandu-513.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-64cc4958cd-kgzjg   1/1     Running             0          68m   10.254.12.7    worker1.rkomandu-513.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-64cc4958cd-lwrxb   1/1     Running             0          21h   10.254.20.7    worker2.rkomandu-513.cp.fyre.ibm.com   <none>           <none>



Came across a few of these messages on the terminals where Warp is running.

3 different users 
------------------------
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> upload error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> upload error: We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> upload error: We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> delete error:  We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.


 warp mixed --host=10.17.31.215 --access-key=XY0XGrSOb2JtswIZPuYk --secret-key=LuHCy6v/txlEgnDnlp8VN --obj.size=50M --duration=60m --bucket=newbucket-5007-warp --debug --insecure --tls
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> delete error:  We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> upload error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.

warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/qupISkN2/71.WMdIPY5sApcUllsv.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/qsuzQ53V/50.Q%28Qp9z66ZkrwCV%28o.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> stat error:  Head "https://10.17.26.219/newbucket-5008-warp/rvW4Ao%29x/108.OrHUmgoLdVhGwe9x.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/C1VGbSLd/112.HoX%288YIXjV6ywBl0.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/ufEeejo0/31.Fwvd55wcbFgHoK5F.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> stat error:  Head "https://10.17.26.219/newbucket-5008-warp/0LGDXao8/61.0pPXvRCoVfAxvtBr.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> stat error:  Head "https://10.17.26.219/newbucket-5008-warp/cQUoG7Qx/43.fug1hM5eQQE9iVWL.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/BPjBSQYB/35.iVw6P65n%28frGAvXR.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/jRzhbmEM/118.J%29OmNAL5LvmUkA8P.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/hUQFP9Os/120.ISjBItSuxhGKQOvc.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/qsuzQ53V/32.OVEWJwOKoBWdknQ6.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/cQUoG7Qx/49.ZpVuIdvGwjSPvkP2.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> delete error:  Delete "https://10.17.26.219/newbucket-5008-warp/hUQFP9Os/26.l98G4%292dJjB1KA62.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> delete error:  Delete "https://10.17.26.219/newbucket-5008-warp/dVuMnyO2/2.V5ZOfiH9ohFz35Wc.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> upload error: Put "https://10.17.26.219/newbucket-5008-warp/Aq0ORN7z/66.tB3h090rp94yia%28L.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> stat error:  Head "https://10.17.26.219/newbucket-5008-warp/jRzhbmEM/44.FhPaDFD%28%29Tp5bb4c.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> stat error:  Head "https://10.17.26.219/newbucket-5008-warp/VYn8%28wR4/4.0LRUlk8HsKgzrKLJ.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/C1VGbSLd/31.lfw9nj5uo1vzHCmO.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: <ERROR> download error: Get "https://10.17.26.219/newbucket-5008-warp/F5%299l0D4/34.TiDOqAvdoVIM4STL.rnd": dial tcp 10.17.26.219:443: connect: no route to hos

We see a pod label missing on the worker0 noobaa-endpoint, since it got restarted. NooBaa and IBM Dev discussed this, and the IBM team will implement applying the label when the noobaa-endpoint restarts (a manual workaround sketch follows the output below).

oc get pods -n openshift-storage --show-labels | grep endpoint
noobaa-endpoint-64cc4958cd-kgzjg   1/1   Running   0   105m   app=noobaa,ibm-spectrum-scale-das-node=worker1.rkomandu-513.cp.fyre.ibm.com,noobaa-s3=noobaa,pod-template-hash=64cc4958cd
noobaa-endpoint-64cc4958cd-lwrxb   1/1   Running   0   22h    app=noobaa,ibm-spectrum-scale-das-node=worker2.rkomandu-513.cp.fyre.ibm.com,noobaa-s3=noobaa,pod-template-hash=64cc4958cd
noobaa-endpoint-64cc4958cd-nvkhh   1/1   Running   0   16m    app=noobaa,noobaa-s3=noobaa,pod-template-hash=64cc4958cd   ---> this one is missing the das label
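Until that change lands, the missing label could presumably be reapplied by hand on the restarted endpoint pod; this is a workaround sketch only, where <node-name> is a placeholder for the node the pod is actually running on:

# Workaround sketch, not the official fix: reapply the DAS node label on the
# endpoint pod that came back without it (<node-name> is a placeholder).
oc label pod noobaa-endpoint-64cc4958cd-nvkhh \
  ibm-spectrum-scale-das-node=<node-name> \
  -n openshift-storage --overwrite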

rkomandu commented on Apr 07 '22

Let me see whether running the same Warp test on bare metal (in a week or so) gives better results.

rkomandu commented on Apr 07 '22

Posting for reference, with the run results on Fyre

[[email protected] ~]# warp mixed --host=10.17.31.215 --access-key=XY0XGrSOb2JtswIZPuYk --secret-key=LuHCy6v/txlEgnDnlp8VN --obj.size=50M --duration=60m --bucket=newbucket-5007-warp --debug --insecure --tls
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> delete error:  We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> stat error:  500 Internal Server Error
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> upload error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: <ERROR> download error: We encountered an internal error. Please try again.
warp: Benchmark data written to "warp-mixed-2022-04-07[010533]-cofK.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 59m59s.
Errors: 1
 * Throughput: 1.02 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 1h0m0s.
Errors: 7
 * Throughput: 218.60 MiB/s, 4.58 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 1h0m3s.
Errors: 1
 * Throughput: 72.89 MiB/s, 1.53 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 59m59s.
Errors: 4
 * Throughput: 3.06 obj/s

Cluster Total: 291.47 MiB/s, 10.19 obj/s, 13 errors over 1h0m0s.
Total Errors:13.
warp: Cleanup Done.

other user warp summary
----------------------------
warp: <ERROR> upload error: Put "https://10.17.26.219/newbucket-5008-warp/JMBjM7Yy/69.brQ0JmKDu5cA5yuy.rnd": dial tcp 10.17.26.219:443: connect: no route to host
warp: Benchmark data written to "warp-mixed-2022-04-07[010625]-rVBZ.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 12m24s.
Errors: 68
 * Throughput: 1.08 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 12m25s.
Errors: 313
 * Throughput: 278.61 MiB/s, 4.87 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 12m23s.
Errors: 117
 * Throughput: 92.32 MiB/s, 1.61 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 12m25s.
Errors: 210
 * Throughput: 3.24 obj/s

Cluster Total: 370.39 MiB/s, 10.79 obj/s, 708 errors over 12m25s.
Total Errors:708.
warp: <ERROR> Get "https://10.17.26.219/newbucket-5008-warp/?delimiter=&encoding-type=url&fetch-owner=true&list-type=2&prefix=mUbQh%28SH%2F": dial tcp 10.17.26.219:443: connect: no route to host
warp: Cleanup Done.

rkomandu commented on Apr 07 '22

Hi @rkomandu, shall we close this defect as we have got the fix from HPO Development, or would you like to keep it open until verification of Story 294372 is done?

akmithal commented on Apr 26 '22

This was tried on a compact-cluster configuration of an OCP cluster on bare metal. With the label fix for noobaa-endpoint in the DAS component, the noobaa-core pod-down scenario was tried again; the Warp run against the node that was made down produced the result below. The IP moved from that node to another node, and the run hit 4 errors.

warp mixed --host=10.49.0.111 --access-key=dcKdTNdc77fYBIjcHi3m --secret-key=JF56+c3nE97wL4vv1hM/p9Y0beVsOjeAeKSE0Q6r --obj.size=60M --duration=80m --bucket=newbucket-5101-26apr-noobaacore-only --debug --insecure --tls
warp: <ERROR> download error: read tcp 10.49.0.26:32984->10.49.0.111:443: read: connection timed out
warp: <ERROR> download error: read tcp 10.49.0.26:33014->10.49.0.111:443: read: connection timed out
warp: <ERROR> download error: read tcp 10.49.0.26:33012->10.49.0.111:443: read: connection timed out
warp: <ERROR> download error: read tcp 10.49.0.26:33010->10.49.0.111:443: read: connection timed out
warp: Benchmark data written to "warp-mixed-2022-04-26[123029]-Kj48.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 1h19m59s.
 * Throughput: 1.20 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 1h20m0s.
Errors: 4
 * Throughput: 309.07 MiB/s, 5.40 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 1h20m1s.
 * Throughput: 103.01 MiB/s, 1.80 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 1h19m59s.
 * Throughput: 3.60 obj/s

Cluster Total: 412.08 MiB/s, 12.00 obj/s, 4 errors over 1h20m0s.
Total Errors:4.
warp: Cleanup Done

@akmithal, the above are the findings for noobaa-core when the node it runs on is made down.

I have tried another scenario where noobaa-core and noobaa-db run on the same node and powered that node off/on; it exhibited a different workflow, with errors for all 3 users.

@nimrod-becker @baum @romayalon Question: when noobaa-core and noobaa-db-pg are running on the same node, the recovery takes time (noobaa-db-pg takes a few minutes; this is documented already). Does noobaa-core need to communicate with noobaa-db when it comes up (or vice versa) while IO is running? I see a good number of errors in Warp on all 3 concurrent runs spawned onto each noobaa-endpoint running on a node.

rkomandu commented on Apr 28 '22

Attaching the Warp results for the question asked. There are many errors; why? This is for noobaa-db and noobaa-core running on the same node that was made down.

[root@hpo-app11 ip-config]# warp mixed --host=10.49.0.111 --access-key=dcKdTNdc77fYBIjcHi3m --secret-key=JF56+c3nE97wL4vv1hM/p9Y0beVsOjeAeKSE0Q6r -ation=80m --bucket=newbucket-5101-26apr-noobaacore-warp --debug --insecure --tls

warp: Benchmark data written to "warp-mixed-2022-04-26[061118]-YqAy.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 1h19m59s.
Errors: 203
 * Throughput: 1.56 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 1h20m0s.
Errors: 908
 * Throughput: 401.02 MiB/s, 7.01 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 1h20m1s.
Errors: 300
 * Throughput: 133.70 MiB/s, 2.34 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 1h19m58s.
Errors: 604
 * Throughput: 4.67 obj/s

Cluster Total: 534.74 MiB/s, 15.58 obj/s, 2015 errors over 1h20m0s.
Total Errors:2015.
warp: Cleanup Done.

[root@hpo-app11 ~]# warp mixed --host=10.49.0.109 --access-key=AxCD0SlR1ild7yupwCo3 --secret-key=SkUBzfi3HbFsRd96Bpd4TOkX34Oc80v1vpPPAdaK --obj.sizm --bucket=newbucket-5103-26apr-noobaacore-warp --debug --insecure --tls

warp: Benchmark data written to "warp-mixed-2022-04-26[060939]-OwkB.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 1h24m59s.
Errors: 91
 * Throughput: 1.21 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 1h25m0s.
Errors: 417
 * Throughput: 260.29 MiB/s, 5.46 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 1h25m1s.
Errors: 157
 * Throughput: 86.57 MiB/s, 1.82 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 1h24m59s.
Errors: 280
 * Throughput: 3.64 obj/s

Cluster Total: 346.88 MiB/s, 12.13 obj/s, 945 errors over 1h25m0s.
Total Errors:945.
warp: Cleanup Done


[root@hpo-app11 ~]# warp mixed --host=10.49.0.110 --access-key=qHLmebqDyKIJRjzpHfBN --secret-key=I6thSWhjvCPbFz+ea2rm1hVOk08k0+vT5kA4gPxb --obj.sizm --bucket=newbucket-5102-26apr-nobaacore-warp --debug --insecure --tls
warp: Benchmark data written to "warp-mixed-2022-04-26[060757]-mnkJ.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 1h14m59s.
Errors: 200
 * Throughput: 2.34 obj/s

Operation: GET, 45%, Concurrency: 20, Ran 1h15m0s.
Errors: 887
 * Throughput: 401.84 MiB/s, 10.53 obj/s

Operation: PUT, 15%, Concurrency: 20, Ran 1h15m1s.
Errors: 297
 * Throughput: 133.82 MiB/s, 3.51 obj/s

Operation: STAT, 30%, Concurrency: 20, Ran 1h14m59s.
Errors: 597
 * Throughput: 7.02 obj/s

Cluster Total: 535.65 MiB/s, 23.41 obj/s, 1981 errors over 1h15m0s.
Total Errors:1981.
warp: Cleanup Done.



rkomandu commented on Apr 28 '22

@nimrod-becker @baum @romayalon FYI, when the node running the noobaa-core pod is made down, it takes about 1 min or so for the pod to come back up, as we have discussed earlier. When the noobaa-core pod came back to the Running state while the noobaa-db pod was still active, there were only 4 errors from Warp, as posted in my recent comment.

On the question I asked: when noobaa-core and noobaa-db run on the same node and that node is powered off, the noobaa-db pod takes about 6m Xsec to recover, and until then the restarted noobaa-core can't communicate with it. Are the Warp errors seen on all 3 running nodes due to that, and are they expected?
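To correlate that error window with the db recovery, the pod rescheduling and the PV re-attach described earlier in this thread can be watched directly while the node is down; a rough sketch:

# Rough sketch: watch the db pod reschedule and its volume attachment move
# during the failover window.
oc get pods -n openshift-storage -o wide -w | grep noobaa-db-pg &
oc get volumeattachments -w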

@troppens @akmithal

rkomandu commented on Apr 28 '22

Core needs the db both when it comes up and also during regular work.

Since we see the core is being rescheduled as expected, is there a bug here or can we close?

nimrod-becker commented on Apr 28 '22

@nimrod-becker, so we are going to lose about 6m Xsec in that case, as noobaa-db has the current limitation on the CSI for now.

@troppens
We should document this, especially as part of the node-down scenario: there would be a loss of IO for a few minutes due to current limitations. Any thoughts?

rkomandu commented on Apr 28 '22

Core needs the db both when it comes up and also during regular work.

Since we see the core is being rescheduled as expected, is there a bug here or can we close?

@nimrod-becker, you are saying that when noobaa-core and noobaa-db are on the same node and that node goes down, noobaa-core comes back in a minute or so, however noobaa-db takes about 6m Xsec for now due to the underlying infrastructure.

Until noobaa-core can communicate with noobaa-db, we would see errors for the currently running IO; is that what you are saying? Is my understanding correct?

rkomandu commented on Apr 28 '22

@troppens We should document this, especially as part of the node-down scenario: there would be a loss of IO for a few minutes due to current limitations. Any thoughts?

Yes, we should document this.

troppens commented on Apr 28 '22

@troppens, I opened this: https://github.ibm.com/IBMSpectrumScale/hpo-core/issues/702

@nimrod-becker, I wanted to wait for the other team to do these tests.

rkomandu commented on May 02 '22