noobaa-core icon indicating copy to clipboard operation
noobaa-core copied to clipboard

Enable support for Memory in HPA view

Open akmithal opened this issue 3 years ago • 11 comments

Environment info

  • NooBaa Version: master-20210926
  • Platform: OCP 4.8.9

Actual behavior

  1. Download of large no. of files failed with error

Expected behavior

  1. Download shouldn't fail

Steps to reproduce

  1. In order to attempt verification of defect 6707, found this new set of errors.

From app-server, got error:

download failed: s3://bucket-3/obj_112983 to ./obj_112983 An error occurred (504) when calling the GetObject operation: Gateway Timeout

From endpoint logs:

Sep-30 6:09:18.035 [Endpoint/13] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>SlowDown</Code><Message>Reduce your request rate.</Message><Resource>
/bucket-3/obj_112711</Resource><RequestId>ku6jccni-1vil04-1bef</RequestId></Error> GET /bucket-3/obj_112711 {"accept-encoding":"identity","user-agent":"aws-cli/2.1.29 Python/3.8.8 Linux/4.18.0-305.
19.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.sync","x-amz-date":"20210930T060718Z","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","authorizat
ion":"AWS4-HMAC-SHA256 Credential=BrVzxXwyhi470cqA3s0c/20210930/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=e49d840a75a56c9a982db4f536a8e48fdd3997c44b0e
4742a8c23e6de867abfd","host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-port":"443","x-forwarded-proto":"https","for
warded":"for=10.17.66.193;host=s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com;proto=https","x-forwarded-for":"10.17.66.193"} Error: Semaphore Timeout
    at Semaphore._on_timeout (/root/node_modules/noobaa-core/src/util/semaphore.js:233:25)
    at Timeout.<anonymous> (/root/node_modules/noobaa-core/src/util/semaphore.js:229:53)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)
Sep-30 6:09:18.060 [Endpoint/13]    [L0] core.endpoint.s3.ops.s3_get_object:: request aborted: undefined
Sep-30 6:09:18.068 [Endpoint/13]    [L0] core.sdk.object_sdk:: validate_non_nsfs_bucket:  { name: SENSITIVE-7fa8f881394f8a6d, email: SENSITIVE-3d288e97f8a66631, access_keys: [ { access_key: SENSITI
VE-d22de9a58fe9d068, secret_key: SENSITIVE-ebf6e1ecaa99829a } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'nsfs-nsr-1', can_create_bucket
s: true, nsfs_account_config: { gid: 800, uid: 5000, nsfs_only: true, new_buckets_path: '/', backend: 'GPFS', warn_threshold_ms: 100 }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external
_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } } { fs_root_path: '/nsfs/nsfs-nsr-1', fs_backend: 'GPFS', id: '6151b1d55fd0290029f00859', name: 'nsfs-nsr-1' }
Sep-30 6:09:18.068 [Endpoint/13]    [L1] core.endpoint.s3.s3_rest:: S3 REQUEST GET /bucket-3/obj_113071 op get_object request_id ku6jex9w-2hugkd-14z7 { 'accept-encoding': 'identity', 'user-agent': 
'aws-cli/2.1.29 Python/3.8.8 Linux/4.18.0-305.19.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.sync', 'x-amz-date': '20210930T060918Z', 'x-amz-content-sha256': 'e3b0c44298fc1c149afbf4c8996
fb92427ae41e4649b934ca495991b7852b855', authorization: 'AWS4-HMAC-SHA256 Credential=BrVzxXwyhi470cqA3s0c/20210930/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Sign
ature=e3c108699b6e59bcfa59fe7a0f987bce0fc6093e6e4bd342a9b5e6befac0a90d', host: 's3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com', 'x-forwarded-host': 's3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com', 'x-f
orwarded-port': '443', 'x-forwarded-proto': 'https', forwarded: 'for=10.17.66.193;host=s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com;proto=https', 'x-forwarded-for': '10.17.66.193' }
Sep-30 6:09:18.070 [Endpoint/13]    [L1] core.endpoint.s3.ops.s3_get_object:: reading object undefined 512
Sep-30 6:09:18.070 [Endpoint/13]    [L0] core.sdk.namespace_fs:: NamespaceFS: read_object_stream { file_path: '/nsfs/nsfs-nsr-1/bucket-3/obj_113071', start: 0, end: Infinity }
Sep-30 6:09:18.071 [Endpoint/13]    [L1] core.util.buffer_utils:: BufferPool.get_buffer BuffersPool {
  buf_size: 8388608,
  buffers: [],
  sem: Semaphore {
    _initial: 536870912,
    _value: 0,
    _waiting_value: 3347054592,
    _wq: WaitQueue {
      _name: '_wq_',
      _q: <ref *2> LinkedList {
        next: '_lln_',
        prev: '_llp_',
        head: '_llh_',
        length: 399,
        _lln_: <ref *7> {
          count: 8388608,
          time: 1632982038135,
          _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
          _lln_: <ref *6> {
            count: 8388608,
            time: 1632982057848,
            _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
            _lln_: <ref *5> {
              count: 8388608,
              time: 1632982058049,
              _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
              _lln_: <ref *4> {
                count: 8388608,
                time: 1632982058158,
                _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
                _lln_: <ref *3> { count: 8388608, time: 1632982058163, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } }, _lln_: <ref *1> { count: 8388608, time: 1632982058180, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: [Promise] }, _lln_: { count: 8388608, time: 1632982058528, _wq_: [Defer], _lln_: [Object], _llp_: [Circular *1], _llh_: [Circular *2] }, _llp_: [Circular *3], _llh_: [Circular *2] }, _llp_: [Circular *4], _llh_: [Circular *2] },
                _llp_: [Circular *5],
                _llh_: [Circular *2]
              },
              _llp_: [Circular *6],
              _llh_: [Circular *2]
            },
            _llp_: [Circular *7],
            _llh_: [Circular *2]
          },
          _llp_: [Circular *2],
          _llh_: [Circular *2]
        },
        _llp_: <ref *8> {
          count: 8388608,
          time: 1632982157888,
          _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
          _lln_: [Circular *2],
          _llp_: <ref *9> {
            count: 8388608,
            time: 1632982157784,
            _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
            _lln_: [Circular *8],
            _llp_: <ref *10> {
              count: 8388608,
              time: 1632982157778,
              _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
              _lln_: [Circular *9],
              _llp_: <ref *11> {
                count: 8388608,
                time: 1632982157631,
                _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
                _lln_: [Circular *10],
                _llp_: <ref *12> { count: 8388608, time: 1632982157630, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } }, _lln_: [Circular *11], _llp_: <ref *13> { count: 8388608, time: 1632982157629, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: [Promise] }, _lln_: [Circular *12], _llp_: { count: 8388608, time: 1632982157618, _wq_: [Defer], _lln_: [Circular *13], _llp_: [Object], _llh_: [Circular *2] }, _llh_: [Circular *2] }, _llh_: [Circular *2] },
                _llh_: [Circular *2]
              },
              _llh_: [Circular *2]
            },
            _llh_: [Circular *2]
          },
          _llh_: [Circular *2]
        },
        _llh_: [Circular *2]
      }
    },
    _verbose: false,
    _timeout: 120000,
    _timeout_error_code: 'IO_STREAM_ITEM_TIMEOUT',
    _work_timeout_error_code: 'SEMAPHORE_WORKER_TIMEOUT',
    _timer: Timeout { _idleTimeout: 101, _idlePrev: [TimersList], _idleNext: [TimersList], _idleStart: 233845656, _onTimeout: [Function (anonymous)], _timerArgs: undefined, _repeat: null, _destroyed: false, [Symbol(refed)]: true, [Symbol(kHasPrimitive)]: false, [Symbol(asyncId)]: 495467329, [Symbol(triggerId)]: 495467301 }
  }
}
Sep-30 6:09:18.136 [Endpoint/13] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>SlowDown</Code><Message>Reduce your request rate.</Message><Resource>/bucket-3/obj_112712</Resource><RequestId>ku6jccqc-3k264q-4eu</RequestId></Error> GET /bucket-3/obj_112712 {"accept-encoding":"identity","user-agent":"aws-cli/2.1.29 Python/3.8.8 Linux/4.18.0-305.19.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.sync","x-amz-date":"20210930T060718Z","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","authorization":"AWS4-HMAC-SHA256 Credential=BrVzxXwyhi470cqA3s0c/20210930/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=42087996203a733867660ef587b9e8a5dbe321f355c3c6bcdde2c0b5d9ba9ac8","host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-port":"443","x-forwarded-proto":"https","forwarded":"for=10.17.66.193;host=s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com;proto=https","x-forwarded-for":"10.17.66.193"} Error: Semaphore Timeout
    at Semaphore._on_timeout (/root/node_modules/noobaa-core/src/util/semaphore.js:233:25)
    at Timeout.<anonymous> (/root/node_modules/noobaa-core/src/util/semaphore.js:229:53)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)

More information - Screenshots / Logs / Other output

akmithal avatar Sep 30 '21 06:09 akmithal

must-gather-3009_2.tar.gz

Hi @jeniawhite , @nimrod-becker - This is something new found just now.

akmithal avatar Sep 30 '21 06:09 akmithal

Instead of this failure, shouldn't there be delay introduced for serving this IO request ?

Please see from logs that memory and CPU utilization wasn't high when these errors got triggered.

akmithal avatar Sep 30 '21 06:09 akmithal

@akmithal Works as designed, we return the slow-down message to the client and expect it to reduce the rate of requests. This is returned due to a bottleneck on memory utilization for the NSFS flows. The next step would be to add memory utilization to the HPA and fine-tune the amount of memory that we allocate for IO in the NSFS flows.

jeniawhite avatar Oct 03 '21 11:10 jeniawhite

Hi @jeniawhite , are we going to trim down the error logs for this error. If yes, you can use this bug to track this work. If it is going to be like this only / PR already open, will close this bug then.

akmithal avatar Oct 12 '21 07:10 akmithal

Hi @nimrod-becker , Is there a PR/workitem created to track this work "add memory utilization to HPA"? Else, we can use this defect for it. FYI @romayalon

akmithal avatar Apr 06 '22 06:04 akmithal

We can keep this issue until we have a task tracking this, basically, we wanted to go with something that eventually did not make it to openshift so we need to wait until the new way is available.

nimrod-becker avatar Apr 06 '22 08:04 nimrod-becker

@romayalon , could you link the JIRA ticket here for reference ?

rkomandu avatar Sep 06 '22 04:09 rkomandu

No Jira ticket, this needs to be scheduled for a certain release

nimrod-becker avatar Sep 06 '22 08:09 nimrod-becker