components-contrib
components-contrib copied to clipboard
snssqs: fix consumer starvation
Closes #3479
Thanks for this PR. I can see why the change is useful and agree the current implementation has some issues.
However, by removing all waiters, what I am concerned with now is that Dapr could be "overloaded" if too many messages are received at once, as it would create new goroutines without any limit.
Perhaps a better solution could be to limit the number of concurrent goroutines that are created? Some other components (like Azure Service Bus) have similar configuration options where you can set a max number of active messages. This way there's also some "natural backpressure" implemented.
Thanks for this PR. I can see why the change is useful and agree the current implementation has some issues.
However, by removing all waiters, what I am concerned with now is that Dapr could be "overloaded" if too many messages are received at once, as it would create new goroutines without any limit.
Perhaps a better solution could be to limit the number of concurrent goroutines that are created? Some other components (like Azure Service Bus) have similar configuration options where you can set a max number of active messages. This way there's also some "natural backpressure" implemented.
Backpressure is definitively valuable, and from what I can see, implementations like Redis are already controlling the number of goroutines: https://github.com/dapr/components-contrib/blob/main/pubsub/redis/redis.go#L89.
However, this sounds like a global concern that could be implemented at the pubsub level and not as a driver feature. I think that a global concurrency control allow operators to switch from pubsub implementation keeping the same backpressure settings. Does that makes sense?
However, this sounds like a global concern that could be implemented at the pubsub level and not as a driver feature. I think that a global concurrency control allow operators to switch from pubsub implementation keeping the same backpressure settings.
That does make sense, but it would require some major changes to a lot of components. Additionally integrating that would not be straightforward, since Dapr currently delegates concurrency control for this scenario to individual components.
If for now you could implement a maxConcurrentMessages (or maxParallelism) option, that would be nice and we can then fix this bug!
That does make sense, but it would require some major changes to a lot of components. Additionally integrating that would not be straightforward, since Dapr currently delegates concurrency control for this scenario to individual components.
You are right. Is this something you think its worth working on?
If for now you could implement a
maxConcurrentMessages(ormaxParallelism) option, that would be nice and we can then fix this bug!
Sure, I'm happy to add the feature. Initially I thought about concurrencyLimit which matches the concurrencyMode prefix, but I don't have a string opinion.
Also, some questions:
- what do you think is a sensible default?
- should we allow the operator to uncap the concurrency limit (let's say you want to control the limit in the caller's code (outside dapr) ?
Initially I thought about concurrencyLimit which matches the concurrencyMode prefix, but I don't have a string opinion.
concurrencyLimit is a good value!
what do you think is a sensible default?
I just looked at the Azure Service Bus component and looks like the default is actually 0, i.e. unlimited. I'm not a fan of unlimited, but for consistency it could make sense. However, I'll accept other suggestions too!
should we allow the operator to uncap the concurrency limit (let's say you want to control the limit in the caller's code (outside dapr) ?
My gut initially said "no", but then I saw that other components do allow un-capping the limit (like Service Bus), so perhaps it does make sense. (although I would personally not recommend controlling the limit on the caller (i.e. the publisher) side)
concurrencyLimitis a good value!
Ok, let'a use that one then.
I just looked at the Azure Service Bus component and looks like the default is actually 0, i.e. unlimited. I'm not a fan of unlimited, but for consistency it could make sense. However, I'll accept other suggestions too!
While Azure and --app-max-concurrency defaults to unlimited, Redis defaults to "10". However, since there is a wide range of use cases, I think default to unlimited makes more sense.
My gut initially said "no", but then I saw that other components do allow un-capping the limit (like Service Bus), so perhaps it does make sense. (although I would personally not recommend controlling the limit on the caller (i.e. the publisher) side)
Right, by caller's code I mean the Subscriber's handler. An analogous example will be Go's http server, which does not impose any limitation on the number of handlers and it's up to the user to set such a limit.
In summary, I'd document this as follows:
| Field | Required | Details | Example |
|---|---|---|---|
| concurrencyLimit | N | Defines the maximum number of concurrent workers handling messages. This value is ignored when concurrentMode is set to "single". To avoid limiting the number of concurrent workers set this to "0". Default: "0" |
"100" |
@qustavo can you open an issue in this repo and also explain the problem there then link to it in this PR? We don't usually accept PRs from non-maintainers/approvers without an issue being open first (and I don't mean issues that have been opened at the same time as the PR)
Please note that the 1.14 release is already underway and we have already cut the components-contrib release branch for 1.14. So this will not make it into the 1.14 release most likely. We can assess the need for this fix to improvement to make it into the 1.14 release within the issue (to be created).
@qustavo can you open an issue in this repo and also explain the problem there then link to it in this PR? We don't usually accept PRs from non-maintainers/approvers without an issue being open first (and I don't mean issues that have been opened at the same time as the PR)
Sure, will create issue
I'm ok with merging now - but in the absence of an issue this definitely cannot be considered to be included in 1.14.
/ok-to-test
Components certification test
Commit ref: 9acedbf9e5dbd012e846895ab16b301acdeadd49
❌ Some certification tests failed
These tests failed:
- bindings.azure.eventhubs
- bindings.azure.servicebusqueues
- pubsub.aws.snssqs
- pubsub.mqtt3
- secretstores.hashicorp.vault
- state.azure.cosmosdb
Components conformance test
Commit ref: 9acedbf9e5dbd012e846895ab16b301acdeadd49
❌ Some conformance tests failed
These tests failed:
- bindings.azure.eventgrid
- pubsub.aws.snssqs.terraform
Complete Build Matrix
The build status is currently not updated here. Please visit the action run below directly.
Commit ref: 9acedbf9e5dbd012e846895ab16b301acdeadd49
Unfortunately the SNS integration tests have been broken. I think this needs to be fixed by @sicoyle @artursouza or someone else with more AWS integration test experience (and the credentials to manage the test infrastructure).
While I'm sure nothing is inherently wrong with this PR, we just cannot be certain due to the broken tests (which were not broken by this PR).
@berndverst AFAICS the issue is with pubsub.mqtt3 certification. How is this related to SNS/SQS?
/ok-to-test
Complete Build Matrix
The build status is currently not updated here. Please visit the action run below directly.
Commit ref: 882ac66bf52118bcb444ea0c77fad97ef1d15f53
Components conformance test
Commit ref: 882ac66bf52118bcb444ea0c77fad97ef1d15f53
❌ Some conformance tests failed
These tests failed:
- bindings.azure.eventgrid
- bindings.influx
- bindings.kafka-confluent
- bindings.kafka-wurstmeister
- bindings.kubemq
- bindings.mqtt3-emqx
- bindings.mqtt3-mosquitto
- bindings.mqtt3-vernemq
- bindings.postgresql.docker
- bindings.rabbitmq
- bindings.redis.v6
- bindings.redis.v7
- configuration.postgresql.docker
- configuration.redis.v6
- configuration.redis.v7
- lock.redis.v6
- lock.redis.v7
- pubsub.aws.snssqs.terraform
- pubsub.jetstream
- pubsub.kafka-confluent
- pubsub.kafka-wurstmeister
- pubsub.kubemq
- pubsub.mqtt3-emqx
- pubsub.mqtt3-vernemq
- pubsub.pulsar
- pubsub.rabbitmq
- pubsub.redis.v6
- pubsub.solace
- secretstores.hashicorp.vault
- state.cassandra
- state.cockroachdb.v1
- state.etcd.v1
- state.etcd.v2
- state.memcached
- state.mysql.mariadb
- state.mysql.mysql
- state.oracledatabase
- state.postgresql.v1.docker
- state.postgresql.v2.docker
- state.redis.v6
- state.redis.v7
- state.rethinkdb
- state.sqlserver
Components certification test
Commit ref: 882ac66bf52118bcb444ea0c77fad97ef1d15f53
❌ Some certification tests failed
These tests failed:
- bindings.azure.eventhubs
- bindings.zeebe.command
- bindings.zeebe.jobworker
- bindings.kafka
- bindings.postgres
- bindings.rabbitmq
- bindings.redis
- configuration.postgres
- configuration.redis
- pubsub.aws.snssqs
- pubsub.kafka
- pubsub.mqtt3
- pubsub.pulsar
- pubsub.rabbitmq
- secretstores.hashicorp.vault
- state.azure.cosmosdb
- state.cassandra
- state.cockroachdb.v1
- state.memcached
- state.mongodb
- state.mysql
- state.postgresql.v1
- state.postgresql.v2
- state.redis
- state.sqlserver
any progress on this? This is currently stopping our sns/sqs deployment, please lmk if there's anything I can do to get it merged.
any progress on this? This is currently stopping our sns/sqs deployment, please lmk if there's anything I can do to get it merged.
Taking a look today
any progress on this? This is currently stopping our sns/sqs deployment, please lmk if there's anything I can do to get it merged.
Taking a look today
any insight?
@qustavo the build is failing, please see the errors
@qustavo the build is failing, please see the errors
Sorry, I assumed it was a certification error, seems like it was a merge issue, will fix.
/ok-to-test
@ItalyPaleAle @yaron2 Trying to reproduce the CI error by running:
DAPR_TEST_MQTT_URL=tcp://localhost:1883 go test ./... -v
But locally all my tests are passing, not really sure how to reproduce it, any suggestion?
/ok-to-test
Components conformance test
Commit ref: 7827b515e24fbbe11fa4859ad7d59fd67d838acc
❌ Some conformance tests failed
These tests failed:
- bindings.azure.eventgrid
Complete Build Matrix
The build status is currently not updated here. Please visit the action run below directly.
Commit ref: 7827b515e24fbbe11fa4859ad7d59fd67d838acc
Components certification test
Commit ref: 7827b515e24fbbe11fa4859ad7d59fd67d838acc
❌ Some certification tests failed
These tests failed:
- pubsub.gcp.pubsub
- state.azure.cosmosdb
- state.sqlserver