components-contrib icon indicating copy to clipboard operation
components-contrib copied to clipboard

snssqs: fix consumer starvation

Open qustavo opened this issue 1 year ago • 25 comments

Closes #3479

qustavo avatar Jul 05 '24 14:07 qustavo

Thanks for this PR. I can see why the change is useful and agree the current implementation has some issues.

However, by removing all waiters, what I am concerned with now is that Dapr could be "overloaded" if too many messages are received at once, as it would create new goroutines without any limit.

Perhaps a better solution could be to limit the number of concurrent goroutines that are created? Some other components (like Azure Service Bus) have similar configuration options where you can set a max number of active messages. This way there's also some "natural backpressure" implemented.

ItalyPaleAle avatar Jul 05 '24 21:07 ItalyPaleAle

Thanks for this PR. I can see why the change is useful and agree the current implementation has some issues.

However, by removing all waiters, what I am concerned with now is that Dapr could be "overloaded" if too many messages are received at once, as it would create new goroutines without any limit.

Perhaps a better solution could be to limit the number of concurrent goroutines that are created? Some other components (like Azure Service Bus) have similar configuration options where you can set a max number of active messages. This way there's also some "natural backpressure" implemented.

Backpressure is definitively valuable, and from what I can see, implementations like Redis are already controlling the number of goroutines: https://github.com/dapr/components-contrib/blob/main/pubsub/redis/redis.go#L89.

However, this sounds like a global concern that could be implemented at the pubsub level and not as a driver feature. I think that a global concurrency control allow operators to switch from pubsub implementation keeping the same backpressure settings. Does that makes sense?

qustavo avatar Jul 05 '24 21:07 qustavo

However, this sounds like a global concern that could be implemented at the pubsub level and not as a driver feature. I think that a global concurrency control allow operators to switch from pubsub implementation keeping the same backpressure settings.

That does make sense, but it would require some major changes to a lot of components. Additionally integrating that would not be straightforward, since Dapr currently delegates concurrency control for this scenario to individual components.

If for now you could implement a maxConcurrentMessages (or maxParallelism) option, that would be nice and we can then fix this bug!

ItalyPaleAle avatar Jul 05 '24 22:07 ItalyPaleAle

That does make sense, but it would require some major changes to a lot of components. Additionally integrating that would not be straightforward, since Dapr currently delegates concurrency control for this scenario to individual components.

You are right. Is this something you think its worth working on?

If for now you could implement a maxConcurrentMessages (or maxParallelism) option, that would be nice and we can then fix this bug!

Sure, I'm happy to add the feature. Initially I thought about concurrencyLimit which matches the concurrencyMode prefix, but I don't have a string opinion. Also, some questions:

  • what do you think is a sensible default?
  • should we allow the operator to uncap the concurrency limit (let's say you want to control the limit in the caller's code (outside dapr) ?

qustavo avatar Jul 05 '24 22:07 qustavo

Initially I thought about concurrencyLimit which matches the concurrencyMode prefix, but I don't have a string opinion.

concurrencyLimit is a good value!

what do you think is a sensible default?

I just looked at the Azure Service Bus component and looks like the default is actually 0, i.e. unlimited. I'm not a fan of unlimited, but for consistency it could make sense. However, I'll accept other suggestions too!

should we allow the operator to uncap the concurrency limit (let's say you want to control the limit in the caller's code (outside dapr) ?

My gut initially said "no", but then I saw that other components do allow un-capping the limit (like Service Bus), so perhaps it does make sense. (although I would personally not recommend controlling the limit on the caller (i.e. the publisher) side)

ItalyPaleAle avatar Jul 05 '24 23:07 ItalyPaleAle

concurrencyLimit is a good value!

Ok, let'a use that one then.

I just looked at the Azure Service Bus component and looks like the default is actually 0, i.e. unlimited. I'm not a fan of unlimited, but for consistency it could make sense. However, I'll accept other suggestions too!

While Azure and --app-max-concurrency defaults to unlimited, Redis defaults to "10". However, since there is a wide range of use cases, I think default to unlimited makes more sense.

My gut initially said "no", but then I saw that other components do allow un-capping the limit (like Service Bus), so perhaps it does make sense. (although I would personally not recommend controlling the limit on the caller (i.e. the publisher) side)

Right, by caller's code I mean the Subscriber's handler. An analogous example will be Go's http server, which does not impose any limitation on the number of handlers and it's up to the user to set such a limit.

In summary, I'd document this as follows:

Field Required Details Example
concurrencyLimit N Defines the maximum number of concurrent workers handling messages. This value is ignored when concurrentMode is set to "single". To avoid limiting the number of concurrent workers set this to "0". Default: "0" "100"

qustavo avatar Jul 06 '24 00:07 qustavo

@qustavo can you open an issue in this repo and also explain the problem there then link to it in this PR? We don't usually accept PRs from non-maintainers/approvers without an issue being open first (and I don't mean issues that have been opened at the same time as the PR)

Please note that the 1.14 release is already underway and we have already cut the components-contrib release branch for 1.14. So this will not make it into the 1.14 release most likely. We can assess the need for this fix to improvement to make it into the 1.14 release within the issue (to be created).

berndverst avatar Jul 08 '24 17:07 berndverst

@qustavo can you open an issue in this repo and also explain the problem there then link to it in this PR? We don't usually accept PRs from non-maintainers/approvers without an issue being open first (and I don't mean issues that have been opened at the same time as the PR)

Sure, will create issue

qustavo avatar Jul 08 '24 17:07 qustavo

I'm ok with merging now - but in the absence of an issue this definitely cannot be considered to be included in 1.14.

berndverst avatar Jul 08 '24 17:07 berndverst

/ok-to-test

berndverst avatar Jul 08 '24 17:07 berndverst

Components certification test

🔗 Link to Action run

Commit ref: 9acedbf9e5dbd012e846895ab16b301acdeadd49

❌ Some certification tests failed

These tests failed:

  • bindings.azure.eventhubs
  • bindings.azure.servicebusqueues
  • pubsub.aws.snssqs
  • pubsub.mqtt3
  • secretstores.hashicorp.vault
  • state.azure.cosmosdb

dapr-bot avatar Jul 08 '24 17:07 dapr-bot

Components conformance test

🔗 Link to Action run

Commit ref: 9acedbf9e5dbd012e846895ab16b301acdeadd49

❌ Some conformance tests failed

These tests failed:

  • bindings.azure.eventgrid
  • pubsub.aws.snssqs.terraform

dapr-bot avatar Jul 08 '24 17:07 dapr-bot

Complete Build Matrix

The build status is currently not updated here. Please visit the action run below directly.

🔗 Link to Action run

Commit ref: 9acedbf9e5dbd012e846895ab16b301acdeadd49

dapr-bot avatar Jul 08 '24 17:07 dapr-bot

Unfortunately the SNS integration tests have been broken. I think this needs to be fixed by @sicoyle @artursouza or someone else with more AWS integration test experience (and the credentials to manage the test infrastructure).

While I'm sure nothing is inherently wrong with this PR, we just cannot be certain due to the broken tests (which were not broken by this PR).

berndverst avatar Jul 08 '24 22:07 berndverst

@berndverst AFAICS the issue is with pubsub.mqtt3 certification. How is this related to SNS/SQS?

qustavo avatar Jul 19 '24 12:07 qustavo

/ok-to-test

berndverst avatar Aug 28 '24 18:08 berndverst

Complete Build Matrix

The build status is currently not updated here. Please visit the action run below directly.

🔗 Link to Action run

Commit ref: 882ac66bf52118bcb444ea0c77fad97ef1d15f53

dapr-bot avatar Aug 28 '24 18:08 dapr-bot

Components conformance test

🔗 Link to Action run

Commit ref: 882ac66bf52118bcb444ea0c77fad97ef1d15f53

❌ Some conformance tests failed

These tests failed:

  • bindings.azure.eventgrid
  • bindings.influx
  • bindings.kafka-confluent
  • bindings.kafka-wurstmeister
  • bindings.kubemq
  • bindings.mqtt3-emqx
  • bindings.mqtt3-mosquitto
  • bindings.mqtt3-vernemq
  • bindings.postgresql.docker
  • bindings.rabbitmq
  • bindings.redis.v6
  • bindings.redis.v7
  • configuration.postgresql.docker
  • configuration.redis.v6
  • configuration.redis.v7
  • lock.redis.v6
  • lock.redis.v7
  • pubsub.aws.snssqs.terraform
  • pubsub.jetstream
  • pubsub.kafka-confluent
  • pubsub.kafka-wurstmeister
  • pubsub.kubemq
  • pubsub.mqtt3-emqx
  • pubsub.mqtt3-vernemq
  • pubsub.pulsar
  • pubsub.rabbitmq
  • pubsub.redis.v6
  • pubsub.solace
  • secretstores.hashicorp.vault
  • state.cassandra
  • state.cockroachdb.v1
  • state.etcd.v1
  • state.etcd.v2
  • state.memcached
  • state.mysql.mariadb
  • state.mysql.mysql
  • state.oracledatabase
  • state.postgresql.v1.docker
  • state.postgresql.v2.docker
  • state.redis.v6
  • state.redis.v7
  • state.rethinkdb
  • state.sqlserver

dapr-bot avatar Aug 28 '24 18:08 dapr-bot

Components certification test

🔗 Link to Action run

Commit ref: 882ac66bf52118bcb444ea0c77fad97ef1d15f53

❌ Some certification tests failed

These tests failed:

  • bindings.azure.eventhubs
  • bindings.zeebe.command
  • bindings.zeebe.jobworker
  • bindings.kafka
  • bindings.postgres
  • bindings.rabbitmq
  • bindings.redis
  • configuration.postgres
  • configuration.redis
  • pubsub.aws.snssqs
  • pubsub.kafka
  • pubsub.mqtt3
  • pubsub.pulsar
  • pubsub.rabbitmq
  • secretstores.hashicorp.vault
  • state.azure.cosmosdb
  • state.cassandra
  • state.cockroachdb.v1
  • state.memcached
  • state.mongodb
  • state.mysql
  • state.postgresql.v1
  • state.postgresql.v2
  • state.redis
  • state.sqlserver

dapr-bot avatar Aug 28 '24 18:08 dapr-bot

any progress on this? This is currently stopping our sns/sqs deployment, please lmk if there's anything I can do to get it merged.

qustavo avatar Sep 04 '24 13:09 qustavo

any progress on this? This is currently stopping our sns/sqs deployment, please lmk if there's anything I can do to get it merged.

Taking a look today

yaron2 avatar Sep 10 '24 15:09 yaron2

any progress on this? This is currently stopping our sns/sqs deployment, please lmk if there's anything I can do to get it merged.

Taking a look today

any insight?

qustavo avatar Sep 25 '24 12:09 qustavo

@qustavo the build is failing, please see the errors

yaron2 avatar Oct 15 '24 16:10 yaron2

@qustavo the build is failing, please see the errors

Sorry, I assumed it was a certification error, seems like it was a merge issue, will fix.

qustavo avatar Oct 15 '24 23:10 qustavo

/ok-to-test

yaron2 avatar Oct 17 '24 16:10 yaron2

@ItalyPaleAle @yaron2 Trying to reproduce the CI error by running:

DAPR_TEST_MQTT_URL=tcp://localhost:1883 go test ./...  -v

But locally all my tests are passing, not really sure how to reproduce it, any suggestion?

qustavo avatar Oct 23 '24 14:10 qustavo

/ok-to-test

berndverst avatar Oct 25 '24 23:10 berndverst

Components conformance test

🔗 Link to Action run

Commit ref: 7827b515e24fbbe11fa4859ad7d59fd67d838acc

❌ Some conformance tests failed

These tests failed:

  • bindings.azure.eventgrid

dapr-bot avatar Oct 25 '24 23:10 dapr-bot

Complete Build Matrix

The build status is currently not updated here. Please visit the action run below directly.

🔗 Link to Action run

Commit ref: 7827b515e24fbbe11fa4859ad7d59fd67d838acc

dapr-bot avatar Oct 25 '24 23:10 dapr-bot

Components certification test

🔗 Link to Action run

Commit ref: 7827b515e24fbbe11fa4859ad7d59fd67d838acc

❌ Some certification tests failed

These tests failed:

  • pubsub.gcp.pubsub
  • state.azure.cosmosdb
  • state.sqlserver

dapr-bot avatar Oct 25 '24 23:10 dapr-bot