Stream: parallelize processing of raw events
Migrated from https://github.com/snowplow/snowplow/issues/1537 (comments are auto-generated)
I'll piggyback on this issue. I'm maintaining an open-source Snowplow pipeline on AWS and ran into scaling issues with Enrich and Kinesis. It seems I can't increase Enrich throughput without adding Kinesis shards, which I'd like to avoid; if there are more CPU cores than Kinesis shards, the extra cores sit more or less idle. I was hoping each stream processor would use more than one core to process a batch, but judging by the source code (and this issue), it looks like they don't.
What's your current point of view on this issue? Is this still something you're considering? I have very limited experience with Scala, but with some guidance I'm sure we could produce a PR to add this feature (I hope I'm not massively underestimating the effort). I'd be happy to contribute back to Snowplow!
Hi @lenn4rd, you're right that in Stream Enrich parallelism is tightly coupled to the number of shards and we don't have control over it.
The good news is that we are getting close to releasing a new enrich asset for Kinesis (based on fs2). This asset will make better use of resources by default, and it will be possible to fine-tune the parallelism through the configuration, independently of the number of shards.
The release post will come on discourse, stay tuned!
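For readers who want a concrete picture of the idea described above (a worker count that is set by configuration rather than by the number of shards), here is a minimal Python sketch. It is not Snowplow's implementation and does not use fs2; `enrich`, `enrich_batch` and the `parallelism` parameter are hypothetical names chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def enrich(raw_event: str) -> str:
    """Hypothetical stand-in for the per-event enrichment step."""
    return raw_event.strip().upper()


def enrich_batch(
    batch: Iterable[str],
    parallelism: int,
    enrich_fn: Callable[[str], str] = enrich,
) -> List[str]:
    """Enrich one batch using `parallelism` workers.

    The worker count is an assumed configuration value and is
    unrelated to how many shards the batch was read from.
    Results are returned in the same order as the input.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves input order even when workers finish out of order
        return list(pool.map(enrich_fn, batch))


# A batch read from a single shard, processed by four workers
records = [" page_view ", "link_click", " transaction"]
enriched = enrich_batch(records, parallelism=4)
```

Note that in CPython, threads mainly help with I/O-bound work because of the global interpreter lock; on the JVM, where Enrich runs, a thread pool like this can keep several cores busy with CPU-bound enrichment.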
Fantastic, thanks for the heads-up! Let me know if there's anything I can help with. I'm sure you have everything you need to test it but if you need another non-production pipeline to run it against, that can be arranged.
Thanks a lot for your help! We gladly accept your offer to test it early. We'd like to run a few more tests first and then we'll ask you.
Hi @lenn4rd, sorry I forgot to come back to you about this. You can already use snowplow/stream-enrich-kinesis:3.0.0-rc41 (the final version will be released this month and announced on discourse). Resources should be better used than with Stream Enrich. You can find the minimal configuration sample for it here and a detailed one here. I think that you will be interested in this parameter.
Thanks for the heads-up, @benjben! I'll check out the release candidate or the final version depending on how soon I'll get to it.