
Stream: parallelize processing of raw events

Open chuwy opened this issue 4 years ago • 7 comments

chuwy avatar Jun 20 '20 10:06 chuwy

Migrated from https://github.com/snowplow/snowplow/issues/1537 (comments are auto-generated)

chuwy avatar Jun 20 '20 10:06 chuwy

I'll piggyback on this issue. I'm maintaining an open-source Snowplow pipeline on AWS and ran into scaling issues with Enrich and Kinesis. It seems I can't increase Enrich throughput unless I increase the number of Kinesis shards (I'd like to avoid that); if there are more CPU cores than Kinesis shards, the remaining cores sit more or less idle. I was hoping each stream processor would use more than one core to process a batch, but looking at the source code (and this issue), it looks like it doesn't.

What's your current point of view on this issue? Is this still something you're considering? I have very limited experience with Scala, but with some guidance I'm sure we could produce a PR to add this feature (I hope I'm not massively underestimating the effort). I'd be happy to contribute back to Snowplow!

lenn4rd avatar Oct 04 '21 11:10 lenn4rd

Hi @lenn4rd, you're right that in Stream Enrich parallelism is tightly coupled to the number of shards and we don't have control over it.

The good news is that we are close to releasing a new enrich asset for Kinesis (based on fs2). This asset will make better use of resources by default, and it will be possible to fine-tune the parallelism through the configuration, independently of the number of shards.

The release post will be published on Discourse, stay tuned!

benjben avatar Oct 06 '21 08:10 benjben

Fantastic, thanks for the heads-up! Let me know if there's anything I can help with. I'm sure you have everything you need to test it but if you need another non-production pipeline to run it against, that can be arranged.

lenn4rd avatar Oct 06 '21 08:10 lenn4rd

Thanks a lot for your help! We gladly accept your offer to test it early. We'd like to run a few more tests first and then we'll ask you.

benjben avatar Oct 06 '21 08:10 benjben

Hi @lenn4rd, sorry I forgot to come back to you about this. You can already use snowplow/stream-enrich-kinesis:3.0.0-rc41 (the final version will be released this month and announced on Discourse). It should make better use of resources than Stream Enrich. You can find the minimal configuration sample for it here and a detailed one here. I think that you will be interested in this parameter.

benjben avatar Feb 18 '22 13:02 benjben

Thanks for the heads-up, @benjben! I'll check out the release candidate or the final version depending on how soon I'll get to it.

lenn4rd avatar Feb 23 '22 09:02 lenn4rd