[STORM-1565] Multi-Lang Performance Improvements
https://issues.apache.org/jira/browse/STORM-1565
You may want to look at the Pyleus project's MessagePackSerializer, because I know they had to add some things in to handle weird edge cases. I'm not really familiar enough with implementing custom serializers to recognize why your approach and theirs look rather different.
Also, this would be amazing, as I was trying to convince the Pyleus folks to contribute their serializer to Storm anyway in Yelp/pyleus#159.
Thanks @dan-blanchard I'd love to hear from pyleus contributors.
cc @HeartSaVioR You might be interested in this improvement.
@vesense Sorry to reach out too late. Some comments for me:
- We might want to include this to multilang (like multilang-serializer) module so that it doesn't affect dependency if users don't use multilang feature. But I'm also OK to have this to core if we are OK to just shade msgpack.
- As @dan-blanchard stated we may want to have stable implementation of serializer if possible.
- I totally agree with @dan-blanchard that changing default behavior should be avoided unless releasing major version.
- It requires users to implement custom multi-lang logic to apply deserialization. IMO, I think it makes issue more complicated. We could provide implementation by default, but it will force user to have dependency at all, and more bad thing is that we need to address this at least three languages.
Personally I feel that we're stuck on having less-maintaining and non-library-dependent default implementations. I can take care of python / ruby implementation though I'm not expert on these, but have no idea on nodejs.
So it may need to discuss with current state of multi-lang support once rather than adding something continuously.
@HeartSaVioR Agree with you. We should have a discussion about multi-lang support. (Can open a thread on dev@ mailing list)
I'm not on the Storm dev mailing list yet—I guess I should probably fix that—but we actually already have a Python implementation of msgpack serialization in pystorm, the Python multi-lang implementation that powers streamparse.
I've been meaning to propose for a while that instead of having your own default Python multi-lang implementation in Storm that very few people use (because it's not very Pythonic or production-ready), you should instead point people to use pystorm at the very least. It provides all the functionality that the storm.py provides but with documentation, exception handling, logging, etc. pystorm is intended to be used as a starting point for more full-fledged Python-Storm interop libraries (like streamparse), but if you want to replace the bare bones provdided by storm.py with an enhanced version of the same thing, it seems like pystorm is the way to go.
@dan-blanchard Yeah, agreed. I've addressed several multi-lang issues so I had been interested on multi-lang feature of Storm, but few people have interest on maintaining / improving basic implementation of multilang protocol in various language even python. Pystorm seems great, and we lack contributors who care multi-lang implementations as one of major module so I'd still rather rely on stable implementation than build from scratch. Just one concern for me is that it seems to be heavily depending on you (I mean individual) so I'm curious we feel safe to rely on.
@dan-blanchard And please note that opinions are my own, so we should raise this to discussion and let community give various opinions if we would really want to.
Pystorm seems great, and we lack contributors who care multi-lang implementations as one of major module so I'd still rather rely on stable implementation than build from scratch. Just one concern for me is that it seems to be heavily depending on you (I mean individual) so I'm curious we feel safe to rely on.
I work full-time for a company that uses pystorm and streamparse in production, and we have a team dedicated to maintaining these projects, so I don't think you need to worry about that much.
OK great. Could you subscribe dev@ mailing list? I occasionally initiate discussion from there (multilang, too) so you might want to have a talk regarding to multilang.
Yup. I subscribed yesterday. On Wed, Jun 22, 2016 at 8:35 PM Jungtaek Lim [email protected] wrote:
OK great. Could you subscribe dev@ mailing list? I occasionally initiate discussion from there (multilang, too) so you might want to have a talk regarding to multilang.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/storm/pull/1136#issuecomment-227917710, or mute the thread https://github.com/notifications/unsubscribe/AA7l2aR0yXa0BnItmZom7dzA_50avEOzks5qOdTCgaJpZM4HfaQG .
@vesense Any update on this? It looks like pystorm/streamparse are waiting for this to be merged.
cc @dan-blanchard
And another problem is that it is required to specify the encoding character when using JSON serializing, default is utf-8 . It will raise exception if the object contains non-utf8 characters. Will MessagePackSerializer solve this problem?
In my opinion, I prefer to have this in multilang module rather than core. Upgrading storm cluster is not convenient.
Will MessagePackSerializer solve this problem?
It will if this code is updated to use the latest version of the msgpack-core library. As I mentioned in a previous comment:
This would be much more useful if it were implemented using a newer version of the msgpack-core library—they changed the name after 0.6.12—because 0.7 and above supports the
BINARYformat, which lets you send arbitrary bytes. Without that, you won't be able to send tuples containing arbitrary bytes with this serializer. This is also a problem with the JSON serializer (because JSON strings can't contain non-Unicode characters), but it would be great if we didn't have the problem here.
Thats great. I hope this pull request can be accepted and released soon.
So is there any update about merging this pull request?
Was chatting with @roshannaik and @dan-blanchard today, and this PR came up.
Someone on Storm team may benefit from taking a look at this as part of the performance revisions being done for Storm 2.0. As mentioned in #428 in streamparse -- the community project for running Python multi-lang topologies with Storm -- getting this merged somewhere in the Storm codebase would open up the possibility to switch serializer from Kryo & JSON to msgpack throughout, which would speed up multi-lang use cases considerably. This PR includes a pure Java implementation of a msgpack serializer, as well as pointers to the right msgpack library in the Java community; it just needs to be reviewed, tested, and merged.