pulsar-client-python icon indicating copy to clipboard operation
pulsar-client-python copied to clipboard

Pulsar Identity SerDe behaviour and documentation

Open baynes opened this issue 5 years ago • 5 comments
trafficstars

If you look at http://pulsar.apache.org/docs/en/functions-develop/#serde under the Python tab it says

"In Python, the default SerDe is identity, meaning that the type is serialized as whatever type the producer function returns." "You can use the IdentitySerde, which leaves the data unchanged. The IdentitySerDe is the default."

This strongly gives the impression that the default `IdentitySerDe1 does not change the message in any way. This is not the case -- it will attempt to convert incoming bytes to one of float, int, string and only leaves it as bytes when they fail. This can result in unexpected conversions (we have had binary data unexpectedly converted to string).

It also attempts the reverse on the function result. Fortunately this does not result in unexpected behaviour, though does lead to muddled/sloppy programming as people are careless with the type of return value.

There are options as how to correct this:

1: Fix the code so the IdentitySerDe is just that - it leaves the bytes unchanged. One could then have a Paddington Bear SerDe (well intentioned and helpful but tends to get things wrong) which does what the existing IdentitySerDe does and also have FloatSerDe, IntSerDe and StringSerDe to cover the other cases reliably.

2: Change the documentation on the IdentitySerDe to explain what it really does and its dangers but leave it as the default. Introduce FloatSerDe, IntSerDe,StringSerDe and BytesSerDe to cover the cases reliably.

3: Fix the code so the IdentitySerDe is just that - it leaves the bytes unchanged. Also have FloatSerDe, IntSerDe and StringSerDe to cover the other cases reliably. Make StringSerDe the default on the guess this is the most common use case.

My preference would be option 1, but I suspect the installed code base would need option 2. 3 is a sort of compromise.

baynes avatar Oct 22 '20 14:10 baynes

Option 1 should be preferred. Are you interested in contributing a bug fix?

sijie avatar Oct 22 '20 17:10 sijie

Option 1 should be preferred. Are you interested in contributing a bug fix?

Maybe. But It won't be for until after Christmas before I have any time to try. I am happy for someone else to take it on if they want to.

baynes avatar Dec 02 '20 13:12 baynes

Seems an issue still relevant to this code snippet:

https://github.com/apache/pulsar-client-python/blob/75a57b427d4c6944c49f4b712344107b5444aa36/pulsar/functions/serde.py#L75-L94

Transferred.

Also, ping @baynes I believe this can be a good first issue :)

tisonkun avatar Dec 11 '22 04:12 tisonkun

While I am interested in this, I don't quite get the point of implementing IdentitySerializer as such, is there any practical meaning to keep the bytes as-is? Is it for performance, inspection, compatibility, or other purposes I missed here?

If not, I would like to add a special mark in the serialized form of a bytes data, to tell the difference from the other data types.

frostming avatar Dec 12 '22 01:12 frostming

@tisonkun I think the IdentitySerDe should be called SimpleSerDe and the doc for it should clearly say it will pass (string, int, float, complex, and bytes) values as it is or raise an error for other types. We also need to handle each type separately instead of converting all into str

@frostming people can handle the raw bytes with more flexibility in their own business logic code and potentially higher performance.

nlu90 avatar Dec 21 '22 00:12 nlu90