django-scarface icon indicating copy to clipboard operation
django-scarface copied to clipboard

Incorrect use of sys.getsizeof in trim_message

Open roosmaa opened this issue 7 years ago • 2 comments

In platform_strategy.py:

    def trim_message(self, message):
        import sys
        trim_length = SCARFACE_DEFAULT_MESSAGE_TRIM_LENGTH
        if hasattr(settings, 'SCARFACE_MESSAGE_TRIM_LENGTH'):
            trim_length = settings.SCARFACE_MESSAGE_TRIM_LENGTH

        if sys.getsizeof(message) > trim_length:
            while sys.getsizeof(message) > trim_length:
                message = message[:-3]
            message += '...'
        return message

Usage of sys.getsizeof is fundamentally incorrect to measuring message length. getsizeof returns the internal size of the Python object, not the actual encoded size of the final message. TL;DR; it kind of works for ascii, but utterly breaks down for unicode strings.

Here's a quick demonstration from an interactive shell:

>>> sys.getsizeof('') # Basic empty string takes up 49 bytes
49
>>> sys.getsizeof('For ascii string it kind of works, but not really')
98
>>> sys.getsizeof('Let\'s add some ünicode to themix, 👌🏻')
220
>>> # Quickly the sizeof spikes, as it is the internal representation after all.
>>> # Let's try again by not relying on how the string is implemented behind the
>>> # scenes. Instead we encode the string to a byte array, which most likely
>>> # has less magic and grows in size more predictably.
>>> sys.getsizeof('For ascii string it kind of works, but not really'.encode('utf8'))
82
>>> sys.getsizeof('Let\'s add some ünicode to themix, 👌🏻'.encode('utf8'))
76
>>> # Much better, but in this context, getsizeof really shouldn't be used,
>>> # instead we should be counting the bytes that will be actually used up
>>> # when sending the pushes.
>>> len('For ascii string it kind of works, but not really'.encode('utf8'))
49
>>> len('Let\'s add some ünicode to themix, 👌🏻'.encode('utf8'))
43

roosmaa avatar May 03 '17 09:05 roosmaa