django-imagekit icon indicating copy to clipboard operation
django-imagekit copied to clipboard

results of source_name_as_path changed between Python 2 and 3, isn't it?!?

Open jedie opened this issue 6 years ago • 7 comments

I updated a old project and among many other updates, python v2 is replaced by v3.

The problem is, the hash in URLs are changed. I use the default cache file namer:

IMAGEKIT_SPEC_CACHEFILE_NAMER = 'imagekit.cachefiles.namers.source_name_as_path'

It seems that imagekit.cachefiles.namers.source_name_as_path will return other urls because of different results from imagekit.hashers.CanonicalizingPickler

Think the problem is the changed pickle protocols from v2 to v3, see:

https://docs.python.org/3/library/pickle.html#pickle-protocols

EDIT: Now i see, that protocol=0 is used. Then nothing should change.

jedie avatar Jun 26 '18 09:06 jedie

No it's related to pickle... I made a independent test script:

from __future__ import print_function

import sys
from copy import copy
from hashlib import md5
from io import BytesIO
from pickle import DICT, MARK

try:
    from pickle import _Pickler
except ImportError:
    from pickle import Pickler as _Pickler


class CanonicalizingPickler(_Pickler):
    dispatch = copy(_Pickler.dispatch)

    def save_set(self, obj):
        rv = obj.__reduce_ex__(0)
        rv = (rv[0], (sorted(rv[1][0]),), rv[2])
        self.save_reduce(obj=obj, *rv)

    dispatch[set] = save_set

    def save_dict(self, obj):
        write = self.write
        write(MARK + DICT)

        self.memoize(obj)
        self._batch_setitems(sorted(obj.items()))

    dispatch[dict] = save_dict


def pickle(obj):
    file = BytesIO()
    CanonicalizingPickler(file, 0).dump(obj)
    value = file.getvalue()
    print("pickled to:", repr(value))
    return md5(value).hexdigest()


def test():
    print("Python v", sys.version.split(" ", 1)[0])

    hash = pickle(['photos/thumbnails/bulldog.jpg', ["foo", "bar"], 'jpeg', {'quality': 90}, True])

    print("Hash: %r" % hash)


if __name__ == "__main__":
    test()

results are:

Python v 3.6.3 pickled to: b'(lp0\nVphotos/thumbnails/bulldog.jpg\np1\na(lp2\nVfoo\np3\naVbar\np4\naaVjpeg\np5\na(dp6\nVquality\np7\nL90L\nsaI01\na.' Hash: '11348671166dac1eabca9232dc795557'

Python v 2.7.14 pickled to: "(lp0\nS'photos/thumbnails/bulldog.jpg'\np1\na(lp2\nS'foo'\np3\naS'bar'\np4\naaS'jpeg'\np5\na(dp6\nS'quality'\np7\nI90\nsaI01\na." Hash: 'f3730f70e2df9f188e4522ead475e243'

the pickled values: (lp0\nVphotos/thumbnails/bulldog.jpg\np1\na(lp2\nVfoo\np3\naVbar\np4\naaVjpeg\np5\na(dp6\nVquality\np7\nL90L\nsaI01\na. (lp0\nS'photos/thumbnails/bulldog.jpg'\np1\na(lp2\nS'foo'\np3\naS'bar'\np4\naaS'jpeg'\np5\na(dp6\nS'quality'\np7\nI90\nsaI01\na.

jedie avatar Jun 26 '18 10:06 jedie

In Python 3 default str is unicode and in Python2 it is bytes. I'm not sure if this is related or not.

I understand that it would be good if after upgrading from Python 2 to Python 3 to still use the same cached images, but at the end they are still cache which can be cleaned and new cache files will be generated and everything will still works.

If we currently change the logic for getting the hash in Python 3 to match Python 2 then we will force all current Python 3 users to regenerate their cache when they upgrade ImageKit next time (which is what is needed now if you migrate from Python 2 to Python 3).

The move from Python 2 to 3 is big enough and such things can be expected. If you have a better plan I will be glad to hear/read it and probably receive a PR.

vstoykov avatar Jun 28 '18 08:06 vstoykov

If you have a better plan I will be glad to hear/read it

No, sorry ;)

In the end, i have generated all images... But it took a long time... :)

I think the only thing to do is to document this fact, isn't it?

jedie avatar Jun 28 '18 08:06 jedie

Yes probably is a good idea this fact to be documented.

Did you have time for PR?

vstoykov avatar Jun 28 '18 11:06 vstoykov

In Python 3 default str is unicode and in Python2 it is bytes. I'm not sure if this is related or not.

Yes, I believe this is the root cause: The implementation of pickle in Python 3 always pickles strings as pickle protocol type 'UNICODE'. Which I suppose makes plenty of sense for Python 3..

Ref: Python 3.7 impl, Python 2.7 impl.

I think it is a mistake to use a python-internal protocol (pickle) for this, but as existing users we're stuck with hashes based on it..

I understand that it would be good if after upgrading from Python 2 to Python 3 to still use the same cached images, but at the end they are still cache which can be cleaned and new cache files will be generated and everything will still works.

This is a dangerously subtle bug: The entire cache is invalidated when switching Python runtimes, and for some sites, regenerating images may be extremely costly or even infeasible. Image a site with 1000s of images (or more). When using the Optimistic backend, all images break.

Working on a hack solution to replace the Pickle save_string implementation in Py3..

mik3y avatar Feb 16 '19 23:02 mik3y

Same thing for python3.6 vs python3.7+ in 3.7 small numbers are pickling as I_number_ while in 3.6 they are L_number_L

image

roman-karpovich avatar Jan 14 '20 10:01 roman-karpovich

Due to the number of images in our app (> 4M) it would have taken many days to re-generate images. I tried a number of times, and it failed. Therefore to get our project upgraded to Python 3 ASAP I have implemented a hacked solution which I'll share here in-case it helps anybody else - but it's probably not the best way forwards.

What I learnt digging about: Basically the cache hash generated by hashed the pickle of the source file name and the processors, format, options, and autoconvert. Because the result of pickling between python 2 and 3 produces different results, the generated hash changes.

Pickling is used so that cache files are regenerated if the ImageSpec changes. This will already cause an issue for our project since we are using the optimistic method which assumes the cached files are always generated. Therefore, if we change a spec even slightly, we would need to run generateimages again. Therefore, since we already had that drawback, we decided we could just get away with adding a version string for hashing rather than pickling.

So, the solution was the generate a version string by pickling the Spec in Python 2/Django 1.11, and changing the behaviour of get_hash() using that bytestring. This means the hash is the same going forward to Python 3 / Django 2.2+.

Yes, I know that we could have stopped using Optimistic and switched to a nonvolatile cache backend. But this would have led to unacceptable performance for every time an image is viewed for the first time.

Here's the code I added to our Python 2 / Django 1.11 branch to export the pickle templates

imagekitextras/management/commands/exportpickledimagegenerators.py

from django.core.management.base import BaseCommand
from myapp.imagegenerators import ImageSpec2, ImageSpec2
from imagekit.lib import StringIO
from imagekit.hashers import CanonicalizingPickler
from hashlib import md5

def pickle(obj):
    file = StringIO()
    CanonicalizingPickler(file, 0).dump(obj)
    return file.getvalue()

class Command(BaseCommand):
    help = 'Extensions to generate images'

    specs = [
        ('spec1', ImageSpec1),
        ('spec2, ImageSpec2),
    ]

    def handle(self, *args, **options):
        for name, spec in self.specs:
            to_pickle = [
                u'__IMAGE_FILE_NAME__',
                spec.processors,
                spec.format,
                spec.options,
                spec.autoconvert,
            ]
            print(name, pickle(to_pickle).replace('V__IMAGE_FILE_NAME__', '__IMAGE_FILE_NAME__'))

Then I used the output of these pickles in the Python 3 / Django 2.2 branch:

imagekitextras/py2compathasher.py

from hashlib import md5
from imagekit.hashers import CanonicalizingPickler
from imagekit.lib import StringIO


def pickle(obj):
    file = StringIO()
    CanonicalizingPickler(file, 0).dump(obj)
    return file.getvalue()


class Py2ImageSpecPickleMixin(object):
    """ A mixin for ImageSpecs that uses Python2/Django 1.11 pickling data """

    @classmethod
    def _get_pre_hash_pickle(cls, source_name):
        if not hasattr(cls, 'py2pickle'):
            raise ValueError('Classes inheriting from Py2ImageSpecPickleMixin must provide a py2pickle property')

        pickled_name = pickle(source_name)
        pickled_name = pickled_name[:0-len(b'\np0\n.')]
        pickled_val = cls.py2pickle.replace(b'__IMAGE_FILE_NAME__', pickled_name)
        return pickled_val

    def get_hash(self):
        if not hasattr(self, 'py2pickle'):
            raise ValueError('Classes inheriting from Py2ImageSpecPickleMixin must provide a py2pickle property')

        pickled_val = self._get_pre_hash_pickle(self.source.name)
        return md5(pickled_val).hexdigest()

myapp/imagegenerators.py

from imagekit import ImageSpec, register
from imagekit.specs.sourcegroups import ImageFieldSourceGroup
from pilkit.processors.base import MakeOpaque
from pilkit.processors.resize import ResizeToFit, ResizeToFill

from imagekitextras.py2compathasher import Py2ImageSpecPickleMixin
from myapp.models import MyModel


class ImageSpec1(Py2ImageSpecPickleMixin, ImageSpec):
    processors = [ResizeToFit(377, 377, mat_color=(255, 255, 255, 0))]
    format = 'JPEG'
    options = {'quality': 60}
    py2pickle = b"(lp0\n__IMAGE_FILE_NAME__\np1\na(lp2\nccopy_reg\n_reconstructor\np3\n(cpilkit.processors.resize\nResizeToFit\np4\nc__builtin__\nobject\np5\nNtp6\nRp7\n(dp8\nS'anchor'\np9\nS'c'\np10\nsS'height'\np11\nI377\nsS'mat_color'\np12\n(I255\nI255\nI255\nI0\ntp13\nsS'upscale'\np14\nI01\nsS'width'\np15\nI377\nsbaaS'JPEG'\np16\na(dp17\nS'quality'\np18\nI60\nsaI01\na."

# ... Rest of file

I hope this helps somebody -- as it's taken a couple of days to arrive at this solution.

humphrey avatar Jun 03 '20 05:06 humphrey