Willow icon indicating copy to clipboard operation
Willow copied to clipboard

Make image format detection and `SvgImageFile` initialisation more robust

Open zerolab opened this issue 11 months ago • 8 comments

https://pastebin.com/jNWdCZ2N

File "/usr/local/lib/python3.8/site-packages/willow/image.py", line 313, in __init__
cms-app-6b79f8884b-tmjzb cms-app     self.dom = ElementTree.parse(f)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/site-packages/defusedxml/ElementTree.py", line 130, in parse
cms-app-6b79f8884b-tmjzb cms-app     return _parse(source, parser)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
cms-app-6b79f8884b-tmjzb cms-app     tree.parse(source, parser)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", line 601, in parse
cms-app-6b79f8884b-tmjzb cms-app     parser.feed(data)
cms-app-6b79f8884b-tmjzb cms-app   File "/usr/local/lib/python3.8/xml/etree/ElementTree.py", 

Update: the issue seems to be with https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L82-L99 (and specifically https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87) where the user is uploading a PNG (allegedly)

zerolab avatar Mar 08 '24 14:03 zerolab

@zerolab It would be helpful to have an image file that causes this bug.

Stormheg avatar Mar 08 '24 18:03 Stormheg

Yeah.. I have been trying.. https://wagtailcms.slack.com/archives/C81FGJR2S/p1709908265029289

I suspect they are uploading images without an extension, and then there's potentially something with the system that filetype cannot determine the file so Willow thinks it may be svg-ish

zerolab avatar Mar 08 '24 18:03 zerolab

OK, so finally got confirmation. It was images without an extension, so https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87 is kicking in

zerolab avatar Mar 12 '24 10:03 zerolab

Honestly - does it matter if we have a file that fails?

I thought he mentioned there was a file extension visible in the database. (his examples also show extension)

We should probably fallback on extensions after checking the mimetype; and then try to guess if it's SVG, no? Seems like a generally good improvement to me.

Nigel2392 avatar Mar 12 '24 13:03 Nigel2392

We have a custom model inheriting from AbstractImage:

class Asset(AbstractImage):
    file = models.ImageField(
        verbose_name=_('file'),
        upload_to=settings.ASSET_UPLOAD_PREFIX,
        width_field='width',
        height_field='height',
        storage=MediaStorage()
    )

MediaStorage here is using django-gcloud-storage

When I upload an image I get the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

Which is weird as I am uploading a png or jpeg image.I tried different images on different environments so this issue is not arising from a corrupted image file.

After further investigation and @zerolab and @Nigel2392 think the problem is arising when willow is guessing the wrong file type /extension in https://github.com/wagtail/Willow/blob/830aa3d386fd2ef2aa48c8032ba7d62bf2e04fc1/willow/image.py#L86-L87

The code blame shows that this part of the code was added last year. I am upgrading from Wagtail 4.2 to Wagtail 5, which would explain why it never occurred before.

When I run the following:

import filetype
from wagtail.images import get_image_model

Image = get_image_model()

image = Image.objects.get(file="dev/chair-unsplash.jpg")
with image.open_file() as image_file:
    ext = filetype.guess_extension(image_file)
    print(f"Image {image.pk} {image.file} has extension {ext}")

I get Image 149277 dev/chair-unsplash.jpg has extension None

My guess is that it is unable to guess the correct file type for the BLOB when opening a file from gcloud storage.

nqcm avatar Mar 12 '24 13:03 nqcm

I still get the issue even if the extension column in DB is jpeg. The file itself has the extension as well.

nqcm avatar Mar 12 '24 13:03 nqcm

Hi, I am doing some more investigation and here's my findings

In [4]: image = Image.objects.get(file="cms-dev/chair-unsplash_TAJSD0d.jpg")

In [5]: image.file
Out[5]: <ImageFieldFile: cms-dev/chair-unsplash_TAJSD0d.jpg>

When I try to get the path I get unimplemented error, which is understandable as it is being read from Gcloud into a temp fle:

In [11]: image.file.path
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[11], line 1
----> 1 image.file.path

File /usr/local/lib/python3.10/site-packages/django/db/models/fields/files.py:59, in FieldFile.path(self)
     56 @property
     57 def path(self):
     58     self._require_file()
---> 59     return self.storage.path(self.name)

File /usr/local/lib/python3.10/site-packages/django/core/files/storage.py:128, in Storage.path(self, name)
    122 def path(self, name):
    123     """
    124     Return a local filesystem path where the file can be retrieved using
    125     Python's built-in open() function. Storage systems that can't be
    126     accessed using open() should *not* implement this method.
    127     """
--> 128     raise NotImplementedError("This backend doesn't support absolute paths.")

NotImplementedError: This backend doesn't support absolute paths.
  1. When using the open_file method I get the extension is None:
In [21]: with image.open_file() as image_file:
    ...:     ext = filetype.guess_extension(image_file)
    ...:     print(f"Image {image.pk} {image.file} has extension {ext}")
    ...:
[24/Mar/2024 11:28:35] | urllib3.connectionpool._get_conn:292 | DEBUG | Resetting dropped connection: storage.googleapis.com
[24/Mar/2024 11:28:35] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?projection=noAcl&prettyPrint=false HTTP/1.1" 200 839
[24/Mar/2024 11:28:36] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /download/storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?generation=1710246298001500&alt=media HTTP/1.1" 200 323686
Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension None
  1. When I open it using default_storage.open method I am getting the extension correctly:
In [22]: with default_storage.open(image.file.name, 'rb') as file:
    ...:     extension = filetype.guess_extension(file.read())
    ...:     print(f"Image {image.pk} {image.file} has extension {extension}")
    ...:
[24/Mar/2024 11:32:10] | urllib3.connectionpool._get_conn:292 | DEBUG | Resetting dropped connection: storage.googleapis.com
[24/Mar/2024 11:32:11] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?projection=noAcl&prettyPrint=false HTTP/1.1" 200 839
[24/Mar/2024 11:32:11] | urllib3.connectionpool._make_request:549 | DEBUG | https://storage.googleapis.com:443 "GET /download/storage/v1/b/staging/o/cms-dev%2Fchair-unsplash_TAJSD0d.jpg?generation=1710246298001500&alt=media HTTP/1.1" 200 323686
Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension jpg
  1. I also get the extension if I do the following:
storage = image.file.storage

with storage.open(image.file.name, 'rb') as file:
    extension = filetype.guess_extension(file.read())
    print(f"Image {image.pk} {image.file} has extension {extension}")

# Image 149277 cms-dev/chair-unsplash_TAJSD0d.jpg has extension jpg

Does that mean that the open_file method is not reading the file correctly?

nqcm avatar Mar 13 '24 10:03 nqcm

Just an update, I was able to solve the issue by making changes to the storage class. The package I was using was downloading the image file from Gcloud and using a SpooledTemporaryFile to store it temporarily. See the code.

I changed it o use a NamedTemporaryFile and willow's open_file method started to read the mime type correctly.

May be someone can add compatibility in Willow for SpooledTemporaryFile.

But thanks for all your time and help. Really appreciate it.

nqcm avatar Apr 04 '24 12:04 nqcm