
Performance of the url() method

Open avorio opened this issue 5 years ago • 19 comments

Hi there,

I'm wondering if there's a performance opportunity around the url() method used in templates, views, etc., to get the full URL of an Image stored in, say, Amazon S3. It seems to me that the storage engine will do much more than simply put together the file name and MEDIA_URL.

When I'm dealing with publicly-accessible items on S3, would I be better off simply putting those two values together instead of calling the url() method?

I'm using Django REST Framework and need to list 100+ images' URLs in one single request, which is currently heavy via the url() method — that was my motivator to try accessing these values directly.

What am I missing?

Many thanks!

avorio avatar Apr 22 '19 20:04 avorio

Anyone?

avorio avatar May 03 '19 19:05 avorio

@avorio I typically use Google Cloud Storage, which has been optimized for public files. Unless you tell the storage backend that all the files are going to be public, it has to make a round trip to AWS to check permissions, so it is going to be slow.

sww314 avatar May 03 '19 19:05 sww314

I've experienced this same issue recently (mostly due to generating different image sizes and thumbnails); endpoints were significantly (~10x) slower regardless of any prefetch_related shenanigans attempted.

As a quick fix I've simply implemented custom serializers to build the URLs with the bucket url + file names, e.g., f'{settings.DEFAULT_BUCKET_URL}/{instance.file.name}'. The next step was extending the S3Boto3Storage class and overriding its url(...) method.

Building the URL straight on the serializer ended up in better performance than overriding the S3Boto3Storage class though.
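For reference, the quick fix described above boils down to pure string formatting. This is an illustrative sketch, not the thread's exact code: DEFAULT_BUCKET_URL is a hypothetical setting, and in DRF the function body would live in a SerializerMethodField getter.

```python
DEFAULT_BUCKET_URL = "https://my-bucket.s3.amazonaws.com"  # illustrative value

def get_file_url(file_name):
    # No storage backend, no signing, no network round trip:
    # just join the public bucket URL and the stored file name.
    return f"{DEFAULT_BUCKET_URL}/{file_name}"
```

This only works for objects that are publicly readable; anything requiring a signed querystring still needs the storage backend.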

henriquesalvaro avatar May 23 '19 02:05 henriquesalvaro

I was having trouble figuring this one out. I came up with a solution using something similar.

To make this easier for anyone else I created a package for pip with these mods: https://pypi.org/project/sorl-thumbnail-jdmueller/

And just add these to settings.py:

MEDIA_URL = "https://" + os.environ.get("AWS_STORAGE_BUCKET_NAME", "bucket-name") + ".s3.amazonaws.com/"
THUMBNAIL_FAST_URL = True

jdmueller avatar Aug 09 '19 06:08 jdmueller

In its current implementation, it seems that specifying AWS_S3_CUSTOM_DOMAIN (set to the bucket's domain name) skips any unnecessary communication and basically returns the formatted URL immediately, while still respecting the custom location and file name normalization.
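A settings.py sketch of that setup (the bucket name is illustrative, and this is only appropriate for publicly readable objects):

```python
AWS_STORAGE_BUCKET_NAME = "my-bucket"  # illustrative value

# With a custom domain set, the S3 backend formats the URL locally
# instead of asking boto3 to presign it.
AWS_S3_CUSTOM_DOMAIN = f"{AWS_STORAGE_BUCKET_NAME}.s3.amazonaws.com"

# Don't append signature query parameters to generated URLs.
AWS_QUERYSTRING_AUTH = False
```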

wodCZ avatar Oct 07 '19 09:10 wodCZ

This is a naive implementation of how we can bypass any checks to gcloud, for example. We can also avoid even initializing the gcloud storage backend this way.

from django.core.exceptions import SuspiciousOperation
from django.core.files.storage import Storage
from django.utils.encoding import smart_str
from storages.utils import clean_name, safe_join, setting

class CloudStorage(Storage):
	bucket_path = setting('GS_BUCKET_PATH')
	file_name_charset = setting('GS_FILE_NAME_CHARSET', 'utf-8')
	location = setting('GS_LOCATION', '')
	settings = None

	@property
	def manager(self):
		if self._manager is None:
			from storages.backends.gcloud import GoogleCloudStorage
			self._manager = GoogleCloudStorage(**self.settings)
		return self._manager

	def __init__(self, **settings):
		self.settings = settings
		self._manager = None

	def _normalize_name(self, name):
		try:
			return safe_join(self.location, name)
		except ValueError:
			raise SuspiciousOperation("Attempted access to '%s' denied." % name)

	def _encode_name(self, name):
		return smart_str(name, encoding=self.file_name_charset)

	def url(self, name):
		if self.bucket_path is not None:
			name = self._encode_name(self._normalize_name(clean_name(name)))
			full_name = self.bucket_path + name
			return full_name
		return self.manager.url(name)

	def _open(self, name, mode='rb'):
		return self.manager._open(name, mode)
	def _save(self, name, content):
		return self.manager._save(name, content)
	def delete(self, name):
		return self.manager.delete(name)
	def exists(self, name):
		return self.manager.exists(name)
	def listdir(self, name):
		return self.manager.listdir(name)
	def size(self, name):
		return self.manager.size(name)
	def get_modified_time(self, name):
		return self.manager.get_modified_time(name)
	def get_created_time(self, name):
		return self.manager.get_created_time(name)
	def get_available_name(self, name, max_length=None):
		return self.manager.get_available_name(name, max_length)
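The core trick in the snippet above is a lazy proxy: the real GoogleCloudStorage is only constructed when a method other than url() actually needs it. A minimal stdlib sketch of that pattern (names are illustrative):

```python
class LazyBackend:
    """Defer building an expensive backend until it is actually needed."""

    def __init__(self, factory):
        self._factory = factory  # zero-arg callable that builds the backend
        self._backend = None

    @property
    def backend(self):
        # Constructed on first access only; fast paths that never touch
        # this property never create the expensive client at all.
        if self._backend is None:
            self._backend = self._factory()
        return self._backend

# Demonstration: the factory runs once, and only on first access.
calls = []
proxy = LazyBackend(lambda: calls.append("built") or "real-backend")
assert calls == []                       # nothing built yet
assert proxy.backend == "real-backend"   # built on demand
assert proxy.backend == "real-backend"   # reused thereafter
assert calls == ["built"]
```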

pmabres avatar Mar 02 '20 02:03 pmabres

@jdmueller I do not understand how your solution works; could you explain, please?

It looks like you use FileSystemStorage to form the URLs, but I am not sure why this should work: isn't that a different Storage entirely?

https://github.com/jdmueller/sorl-thumbnail-jdmueller/commit/0f4d1f787d7f29647965a8d3ff3cad92dccb45f9#diff-e3d22d454c3df40ff35f95407205e56def3713e2dd9351768179f406cd2b2cf1L83

uhbif19 avatar Apr 21 '21 14:04 uhbif19

Since my app has not only public but also private files, I just added some caching for the URLs.

The solution is not properly tested, but it works and makes the endpoints fast.

from django.core.cache import cache

from storages.backends.s3boto3 import S3Boto3Storage


class S3Storage(S3Boto3Storage):
    """Storage where signed URLs are cached."""

    def url(self, name, *args):
        normalized_name = self._normalize_name(self._clean_name(name))
        cached_url = cache.get(normalized_name)
        if cached_url is not None:
            return cached_url
        else:
            url = super().url(name, *args)
            cache.add(normalized_name, url, self.querystring_expire)
            return url

uhbif19 avatar Apr 21 '21 14:04 uhbif19

@uhbif19 the caching is a good solution. However, you want to cache the URLs for less time than they are valid. Your example code as written can return a URL that will expire before it is used.

I typically cache for half the life of the URL: if the URLs are valid for 24 hours, cache the link for 12. The actual values depend on your use case.
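A sketch of that rule on top of a plain dict cache (illustrative only; in Django you would pass the shorter TTL to cache.add instead):

```python
import time

LINK_LIFETIME = 24 * 3600       # how long the signed URL stays valid
CACHE_TTL = LINK_LIFETIME // 2  # cache for half the link's life

_cache = {}  # name -> (url, cache_expiry_timestamp)

def cached_url(name, sign):
    """Return a signed URL, re-signing only when the cached one is stale."""
    now = time.time()
    hit = _cache.get(name)
    if hit is not None and hit[1] > now:
        return hit[0]
    url = sign(name)
    _cache[name] = (url, now + CACHE_TTL)
    return url

# Demonstration with a fake signer: the second call is served from cache.
signs = []
def fake_sign(name):
    signs.append(name)
    return f"https://example.com/{name}?sig=abc"

assert cached_url("a.png", fake_sign) == cached_url("a.png", fake_sign)
assert signs == ["a.png"]  # signed exactly once
```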

sww314 avatar Apr 21 '21 20:04 sww314

@sww314

Thanks for the comment. Yes, I planned this, as well as some cache warming.

Why would that 12-hour difference be useful? I thought it would be enough to add a couple of seconds to compensate for the gap between the time the link was generated and the time it was added to the cache.

uhbif19 avatar Apr 22 '21 04:04 uhbif19

It depends on your use case on the client side. If you are simply rendering images in the browser, it may not matter. However, if you are displaying a table of links for the user to click on, you are creating a hard-to-debug case where the user is sometimes given links that are already expired, or that expire by the time the user clicks on them.

In the extreme case, the server generates the links at midnight (00:00:00). The link expiration and cache TTL are both set to 24 hours. The user loads the page at 23:59:59; by the time the link is used, it will almost certainly be expired. If you set your cache time-to-live to a number much smaller than the link's time-to-live, you will always send valid links.
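The midnight example as arithmetic (illustrative numbers):

```python
LINK_LIFETIME = 24 * 3600       # signed URL valid for 24 hours
CACHE_TTL = LINK_LIFETIME // 2  # cache entry expires after 12 hours

# Worst case: a link cached at t=0 is served just before the cache entry
# expires at t=CACHE_TTL. Even then, the link itself still has
# LINK_LIFETIME - CACHE_TTL seconds of validity remaining.
worst_case_validity_left = LINK_LIFETIME - CACHE_TTL
assert worst_case_validity_left == 12 * 3600
```

With equal TTLs that margin shrinks to zero, which is exactly the "user clicks at 23:59:59" failure described above.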

sww314 avatar Apr 22 '21 14:04 sww314

Hey, we face a similar issue of bad performance when trying to get URLs for a lot of objects at once. After checking what happens, Sentry shows us that we retrieve a URL for every object, and for every such object we go to CloudFront and make a HEAD request. That is the case even though we have turned off the querystring_auth parameter. This is strange because, according to the version we are using, it shouldn't be making any requests at all and should just be forming the URL. Maybe there is something else going on under the hood that I am missing. It would be great if you could point me to where the HTTP HEAD requests might be happening.

But, overall, are there any ideas on how performance for a lot (50+) of objects in a single request can be improved? I am looking in the direction of these things:

  • https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-signed-cookies.html
  • https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-creating-signed-url-custom-policy.html

But I am not 100% sure such solutions are compatible with this library. If someone has already encountered such an issue and tackled it, it would be great if you could share!


Update: Just a quick update from my side – the performance problem was not caused by signing multiple objects. As wodCZ mentioned, the signing happens locally. The problem was that we were doing a whole round trip to get the dimensions of the image for each object.

0xRaduan avatar May 03 '21 12:05 0xRaduan

The code causing the requests is here.

Defining AWS_S3_CUSTOM_DOMAIN makes it run through the fast branch. I haven't tried it in combination with AWS_CLOUDFRONT_KEY_ID, but it looks like the URL is then signed locally using RSA, which should be faster than making an HTTP request for each file.

Edit: More specifically, I suspect this line:

params['Bucket'] = self.bucket.name

which runs this:

    @property
    def bucket(self):
        """
        Get the current bucket. If there is no current bucket object
        create it.
        """
        if self._bucket is None:
            self._bucket = self.connection.Bucket(self.bucket_name)
        return self._bucket

and that ends with the request here:


    @property
    def connection(self):
        connection = getattr(self._connections, 'connection', None)
        if connection is None:
            session = boto3.session.Session()
            self._connections.connection = session.resource(
                's3',
                aws_access_key_id=self.access_key,
                aws_secret_access_key=self.secret_key,
                aws_session_token=self.security_token,
                region_name=self.region_name,
                use_ssl=self.use_ssl,
                endpoint_url=self.endpoint_url,
                config=self.config,
                verify=self.verify,
            )
        return self._connections.connection

Although I'm not 100% sure - the bucket property should only be initialized once 🤔

Edit 2: nevermind, I got lost in the rabbit hole 🤦

wodCZ avatar May 03 '21 12:05 wodCZ

@Raduan77 Can you please explain how you made Sentry capture this?

After checking what happens, the sentry shows us that we are trying to retrieve an URL for every object.

uhbif19 avatar May 03 '21 12:05 uhbif19

@wodCZ, thanks for pointing it out. It seems we are already using AWS_CLOUDFRONT_KEY_ID and AWS_CLOUDFRONT_KEY, and it still performs the requests. Regarding your comment about the bucket: we have AWS_S3_CUSTOM_DOMAIN defined, so that shouldn't be the case, I believe, since, as you mentioned, it goes through the fast route and doesn't need to execute those lines.


@uhbif19, we haven't done anything specific. Sentry tracks every transaction by default in our service; we just followed the initial setup tutorial provided by Sentry. After that, you can check every transaction on the Performance page. I am adding a screenshot of how it looks. Screenshot 2021-05-03 at 15 42 24

0xRaduan avatar May 03 '21 12:05 0xRaduan

I noticed today I am getting the same effect. In my case at first I noticed in https://scoutapm.com/ that there are HEAD requests going out for each one of my images in a template:

Screen Shot 2022-01-07 at 5 52 33 PM

Interestingly, I also have Sentry installed (but I was only using it for errors), so I also decided to check its Performance tab, and I can also see the HEAD requests. It shows a bit more than @Raduan77's screenshot; perhaps a newer version has more details. I can see it's calling aws.s3.HeadObject.

I edited the url method of s3boto3.py to add a few prints, and I can confirm I have a custom_domain and it's returning the URL early. So something else is making the call; I think this rules out the .url method. I am still not sure what's generating the HEAD requests. There's some documentation in boto3 saying HeadObject will be called when filtering buckets, but I don't see django-storages doing any of that.

Will try debug this more later.

inerte avatar Jan 08 '22 02:01 inerte

Okay, maybe storages is doing it indirectly through Django's BaseStorage; the boto3 docs on Filtering say:

Behind the scenes, the above example will call ListBuckets, ListObjects, and HeadObject many times.

Looking for where Prefix is used in storages, it's in listdir: https://github.com/jschneier/django-storages/blob/cf50a22d76f1ab3294e85c0a753a7c369663b293/storages/backends/s3boto3.py#L483

But storages never calls listdir itself, and I can't find exactly where Django itself would call the storage's listdir.

So that's my best guess for now: somehow storages' listdir is being called and is filtering buckets, which makes boto3 call HeadObject. But I am really having a hard time reproducing any of this as of now. Hopefully I will have more later.

inerte avatar Jan 08 '22 17:01 inerte

Alright! I have more data. It's not url (as suspected before), it's .height (or .width) on an image. Check the stack trace: Django's ._get_image_dimensions ends up calling storages' .open, which invokes boto3, and _make_api_call from botocore/client.py gets called with HEAD:

Screen Shot 2022-01-09 at 3 37 15 PM

The solution is simple: don't call .height or .width directly on an ImageField; set its height_field and width_field and always use those instead. Interestingly, I did have those two fields, but was mistakenly calling the ImageField attributes.

Sorry for so many messages; I really wanted to fix this. I could feel some of my pages being slow and couldn't figure out why. Once I saw the HEAD requests going out, it led me to this thread and down the rabbit hole. My site feels snappy now: a page with 5 images was taking ~2.5 seconds to load, and now ~600ms. Yay!

inerte avatar Jan 10 '22 00:01 inerte

I'm not sure if this is the same issue, but I'm running into an issue where the ImageField's update_dimension_fields is being called hundreds of times on a single request:

Screenshot 2023-02-26 at 6 46 31 PM

I've tried to avoid calling instance.image.url and instead have opted to call instance.image.name and build the url manually as suggested, but it doesn't seem to be working.

Any ideas if this is related to the same thing?

aaronn avatar Feb 27 '23 02:02 aaronn