
Access-controlled service for large static files

Open senderle opened this issue 5 years ago • 3 comments

Description

I'm proposing that a feature be added to serve large static files to authenticated users.

It might not be obvious why this is a problem. Here are some of the possible solution paths, and why they are blocked:

  1. Can't we use a static file service like whitenoise?

    • Whitenoise doesn't provide any kind of authentication or access control, and I'm not even sure it can handle very large files.
  2. Can't Django just serve the files through a FileResponse object?

    • FileResponse objects do a decent job of serving small and medium-sized files, but for very large files, problems arise. (In my case, when files get big enough, I hit a memory error.) It appears that if a given environment has a wsgi.file_wrapper defined, FileResponse objects may use that to efficiently serve access-controlled files. But that seems to require that Django be running on the same machine as the web server.
  3. Isn't there some kind of funky thing you can do with headers?

    • Yes! Or rather, there was when cookiecutter-django used Caddy. Caddy supported the X-Accel-Redirect header and could be configured similarly to nginx (as described here); a rough sketch of the Django side appears after this list. After the switch to Traefik, this approach no longer works, because Traefik is not a web server at all.
  4. Could you use AWS somehow?

    • Maybe. I haven't looked into this option carefully. But it seems like it would be very complicated to get right.
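
For reference, here's roughly what the Django side of the header approach (option 3) looked like. This is just a sketch: the view name, the /protected/ URL prefix, and the filename handling are illustrative, and the web server needs a matching internal-only location that maps that prefix onto the files on disk.

    from django.contrib.auth.decorators import login_required
    from django.http import HttpResponse


    @login_required
    def download(request, filename):
        # Django only checks access; the web server intercepts the
        # X-Accel-Redirect header and streams the file from an internal-only
        # location, so the file never passes through Django's memory.
        # (Filename validation omitted here for brevity.)
        response = HttpResponse()
        response["X-Accel-Redirect"] = f"/protected/{filename}"
        response["Content-Disposition"] = f'attachment; filename="{filename}"'
        del response["Content-Type"]  # let the web server set it from the file
        return response

With nginx this is the internal/alias location trick; Caddy had an equivalent internal directive. Traefik has nothing comparable, which is exactly the gap I'm describing.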

How should it be implemented? I don't know. This is where I am stuck, and would welcome discussion. I posted a question on Stack Overflow and got crickets; if you see a way around this that doesn't require a pull request, please feel free to answer there.

Rationale

In a sense, this is not a "feature" but a fix. The change from Caddy to Traefik arguably broke functionality that was working pretty well before.

What it really means for me, concretely, is this: now that I want to do something similar with a new app, I can't use cookiecutter-django without a fairly elaborate and awkward reconfiguration -- something like standing up an nginx container between the django service and the traefik service. If that's the only option, my instinct is to not use cookiecutter-django at all. I probably don't need all the things, and the configuration work will wind up being about the same either way. And maybe that's fine; this could just be a "It might not be what you want" situation.

But I'm proposing the alternative narrative that this would actually fix something that worked before and now is broken. I don't honestly imagine that there are that many people doing what I'm doing, and so I can't argue that you will lose a bunch of users over this. It's just kind of annoying that it used to be easy, and now is hard.

Use case(s) / visualization(s)

Here's my use case: I am developing new apps for researchers at the University of Pennsylvania doing large-scale statistical text analysis in multiple different departments. I need to be able to automatically distribute copyright-protected data to authorized users in bulk, without risking leaking the data.

senderle avatar Jun 06 '19 14:06 senderle

I need to be able to automatically distribute copyright-protected data to authorized users in bulk, without risking leaking the data.

Are you positive these should be distributed using static files and not media files instead? It sounds like this data would be uploaded by your application users to a FileField or ImageField rather than tracked in version control like your code base. django-storages provides some options to restrict access.

browniebroke avatar Jun 06 '19 16:06 browniebroke

The data is generated by a crawling process and aggregated into large zip files that the user then downloads. There's no uploading involved. (It's also not tracked in version control.)

But if there are ways to restrict access to the files using some other mechanism that I haven't mentioned above, I'm all ears! It just has to be able to efficiently handle multi-gigabyte files.

senderle avatar Jun 06 '19 18:06 senderle

The data is generated by a crawling process and aggregated into large zip files that the user then downloads. There's no uploading involved.

Ok, so when I have to do that type of thing, for me there is an "upload" involved at some point, not from a user, but from the crawling process. Here is how I usually handle this (assuming I'm on a Docker-based config):

  • Create a storage class to store files as private in an AWS S3 bucket:

    from storages.backends.s3boto3 import S3Boto3Storage

    class PrivateStorage(S3Boto3Storage):
        default_acl = 'private'
        file_overwrite = False
        bucket_name = 'my-private-bucket'
    

    More options are detailed in the documentation. You can use AWS_QUERYSTRING_AUTH and AWS_QUERYSTRING_EXPIRE to control access to your files; a minimal settings sketch appears after this list.

  • Create a model with a FileField that will be used to represent these files in my application, using this private storage, for example:

    from django.db import models

    class LargeZipFile(models.Model):
        name = models.CharField(max_length=50)
        zip = models.FileField(
            storage=PrivateStorage()
        )
    
  • Use Celery to generate the data, and when files are ready, create instances of LargeZipFile to upload the files.
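
A minimal settings sketch for the signed-URL part, assuming the PrivateStorage class above (the values are illustrative):

    # django-storages settings for expiring signed URLs (illustrative values)
    AWS_QUERYSTRING_AUTH = True    # generated URLs carry a signed querystring
    AWS_QUERYSTRING_EXPIRE = 600   # signed URLs stop working after 10 minutes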

Each time a user wants to download a file, your application exposes the LargeZipFile.zip.url on some page, which will have query parameters giving access for a short amount of time.
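
For example, a small view along these lines (just a sketch, names are illustrative) checks the user and hands back the short-lived signed URL:

    from django.contrib.auth.decorators import login_required
    from django.shortcuts import get_object_or_404, redirect


    @login_required
    def download_zip(request, pk):
        large_zip = get_object_or_404(LargeZipFile, pk=pk)
        # .url is a presigned S3 URL because the storage is private and
        # querystring auth is enabled; it expires after AWS_QUERYSTRING_EXPIRE.
        return redirect(large_zip.zip.url)

The download itself then goes straight from S3 to the user, so Django never has to stream the multi-gigabyte file.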

That being said, I don't know how your server is deployed at the University of Pennsylvania; it might be on a dedicated, non-cloud server. I don't know what your storage options are, but if AWS is not suitable, DigitalOcean might be: it has an S3-compatible API, which is supported by django-storages.

It just has to be able to efficiently handle multi-gigabyte files.

A word of warning: serving multi-gigabyte files this way could generate significant bandwidth costs from Amazon.

browniebroke avatar Jun 10 '19 08:06 browniebroke