cookiecutter-django
Access-controlled service for large static files
Description
I'm proposing that a feature be added to serve large static files to authenticated users.
It might not be obvious why this is a problem. Here are some of the possible solution paths, and why they are blocked:
- Can't we use a static file service like whitenoise?
  - Whitenoise doesn't provide any kind of authentication or access control, and I'm not even sure it can handle very large files.
- Can't Django just serve the files through a `FileResponse` object?
  - `FileResponse` objects do a decent job of serving small and medium-sized files, but for very large files, problems arise. (In my case, when files get big enough, I hit a memory error.) It appears that if a given environment has a `wsgi.file_wrapper` defined, `FileResponse` objects may use that to efficiently serve access-controlled files. But that seems to require that Django be running on the same machine as the web server. (A minimal sketch of this approach is shown after this list.)
- Isn't there some kind of funky thing you can do with headers?
  - Yes! Or rather, there was when cookiecutter-django used Caddy. Caddy supported the `X-Accel-Redirect` header, and could be configured similarly to nginx (as described here). After the switch to Traefik, this approach no longer works, because Traefik is not a web server at all. (See the second sketch after this list.)
- Could you use AWS somehow?
  - Maybe. I haven't looked into this option carefully. But it seems like it would be very complicated to get right.
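For context, the naive `FileResponse` approach looks roughly like this (a minimal sketch; the view name and file path are placeholders):

```python
from django.contrib.auth.decorators import login_required
from django.http import FileResponse


@login_required
def download_archive_naive(request, name):
    # Hypothetical location of a generated zip; real code would validate the
    # name and check per-object permissions.
    path = f"/data/exports/{name}.zip"
    # FileResponse streams in chunks, but unless the environment provides a
    # working wsgi.file_wrapper, the Django worker handles the whole transfer,
    # which is where multi-gigabyte files cause problems.
    return FileResponse(open(path, "rb"), as_attachment=True, filename=f"{name}.zip")
```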
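And here is roughly what the header-based approach looked like on the Django side when it worked (also a sketch; the `/protected/` prefix is an assumption, and it only works if the web server in front of Django serves that location internally):

```python
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse


@login_required
def download_archive_via_header(request, name):
    # Django only performs the access check; the response body stays empty.
    response = HttpResponse()
    # nginx (or Caddy, before the switch to Traefik) intercepts this header
    # and serves /protected/<name>.zip itself from an internal location.
    response["X-Accel-Redirect"] = f"/protected/{name}.zip"
    response["Content-Disposition"] = f'attachment; filename="{name}.zip"'
    return response
```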
How should it be implemented? I don't know. This is where I am stuck, and would welcome discussion. I posted a question on Stack Overflow and got crickets; if you see a way around this that doesn't require a pull request, please feel free to answer there.
Rationale
In a sense, this is not a "feature" but a fix. The change from Caddy to Traefik arguably broke functionality that was working pretty well before.
What it really means for me, concretely, is this: now that I want to do something similar with a new app, I can't use cookiecutter-django without a fairly elaborate and awkward reconfiguration -- something like standing up an nginx container between the django service and the traefik service. If that's the only option, my instinct is to not use cookiecutter-django at all. I probably don't need all the things, and the configuration work will wind up being about the same either way. And maybe that's fine; this could just be a "It might not be what you want" situation.
But I'm proposing the alternative narrative that this would actually fix something that worked before and now is broken. I don't honestly imagine that there are that many people doing what I'm doing, and so I can't argue that you will lose a bunch of users over this. It's just kind of annoying that it used to be easy, and now is hard.
Use case(s) / visualization(s)
Here's my use case: I am developing new apps for researchers at the University of Pennsylvania doing large-scale statistical text analysis in multiple different departments. I need to be able to automatically distribute copyright-protected data to authorized users in bulk, without risking leaking the data.
> I need to be able to automatically distribute copyright-protected data to authorized users in bulk, without risking leaking the data.
Are you positive these should be distributed using static files and not media files instead? It sounds like this data would be uploaded by your application users to a `FileField` or `ImageField` rather than tracked in version control like your code base. django-storages provides some options to restrict access.
The data is generated by a crawling process and aggregated into large zip files that the user then downloads. There's no uploading involved. (It's also not tracked in version control.)
But if there are ways to restrict access to the files using some other mechanism that I haven't mentioned above, I'm all ears! It just has to be able to efficiently handle multi-gigabyte files.
> The data is generated by a crawling process and aggregated into large zip files that the user then downloads. There's no uploading involved.
Ok, so when I have to do that type of thing, for me there is an "upload" involved at some point, not from a user, but from the crawling process. Here is how I usually handle this (assuming a Docker-based config):
- Create a storage class to store files as private in an AWS S3 bucket:

  ```python
  from storages.backends.s3boto3 import S3Boto3Storage


  class PrivateStorage(S3Boto3Storage):
      default_acl = 'private'
      file_overwrite = False
      bucket_name = 'my-private-bucket'
  ```
  More options are detailed in the documentation. You can use `AWS_QUERYSTRING_AUTH` and `AWS_QUERYSTRING_EXPIRE` to control access to your files.

- Create a model with a `FileField` that will be used to represent these files in my application, using this private storage, for example:

  ```python
  from django.db import models


  class LargeZipFile(models.Model):
      name = models.CharField(max_length=50)
      zip = models.FileField(storage=PrivateStorage())
  ```
- Use Celery to generate the data, and when files are ready, create instances of `LargeZipFile` to upload the files (see the sketch after this list).
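A minimal sketch of that last step, assuming the crawler leaves finished archives on disk (the task name and path handling are just for illustration):

```python
from pathlib import Path

from celery import shared_task
from django.core.files import File

from .models import LargeZipFile


@shared_task
def publish_crawl_results(zip_path, name):
    # zip_path points at the archive produced by the crawling process.
    # Saving it through the FileField uploads it to the private S3 bucket.
    with Path(zip_path).open("rb") as fh:
        instance = LargeZipFile(name=name)
        instance.zip.save(f"{name}.zip", File(fh), save=True)
    return instance.pk
```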
Each time a user wants to download a file, your application exposes the `LargeZipFile.zip.url` on some page; that URL will have query parameters giving access for a short amount of time.
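Concretely, that means something like this (a sketch; the view and template names are illustrative, and both settings are documented in django-storages):

```python
# settings.py
AWS_QUERYSTRING_AUTH = True    # .url returns signed URLs instead of public ones
AWS_QUERYSTRING_EXPIRE = 600   # signed URLs expire after 10 minutes

# views.py
from django.contrib.auth.decorators import login_required
from django.shortcuts import render

from .models import LargeZipFile


@login_required
def downloads(request):
    # Rendering {{ file.zip.url }} in the template yields a presigned S3 URL,
    # so the multi-gigabyte download goes straight from S3 to the user,
    # never through the Django worker.
    return render(request, "downloads.html", {"files": LargeZipFile.objects.all()})
```

So the user authenticates against Django, but the bytes never pass through Django itself, which sidesteps the memory issue.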
That being said, I don't know how your server is deployed at the University of Pennsylvania; it might be on a dedicated, non-cloud server. I don't know what your storage options are, but if AWS is not suitable, DigitalOcean might be, and it has a compatible API, which is supported by `django-storages`.
> It just has to be able to efficiently handle multi-gigabyte files.
A word of warning: this could generate some significant costs from Amazon.