Handle large files better
Is your feature request related to a problem? Please describe.
It was raised at the Open Collective meeting last night that certain people were encountering issues using Known in modern hosted environments.
These environments may not have enough local storage to support traditional direct file uploads. Additionally, handling large files requires some fairly substantial config changes (max upload sizes, timeouts, etc.). Often these changes are verboten for regular folks, and they also present security / DoS implications.
Describe the solution you'd like
We need to support traditional uploads for those who self-host, and I don't want to be dependent on a separate service. We will also need to maintain the concept of a "file" in the datastore (database, not filestore), as we need to preserve metadata - ownership, tags, etc. But it would be nice if we could natively support a mechanism to transparently swap out these "local" files for files stored in e.g. an AWS S3 bucket (there is already some functionality for this provided by CDNStorable and the S3 plugin).
Currently:
- Uploaded files already have URLs
- They also have the ability (through CDNStorable) to delegate to an alternative URL to fetch data from (although this could be done better)
So, the fetch side of things is largely handled. There just needs to be a way to hook in and extend file storage. This was traditionally handled on a plugin by plugin basis, and was fairly crufty anyway.
So, I propose we have a dedicated file upload API endpoint.
This might work like this (a rough sketch follows the list):
- A client makes a POST request to an upload URL. This request contains no file data, but might contain metadata (name, size, type, etc.).
- This endpoint validates the file it's being asked to store, and the user who's requesting the upload.
- If everything is OK, it mints a database file object and returns the file ID, a URL to send the data to, and any extra info (signature, etc.). This URL may be in an entirely different location.
- The client then stores that file information and uploads the data to that specific location.
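Purely as a sketch of that handshake - none of this is existing Known API, and the helper functions (`currentUserMayUpload()`, `maxUploadSize()`, `createFileStub()`, `uploadUrlFor()`, `signatureFor()`) are placeholders for whatever checks and storage hooks we'd actually wire in:

```php
<?php
// Hypothetical endpoint: POST /api/upload
// The request carries metadata only; the response says where to send the bytes.

header('Content-Type: application/json');

$meta = json_decode(file_get_contents('php://input'), true) ?: [];

// Validate the metadata and the requesting user.
$allowedTypes = ['image/jpeg', 'image/png', 'audio/mpeg', 'video/mp4'];
if (!currentUserMayUpload()                                   // placeholder ACL check
    || !in_array($meta['type'] ?? '', $allowedTypes, true)
    || ($meta['size'] ?? 0) > maxUploadSize()) {              // placeholder config check
    http_response_code(400);
    echo json_encode(['error' => 'Upload not permitted']);
    exit;
}

// Mint a database file object now, so ownership, tags, etc. live in the
// datastore regardless of where the bytes end up (local disk, S3, ...).
$fileId = createFileStub($meta);            // placeholder helper

echo json_encode([
    'id'         => $fileId,
    'upload_url' => uploadUrlFor($fileId),  // may be a pre-signed URL on another host
    'signature'  => signatureFor($fileId),  // any extra info the storage target needs
]);
```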
Traditional upload will still be supported, but over time we will migrate plugins to support the new world.
I imagine the file upload control getting swapped out for a new one which handles this exchange on behalf of the user: uploading in the background and, when successful, inserting the necessary file ID / URL information into the save form.
The plugin save function will need to be modified to spot this new way of handling uploads.
I started work on https://gist.github.com/Lewiscowles1986/9078ca97b1a627913e221f59dcf36d22 yesterday, hoping to get it ready for a GitHub issue once refined.
The largest problem I see is https://www.w3.org/TR/micropub/#uploading-files
An open question is how we handle things like thumbnails.
My thinking is that this is probably better handled through the image proxy layer already in Known, which already supports resizing and other mutations of images. The output of these is cached using whatever backend cache is plugged in.
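Purely as an illustration of that pattern (the cache interface below is PSR-16, and nothing here is Known's actual proxy code), a resize-and-cache handler might look like:

```php
<?php
// Illustrative resize-on-demand handler: variants are derived, cached
// artefacts of the original file, not first-class files themselves.

use Psr\SimpleCache\CacheInterface;

function thumbnailBytes(string $sourceUrl, int $width, CacheInterface $cache): string
{
    $key = 'thumb_' . sha1($sourceUrl . ':' . $width);

    // Serve from whatever backend cache is plugged in, if we already have it.
    if (($cached = $cache->get($key)) !== null) {
        return $cached;
    }

    // Otherwise fetch the original and resize it with GD.
    $image   = imagecreatefromstring(file_get_contents($sourceUrl));
    $resized = imagescale($image, $width);

    ob_start();
    imagejpeg($resized, null, 85);
    $bytes = ob_get_clean();

    $cache->set($key, $bytes);
    return $bytes;
}
```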
The core differences were that I wanted to make the concept of an upload server its own thing (in my mind, but not written down, was having this be based on MIME type).
I was keen to push thumbnailing and media transforms out of known-core, perhaps in their place making first-class citizens of variants of content, plus hooks for after a successful upload of an attachment and after posting, with individual and document-level total attachment metadata, so all of that could be pushed out to a plugin around a leaner core.
function addVariant(string $attachmentId, string $variantKey, string $kind, string $mime, array $metadata) : void;
function removeVariant(string $attachmentId, string $variantKey) : void;
function updateVariant(string $attachmentId, string $variantKey, string $kind, string $mime, array $metadata) : void;
I was introduced to a service called Imgix, which does a lot of the transforms for image files and provides a CDN, meaning Known doesn't need to know about CDNs and transforms so much as variants and hooks after upload & post. I'd imagine audio has variants based on bitrate, etc., and video (not in Known, but I know Greg posts a lot of video) via video CDNs. The point is that a model of "see if there is something specific, and fall back to generic" could allow existing functionality to be preserved without much effort, while providing hooks to do more and deactivate that side of Known.
To look at this in an interface way, I'd propose:
function attachmentUploaded(string $url, string $mime, array $metadata) : void;
function afterSavePost(string $id, string $kind, array $post, array $metadata, array $attachments) : void;
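To make that concrete, here's a hedged sketch of a plugin sitting on those hooks; the Imgix URL, the `attachment_id` metadata key, and the class name are all assumptions rather than existing code:

```php
<?php
// Hypothetical plugin: register an externally-generated thumbnail variant
// when an image attachment finishes uploading, keeping transforms and CDN
// concerns out of core.

class ExternalImageVariants
{
    public function attachmentUploaded(string $url, string $mime, array $metadata): void
    {
        if (strpos($mime, 'image/') !== 0) {
            return; // non-images fall back to generic handling
        }

        // With an Imgix-style service, "creating" a variant is just minting a URL;
        // the transform happens lazily at the CDN edge.
        $thumbUrl = 'https://example.imgix.net/' . rawurlencode(basename($url)) . '?w=320&auto=format';

        // Assumes the attachment id is carried in the upload metadata.
        addVariant($metadata['attachment_id'], 'thumb-320', 'thumbnail', 'image/jpeg', [
            'url'   => $thumbUrl,
            'width' => 320,
        ]);
    }

    public function afterSavePost(string $id, string $kind, array $post, array $metadata, array $attachments): void
    {
        // e.g. warm caches or ping the CDN once the post and all of its
        // attachments are known.
    }
}
```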
Limitations and vulnerabilities of all-in-one image libraries are all over the internet. One benefit of shrinking the core Known API is the ability to deal with security disclosures, even if nobody is around to handle them.
It's getting a bit sideways, so should I link to this from another issue, circling the wagons?
Nvm, I realise now that my approach won't work directly as described. You're right, we probably should be dealing with URLs - files as DB stubs was actually me mis-remembering something from another system I build and maintain :D
They're DB stubs in the case of Mongo, but other file systems just map IDs to something local, and there's a chicken and egg issue with minting these ids which would mean significant changes.
Needs some more thought.
We probably want to look more at the jQuery chunked upload approach: https://github.com/blueimp/jQuery-File-Upload/blob/master/server/php/UploadHandler.php
Is that known to work on Heroku? TBH, chunked uploads (abstractly, via a persistent stream and client-side blob reading) were one of the things I'd considered, and have used in the past.
The problem is that you then need to perform coordination, which is less fun than it sounds (a sketch follows the list below).
Multiple chunks (client side)
- user initiates a chunked upload
- n chunks go to (potentially) n servers
  - unless you pin, which is uck
- once "done uploading", either an additional call to say "done" or a server-side event
  - either requires the count of chunks to be known in the first call, or more work after
- something (an async worker, most likely) then needs to update the client, either through polling or push
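To show where the coordination cost lands, here's a hedged sketch of the "done" call, assuming the first request declared the total chunk count and each receiving server records receipts in some shared store (`SharedStore`, `enqueueAssemblyJob()`, and the key names are all placeholders):

```php
<?php
// Hypothetical finalize step: POST /api/upload/{id}/done.
// Chunk receipts live in a shared store (database, Redis, ...) because the
// chunks may have landed on different servers.

function finalizeUpload(string $uploadId, SharedStore $store): array
{
    $expected = (int) $store->get("upload.$uploadId.total");    // declared in the initial call
    $received = (int) $store->count("upload.$uploadId.chunks"); // receipts from (potentially) n servers

    if ($received < $expected) {
        // Not complete yet: the client polls, or an async worker pushes an
        // update once the last chunk lands.
        return ['status' => 'pending', 'received' => $received, 'expected' => $expected];
    }

    // Hand off to a worker to stitch the chunks into the final file object.
    enqueueAssemblyJob($uploadId); // placeholder for whatever queue is available
    return ['status' => 'assembling'];
}
```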
Unless you mean HTTP chunked / streaming, which again has the impact of tying up servers.
HTTP Chunked / Streaming encoding
- it's a standard
- It's a security risk
- It's a DDoS vector
- Requires server-gateway / interconnect support
There is no free lunch, which is why I'm looking to break uploads out into at least two plugins. I started with a skeleton yesterday, but it does nothing for now, as I've become entangled in some other projects.
https://docs.min.io/docs/upload-files-from-browser-using-pre-signed-urls.html is a preferred way, but in our case it limits us to S3 or S3-compatible APIs, or to writing a pre-signed upload server (which sounds not fun).
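For scale, the pre-signed route with aws-sdk-php (which MinIO also accepts, since it speaks the S3 API) is fairly small; the endpoint, bucket, and key below are made up:

```php
<?php
// Generate a short-lived pre-signed PUT URL the browser uploads to directly,
// so the file bytes never pass through the Known web process.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client([
    'version'                 => 'latest',
    'region'                  => 'us-east-1',
    'endpoint'                => 'https://minio.example.com', // omit for real S3
    'use_path_style_endpoint' => true,                        // typically needed for MinIO
]);

$command = $s3->getCommand('PutObject', [
    'Bucket' => 'known-uploads',
    'Key'    => 'media/example.jpg',
]);

// Valid for 20 minutes; this URL is what the upload API endpoint described
// earlier would hand back to the client.
$request = $s3->createPresignedRequest($command, '+20 minutes');
echo (string) $request->getUri();
```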
I think rather than solve all of this and its complexity, for now I might look into solving a shorter problem: how to update a one-click-deployed app, and documenting some one-click deploy processes.
https://github.com/nknorg/nkn-cloud-image offers AWS, Google Cloud and DigitalOcean Packer-based one-click deploys, where a mutable FS is a given, as well as control over how long to keep a connection open for. It does not solve this, but I have not gained a huge amount of traction with it and have found myself with other things to work on.
I'm in favor of both approaches. I think shared hosting as a paradigm has outlived its usefulness, but some people are stuck with it. One-click deploys will certainly help - but we then also have to consider one-click upgrades.
I also think it's easy enough (ish) to create a chunked uploader and plugin API. For default, local filesystem uploads, the mechanics are relatively simple. For other kinds of uploads (e.g. S3) we need to allow plugins to do their own thing - for example by using the direct S3 chunked upload API.
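A hedged sketch of what that plugin seam could look like - a contract that core calls for the default local case, and which an S3 plugin could implement with the native multipart API instead (none of this exists in Known yet; the names are illustrative):

```php
<?php
// Hypothetical contract for upload backends. Core would ship the
// local-filesystem implementation; plugins (S3, etc.) override it and are
// free to use their storage's own chunked/multipart API under the hood.

interface ChunkedUploadDriver
{
    /** Start an upload and return an opaque upload id. */
    public function begin(string $filename, string $mime, int $totalSize): string;

    /** Append one chunk of data to an in-progress upload. */
    public function writeChunk(string $uploadId, int $index, string $data): void;

    /** Finish the upload and return the stored file's URL. */
    public function finish(string $uploadId): string;

    /** Abandon an upload and clean up any partial data. */
    public function abort(string $uploadId): void;
}
```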
I may be wrong, but I think S3 chunked (multipart) uploads have a minimum 5MB chunk size and a maximum 5GB chunk size.
https://github.com/aws/aws-sdk-php/blob/406fc010414f938ea9011329c213460ec08f6003/src/S3/MultipartUploader.php#L19-L21
https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
I don't think the 5GB max chunk size will be a problem within an HTTP context, especially not on Heroku. But the minimum...
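For comparison, the SDK's own helper bakes that floor in; a minimal sketch with a made-up bucket, key, and file path:

```php
<?php
// Upload a large local file to S3 with the SDK's multipart helper.
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\S3\MultipartUploader;
use Aws\Exception\MultipartUploadException;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

$uploader = new MultipartUploader($s3, '/tmp/large-video.mp4', [
    'bucket'    => 'known-uploads',
    'key'       => 'media/large-video.mp4',
    'part_size' => 5 * 1024 * 1024, // 5MB: the smallest part size S3 accepts
]);

try {
    $result = $uploader->upload();
    echo "Uploaded to {$result['ObjectURL']}\n";
} catch (MultipartUploadException $e) {
    echo 'Upload failed: ' . $e->getMessage() . "\n";
}
```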
I have a feeling this may be why a lot of people DIY their chunking solution, or avoid it with pre-signed uploads.
Happy Holidays Ben. Glad to see you here, interested to see what we can put into place.
@mapkyca I definitely recommend https://flysystem.thephpleague.com/v2/docs/ - it plays well with multiple filesystem types really neatly, and it has a list of multiple official and community-supported filesystems (including S3, Dropbox, etc.). It also supports read streams and write streams, and can be used with an entity of type File (maybe like Idno\Entity\File) to store meta + file for different storage backends.
I have been using it in different projects; it works well even when the filesystem is swapped from local to S3 in the middle of a project.
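A small sketch of that swap, using the Flysystem v2 API (bucket name, paths, and the form field are made up):

```php
<?php
// Write an upload through Flysystem, so switching from local disk to S3 is
// a one-line adapter change rather than new plugin code.
require 'vendor/autoload.php';

use League\Flysystem\Filesystem;
use League\Flysystem\Local\LocalFilesystemAdapter;
use League\Flysystem\AwsS3V3\AwsS3V3Adapter;
use Aws\S3\S3Client;

// Local filesystem backend (the self-hosted default).
$filesystem = new Filesystem(new LocalFilesystemAdapter('/var/www/known/uploads'));

// ...or S3, with no change to the calling code:
// $s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);
// $filesystem = new Filesystem(new AwsS3V3Adapter($s3, 'known-uploads'));

// Stream the upload in so large files never sit fully in memory; a File
// entity (e.g. an Idno\Entity\File-style stub) would keep the metadata.
$stream = fopen($_FILES['media']['tmp_name'], 'rb');
$filesystem->writeStream('media/' . basename($_FILES['media']['name']), $stream);
if (is_resource($stream)) {
    fclose($stream);
}
```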