
Handle large files better

Open mapkyca opened this issue 5 years ago • 14 comments

Is your feature request related to a problem? Please describe.

It was raised at the Open Collective meeting last night that certain people were encountering issues using Known in modern hosted environments.

These environments may not have enough local storage to support traditional direct file uploads. Additionally, handling large files requires some fairly substantial configuration changes (max upload sizes, timeouts, etc.). Often these changes are verboten for regular folks, and they also present security / DoS implications.

Describe the solution you'd like

We need to support traditional uploads for those who self host, and I don't want to be dependent on a separate service. We will also need to maintain the concept of a "file" in the datastore (database, not filestore), as we need to preserve metadata: ownership, tags, etc. But it would be nice if we could natively support a mechanism to transparently swap out these "local" files for files stored in e.g. an AWS S3 bucket (there is already some functionality for this provided by CDNStorable and the S3 plugin).

Currently:

  • Uploaded files already have URLs
  • They also have the ability (through CDNStorable) to delegate to an alternative URL that data is fetched from (although this could be done better)

So, the fetch side of things is largely handled. There just needs to be a way to hook in and extend file storage. This was traditionally handled on a plugin-by-plugin basis, and was fairly crufty anyway.

So, I propose we have a dedicated file upload API endpoint.

This might work like this:

  1. A client makes a POST request to an upload URL. This request contains no file data, but might contain metadata (name, size, type, etc.).
  2. This endpoint validates the file it's being asked to store, and the user who's requesting the upload.
  3. If everything is OK, it mints a database file object and returns the file ID, a URL of a location to send the data to, and any extra info (signature, etc.). This URL may be in an entirely different location.
  4. The client then stores that file information and uploads the data to that specific location (a rough sketch of this exchange follows the list).
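
As a rough illustration of steps 1-4, here is a minimal sketch of what the handshake endpoint might do; none of the names below (handle_upload_handshake, the upload_url host, the response fields) exist in Known today.

<?php
// Hypothetical sketch of the proposed upload handshake endpoint (not an
// existing Known API). Steps 2 and 3 from the list above: validate, mint a
// DB file object holding only metadata, and return the ID plus a destination.
function handle_upload_handshake(array $metadata, int $maxBytes): array
{
    // Validation happens before any file data has been transferred.
    if (empty($metadata['name']) || ($metadata['size'] ?? 0) > $maxBytes) {
        throw new InvalidArgumentException('Upload rejected');
    }

    // Mint a database "file" object; here an opaque ID stands in for it.
    $fileId = bin2hex(random_bytes(16));

    // The destination may live somewhere else entirely, e.g. an S3 bucket
    // reachable via a pre-signed URL.
    return [
        'id'         => $fileId,
        'upload_url' => 'https://uploads.example.com/put/' . $fileId, // placeholder
        'expires_at' => time() + 900,
    ];
}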

Traditional upload will still be supported, but over time we will migrate plugins to support the new world.

I imagine it working in such a way that the file upload control gets swapped out for a new one which handles this exchange on behalf of the user, uploading in the background and, when successful, inserting the necessary file ID / URL information into the save form.
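
For illustration, the exchange that new control would perform might look like the following; it's sketched as a PHP CLI client for brevity (in Known it would be JavaScript in the browser), and the endpoint path and response fields are hypothetical.

<?php
// Hypothetical client for the two-step upload exchange described above.
$metadata = ['name' => 'photo.jpg', 'type' => 'image/jpeg', 'size' => filesize('photo.jpg')];

// Step 1: handshake with metadata only - no file bytes are sent yet.
$handshake = json_decode(file_get_contents(
    'https://example-known-site.com/upload/begin', // placeholder endpoint
    false,
    stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/json',
        'content' => json_encode($metadata),
    ]])
), true);

// Step 2: upload the actual data to whatever location was returned, which
// may be a completely different host (e.g. a pre-signed S3 URL).
file_get_contents($handshake['upload_url'], false, stream_context_create(['http' => [
    'method'  => 'PUT',
    'content' => file_get_contents('photo.jpg'),
]]));

// Step 3: the save form then only needs to carry the minted file ID / URL.
echo 'Uploaded file id: ' . $handshake['id'] . PHP_EOL;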

The plugin save function will need to be modified to spot this new way of handling uploads.

mapkyca avatar May 07 '20 08:05 mapkyca

I started work on https://gist.github.com/Lewiscowles1986/9078ca97b1a627913e221f59dcf36d22 yesterday, hoping to get it ready for a GitHub issue once refined.

Lewiscowles1986 avatar May 07 '20 09:05 Lewiscowles1986

The largest problem I see is https://www.w3.org/TR/micropub/#uploading-files

Lewiscowles1986 avatar May 07 '20 09:05 Lewiscowles1986

An open question is how we handle things like thumbnails and stuff.

My thinking is that this is probably better handled through the image proxy layer already in Known, which already supports resizing and other mutations of images. The output of these is cached using whatever backend cache is plugged in.
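
To illustrate the proxy-plus-cache model in the abstract (this is a generic sketch, not Known's actual image proxy code; the cache layout and GD calls are just for illustration):

<?php
// Generic illustration of the resize-on-request, cache-the-result pattern.
function cached_thumbnail(string $sourceUrl, int $width, string $cacheDir): string
{
    $cachePath = $cacheDir . '/' . sha1($sourceUrl . ':' . $width) . '.jpg';

    // If the backend cache already has this mutation, serve it directly.
    if (is_file($cachePath)) {
        return $cachePath;
    }

    // Otherwise fetch the original (local file or CDN URL), resize, and cache.
    $image = imagecreatefromstring(file_get_contents($sourceUrl));
    $thumb = imagescale($image, $width);
    imagejpeg($thumb, $cachePath);

    return $cachePath;
}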

mapkyca avatar May 07 '20 09:05 mapkyca

The core difference was that I wanted to make the concept of an upload server its own thing (in my mind, but not written down, was having this be based on MIME type).

Lewiscowles1986 avatar May 07 '20 09:05 Lewiscowles1986

I was keen to push thumbnailing and media transforms out of Known core. Perhaps, in their place, variants of content could become first-class citizens, with hooks for after a successful attachment upload and after posting, carrying individual and document-level total attachment metadata. That would allow this work to be pushed out to a plugin around a leaner core.

function addVariant(string $attachmentId, string $variantKey, string $kind, string $mime, array $metadata) : void;
function removeVariant(string $attachmentId, string $variantKey) : void;
function updateVariant(string $attachmentId, string $variantKey, string $kind, string $mime, array $metadata) : void;

I was introduced to a service called Imgix, which does a lot of the transforms for image files and provides a CDN, meaning Known doesn't need to know about CDNs and transforms so much as about variants and hooks after upload & post. I'd imagine audio has variants based on bitrate, etc., and video (not in Known, but I know Greg posts a lot of video) via video CDNs. The point is that a model of "see if there is something specific, and fall back to generic" could allow existing functionality to be preserved without much effort, while providing hooks to do more and to deactivate that side of Known.

To look at this in an interface way, I'd propose:

function attachmentUploaded(string $url, string $mime, array $metadata) : void;
function afterSavePost(string $id, string $kind, array $post, array $metadata, array $attachments) : void;
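
As a sketch of how a plugin might consume these hooks under the "specific first, fall back to generic" model described above (the class name, the attachment_id metadata key, and the addVariant wiring are hypothetical):

<?php
// Hypothetical plugin built against the proposed hook signatures above.
class ImageVariantPlugin
{
    /** Fired after a successful attachment upload. */
    public function attachmentUploaded(string $url, string $mime, array $metadata): void
    {
        // Specific first: only act on MIME types this plugin understands;
        // anything else falls back to generic handling elsewhere.
        if (strpos($mime, 'image/') !== 0) {
            return;
        }

        // Register a variant; the transform itself could be delegated to an
        // external service such as Imgix rather than done in core Known.
        $this->addVariant($metadata['attachment_id'], 'thumbnail', 'image', $mime, [
            'width'  => 320,
            'source' => $url,
        ]);
    }

    public function addVariant(string $attachmentId, string $variantKey, string $kind, string $mime, array $metadata): void
    {
        // Persisting the variant record is left to core / the storage plugin.
    }
}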

Lewiscowles1986 avatar May 07 '20 09:05 Lewiscowles1986

Limitations and vulnerabilities of all-in-one (AIO) image libraries are all over the internet. One benefit of shrinking the core Known API is the ability to deal with security disclosures, even if nobody is around to handle them.

Lewiscowles1986 avatar May 07 '20 09:05 Lewiscowles1986

It's getting a bit sideways, so should I link to this from another issue, circling the wagons?

Lewiscowles1986 avatar May 07 '20 09:05 Lewiscowles1986

Nvm, I realise now that my approach won't work directly as described. You're right, we probably should be dealing with URLs. Files as DB stubs was actually me misremembering something from another system I build and maintain :D

They're DB stubs in the case of Mongo, but other file systems just map IDs to something local, and there's a chicken-and-egg issue with minting these IDs which would mean significant changes.

Needs some more thought.

mapkyca avatar May 12 '20 07:05 mapkyca

We might want to look more at the jQuery chunked upload approach: https://github.com/blueimp/jQuery-File-Upload/blob/master/server/php/UploadHandler.php

mapkyca avatar May 13 '20 08:05 mapkyca

Is that known to work on Heroku? TBH, chunked uploads (abstractly) via a persistent stream and client-side blob reading were one of the things I'd considered, and have used in the past.

The problem is that you then need to perform coordination, which is less fun than it sounds.

Multiple chunks (client side)

  • user initiates a chunked upload
    • n chunks go to (potentially) n servers
      • unless you pin sessions, which is yuck
    • once "done uploading", either an additional call is needed to say "done", or a server-side event
      • either requires the count of chunks to be known in the first call, or more work afterwards
    • something (an async worker, most likely) then needs to update the client, either through polling or push

Unless you mean HTTP chunked / streaming, which again has an impact by tying up servers.

HTTP Chunked / Streaming encoding

  • It's a standard
  • It's a security risk
  • It's a DDoS vector
  • It requires server-gateway / interconnect support

There is no free lunch, which is why I'm looking to break uploads out into at least two plugins. I started with a skeleton yesterday, but it does nothing for now, as I've become entangled in some other projects.

https://docs.min.io/docs/upload-files-from-browser-using-pre-signed-urls.html is a preferred way, but in our case it limits us to S3 or S3-compatible APIs, or to writing a pre-signed upload server (which sounds not fun).
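
For reference, minting a pre-signed PUT URL with the AWS SDK for PHP looks roughly like this; it also works against MinIO and other S3-compatible endpoints, and the bucket, key, and endpoint below are placeholders.

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

// Placeholder configuration; point 'endpoint' at MinIO, or drop it for real S3.
$client = new S3Client([
    'version'                 => 'latest',
    'region'                  => 'us-east-1',
    'endpoint'                => 'https://minio.example.com',
    'use_path_style_endpoint' => true, // required for MinIO-style endpoints
]);

$command = $client->getCommand('PutObject', [
    'Bucket' => 'known-uploads',
    'Key'    => 'attachments/example.jpg',
]);

// The browser can PUT the file body straight to this URL, so the app server
// never has to proxy the bytes itself.
$presigned = $client->createPresignedRequest($command, '+15 minutes');
echo (string) $presigned->getUri() . PHP_EOL;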

Lewiscowles1986 avatar May 14 '20 13:05 Lewiscowles1986

I think rather than solve all of this and its complexity, for now I might look into solving a smaller problem: how to update a one-click-deployed app, and how to document some one-click deploy processes.

https://github.com/nknorg/nkn-cloud-image offers AWS, Google Cloud, and DigitalOcean Packer-based one-click deploys, where a mutable filesystem is a given, as well as control over how long to keep connections open. It does not solve this issue, but I have not gained a huge amount of traction with it and have found myself with other things to work on.

Lewiscowles1986 avatar Jun 10 '20 05:06 Lewiscowles1986

I'm in favor of both approaches. I think shared hosting as a paradigm has outlived its usefulness, but some people are stuck with it. One-click deploys will certainly help - but we then also have to consider one-click upgrades.

I also think it's easy enough (ish) to create a chunked uploader and plugin API. For default, local filesystem uploads, the mechanics are relatively simple. For other kinds of uploads (e.g. S3) we need to allow plugins to do their own thing - for example by using the direct S3 chunked upload API.
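
To make that concrete, a plugin API for this might look something like the following sketch; none of these names exist in Known today.

<?php
// Hypothetical interface for pluggable chunked uploads. The default
// implementation could append chunks to a local temp file; an S3 plugin could
// map these calls onto S3's own multipart upload operations.
interface ChunkedUploadHandler
{
    /** Begin an upload and return an opaque upload ID. */
    public function start(string $filename, string $mime, int $totalBytes): string;

    /** Store one chunk of data. $index is zero-based. */
    public function putChunk(string $uploadId, int $index, string $bytes): void;

    /** All chunks received: assemble the file and return its final URL. */
    public function finish(string $uploadId): string;

    /** Abort the upload and clean up any partial data. */
    public function abort(string $uploadId): void;
}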

benwerd avatar Dec 24 '20 20:12 benwerd

I may be wrong, but I think S3 multipart (chunked) uploads have a minimum 5 MB part size and a maximum 5 GB part size.

https://github.com/aws/aws-sdk-php/blob/406fc010414f938ea9011329c213460ec08f6003/src/S3/MultipartUploader.php#L19-L21

https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html

I don't think the 5 GB max part size will be a problem within an HTTP context, especially not in Heroku, but the minimum...

I have a feeling this may be why a lot of people DIY their chunking solution, or avoid it with pre-signed uploads.
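
For context, this is roughly how the SDK's MultipartUploader linked above is used; note the part_size floor. The bucket, key, and file path are placeholders.

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\S3\MultipartUploader;

$client = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);

// part_size may not go below 5 MB (PART_MIN_SIZE in MultipartUploader.php),
// which is the constraint discussed above.
$uploader = new MultipartUploader($client, '/tmp/large-video.mp4', [
    'bucket'    => 'known-uploads',
    'key'       => 'attachments/large-video.mp4',
    'part_size' => 5 * 1024 * 1024,
]);

$result = $uploader->upload();
echo $result['ObjectURL'] . PHP_EOL;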

Happy Holidays, Ben. Glad to see you here; interested to see what we can put into place.

Lewiscowles1986 avatar Dec 25 '20 13:12 Lewiscowles1986

@mapkyca I definitely recommend https://flysystem.thephpleague.com/v2/docs/ - it plays well with multiple filesystem types really neatly. It also has a list of multiple official and community-supported filesystem adapters (including S3, Dropbox, etc.), and has support for read streams and write streams. It can be used with an entity of type File (maybe like Idno\Entity\File) to store metadata plus the file for different storage backends.

I have been using it in different projects; it works well even when the filesystem is swapped from local to S3 in the middle of a project.
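
A minimal sketch of what that swap looks like with Flysystem; the paths, bucket name, and environment switch are placeholders.

<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use League\Flysystem\Filesystem;
use League\Flysystem\Local\LocalFilesystemAdapter;
use League\Flysystem\AwsS3V3\AwsS3V3Adapter;

// One configuration switch decides the backend; the calling code is identical.
$adapter = getenv('KNOWN_USE_S3') === '1'
    ? new AwsS3V3Adapter(new S3Client(['version' => 'latest', 'region' => 'us-east-1']), 'known-uploads')
    : new LocalFilesystemAdapter('/var/www/known/uploads');

$filesystem = new Filesystem($adapter);

// Streaming writes mean large files never have to fit in memory.
$stream = fopen('/tmp/incoming-upload', 'rb');
$filesystem->writeStream('attachments/example.jpg', $stream);
if (is_resource($stream)) {
    fclose($stream);
}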

ipranjal avatar Dec 12 '21 09:12 ipranjal