Implementation of the Expiration extension

Open Acconut opened this issue 3 years ago • 8 comments

tusd currently does not support the Expiration extension, so people have to rely on custom scripts or provider-specific approaches to clean up old uploads. This issue is intended as a place to discuss a possible implementation of this extension. In the end, tusd should be able to do two things:

  • Provide the expiration time of an upload in the HEAD response to clients, and
  • Be able to clean up (finished and unfinished) uploads after a given time.

Sending the expiration time to the client should not be a tough nut to crack. The actual cleanup work is hairier, because a solution has to cover the various storage options that tusd supports:

  • For the file storage, the best approach is probably to iterate through all uploads on disk and determine which ones need to be cleaned up. This could be implemented as an additional binary (maybe named tusd-cleaner) which the user must configure to run regularly as a cron job.
  • For the S3 and GCS storage, the cloud providers also offer expiration mechanisms built into their platforms. For example, AWS S3 can delete files after a certain time. However, this mechanism is coarse-grained, since the expiration can only be configured in days, so it does not help if someone wants to delete uploads after two hours. For these cloud providers we should consider offering two mechanisms: either let the cloud provider handle the expiration itself, or run tusd-cleaner regularly to iterate through the bucket and find old uploads.

One might ask: why do we need two binaries (tusd for accepting uploads and tusd-cleaner for cleaning up old uploads)? Why can't we integrate the cleaning into tusd directly? Well, if you have multiple tusd servers running in parallel for redundancy and load balancing, you don't want every one of these tusd instances to scan through your upload directory at the same time. It would be better to let this be handled by one server with a cron job.

On the implementation side we would need the following:

  • A new directory for our tusd-cleaner binary in https://github.com/tus/tusd/tree/master/cmd
  • A new datastore interface for iterating through existing uploads in https://github.com/tus/tusd/blob/master/pkg/handler/datastore.go
  • Implementations for the new datastore interface in our storages (maybe try filestore first and later s3store and gcsstore)
  • A new method for tusd to retrieve the expiration time of an upload and include it in the response to the client

These are just some ideas from me. I am open to hearing other thoughts and approaches!

Acconut avatar Oct 14 '20 21:10 Acconut

@Acconut Is this still open? I can pitch in for the design, discussion and implementation.

DravitLochan avatar May 25 '21 10:05 DravitLochan

@DravitLochan Yes, this task is still open. Any help is appreciated!

Acconut avatar May 25 '21 10:05 Acconut

@Acconut Shouldn't the overall file expiry and expiry of partial uploads be two different features? IMO TUS being a protocol (or a pseudo protocol), the overall file expiry should be out of the scope of TUS. That's something the applications should handle using a cron. Thoughts?

DravitLochan avatar May 25 '21 20:05 DravitLochan

@DravitLochan I think your point about keeping different expiry for completed and partial uploads can be refined into a configuration which allows the user to define:

  • An overall expiry (based on first creation time of file)
  • Different expiry for completed & partial uploads

Another idea I would like to add is to clean uploads based on max available system disk (e.g. if the disk consumed is >90%, then it should automatically delete the older uploads or start rejecting the new uploads).

alter123 avatar May 26 '21 02:05 alter123

Another idea I would like to add is to clean uploads based on max available system disk (e.g. if the disk consumed is >90%, then it should automatically delete the older uploads or start rejecting the new uploads).

Well, I am not sure this belongs as a feature of resumable uploads.

Also, as a backend dev, deleting the older uploads once the disk hits 90% seems unacceptable to me. I might lose some critical files, and my end users will start to churn.

If you want to solve the above problem, I'd say the best way would be to have alerts in place for the disk hitting 80, 85, 90 and 95%. In fact, if you run out of space, let the application handle it on its end: "Due to high traffic, we are not allowing any new file uploads. We will be back up in no time."

DravitLochan avatar May 26 '21 05:05 DravitLochan

@DravitLochan Your point makes sense, and rejecting the uploads sounds like the correct option, but I'm not sure if we need an implementation for this at the moment. @Acconut can clarify.

Implementation-wise, I think storing the expiry in the FileInfo struct makes sense, since every FileSystem can have a different abstraction for it.

Also, I'm not sure how we can parse common flags for config etc. across the binaries, since most of them will be the same in tusd & tusd-cleaner.

alter123 avatar May 26 '21 07:05 alter123

@Acconut what do you think?

DravitLochan avatar May 28 '21 19:05 DravitLochan

TUS being a protocol (or a pseudo protocol), the overall file expiry should be out of the scope of TUS.

Yes, tus should not be concerned about the expiry of files after the upload is completed, but tusd can definitely be concerned about this. There are multiple areas where tusd offers more features than the tus protocol specifies, just because those are handy and fit well into tusd. So, I see no problem with tusd offering expiration of uploads and files in one convenient package. Does that make sense? :)

Another idea I would like to add is to clean uploads based on max available system disk

Interesting idea, but I think that is outside of tusd's scope and is very application specific. This does not seem like a good fit for tusd, as @DravitLochan already mentioned.

rejecting the uploads sounds like the correct option, but I'm not sure if we need an implementation for this at the moment

You can already check the free space of the disk in the pre-create hook and reject uploads if the disk is too full. No additional implementation is needed inside tusd.
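
A minimal sketch of such a check in Go, assuming a file-based pre-create hook that rejects the upload when the hook process exits with a non-zero status. The 90% threshold, the "." path and the function name tooFull are placeholders, and syscall.Statfs is used as it exists on Linux.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// tooFull reports whether the used share of the disk exceeds maxUsedPercent.
func tooFull(availBytes, totalBytes uint64, maxUsedPercent float64) bool {
	if totalBytes == 0 {
		return true // treat an unreadable or empty file system as full
	}
	if availBytes >= totalBytes {
		return false // nothing is used; also avoids unsigned underflow below
	}
	used := float64(totalBytes-availBytes) / float64(totalBytes) * 100
	return used > maxUsedPercent
}

func main() {
	// "." is a placeholder; point this at the upload directory.
	var st syscall.Statfs_t
	if err := syscall.Statfs(".", &st); err != nil {
		fmt.Fprintln(os.Stderr, "statfs failed:", err)
		os.Exit(1)
	}
	blockSize := uint64(st.Bsize)
	avail := uint64(st.Bavail) * blockSize
	total := uint64(st.Blocks) * blockSize

	if tooFull(avail, total, 90) {
		fmt.Fprintln(os.Stderr, "disk is too full, rejecting upload")
		os.Exit(1) // non-zero exit status signals rejection in this sketch
	}
}
```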

Also, I'm not sure how we can parse common flags for config etc. across the binaries, since most of them will be the same in tusd & tusd-cleaner

That should not be a big problem. The parsing and/or validation of flags can be extracted into a shared module, which is imported by both programs (if that makes sense in this case).

Acconut avatar Jun 07 '21 21:06 Acconut