nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Set Content-Type to text/html for HTML file upload to S3 in publishDir directive

Open birnbera opened this issue 3 years ago • 17 comments

New feature

Currently, when using publishDir to send results to an s3 destination, all of the files are getting the Content-Type set to application/octet-stream. This is fine for data results, but with, e.g. MultiQC HTML reports it is the wrong type and should be text/html, instead.

Usage scenario

This behavior makes it difficult to serve the results directly from s3, i.e. displaying the html report in the browser using the object URL. Since the files have the Content-Type set to application/octet-stream, most browsers will respect this and force you to download the file instead of rendering it.

Suggest implementation

I couldn't find the Nextflow code where s3 upload from a publishDir directive is actually taking place. I suspect it is related to the code for moving files between different "providers" with a foreign "target". However, the simplest implementation would be to infer the mime-type based on the file extension and use that for the Content-Type property. A more complicated implementation would expose some of these s3 upload knobs in the publishDir directive.

birnbera avatar Aug 07 '20 22:08 birnbera

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 04 '21 23:01 stale[bot]

Can this issue be reopened?

zdk123 avatar Mar 04 '22 18:03 zdk123

Here are a few options to infer content type in Groovy:

And I believe this is the class that needs to be updated: https://github.com/nextflow-io/nextflow/blob/master/buildSrc/src/main/groovy/io/nextflow/gradle/tasks/S3Upload.groovy

bentsherman avatar Apr 22 '22 14:04 bentsherman

I agree, it would be nice to support this. It could be added a contentType attribute as show below

process foo {
  publishDir 's3://some/data', contentType: 'text/html' 

   '''
   some_command 
   '''
}

An alternative name could be mimeType to make it even more explicit.

pditommaso avatar Apr 25 '22 13:04 pditommaso

@pditommaso it would be great to have a mechanism for content-type inference based on file extension. For instance, if the published directory contains html, JavaScript, css, etc, setting one content type will obviously fail.

zdk123 avatar Apr 25 '22 14:04 zdk123

My preference would be to add an optional argument to publishDir, such as contentType or mimeType, which is only used by storage plugins, such as S3 or GCP, for which content types are relevant. I'd rather not have default content type inference implemented within Nextflow, however it would be fine to allow default inference by the underlying, e.g. AWS, SDK when no content type is specified, since this is standard behavior and is storage-backend specific.

birnbera avatar Apr 25 '22 22:04 birnbera

I'm inclined to think it would be better to explicitly set it. However, I can also see use cases in which it may be useful to have it automatically assigned, for example when publishing a dir containing both HTML, and images files. In this case, it can be overkilling to specify the content type file-by-file.

In this case, it could be possible to have contentType: true and nextflow automatically infers it from the extension. When is explicitly set as a string e.g. contentType: 'text/html' just use it.

pditommaso avatar Apr 26 '22 06:04 pditommaso

For inferring the content type, after looking around a bit I'm thinking it would be easiest to just have a map of common file extensions -> mime types. If people have extra cases they can always submit a PR.

bentsherman avatar Apr 26 '22 12:04 bentsherman

I think there's a popular library for that, tho don't remember the name @jorgeaguileraseqera do you remember it?

pditommaso avatar Apr 26 '22 13:04 pditommaso

Found Apache Tika https://www.baeldung.com/java-file-mime-type#apache-tika

Actually, it looks there's a Java native solution, which should be our fav choice

https://www.baeldung.com/java-file-mime-type#java-7

pditommaso avatar Apr 26 '22 13:04 pditommaso

I agree, it would be nice to have a new process directive where you could provide a filename (or pattern/extension) to content-type map that can be configured globally or for individual processes/labels/etc.

A nice implementation could be that passing a boolean turns auto-inference on/off, a string sets the value directly, or a map would be used for lookup. A built-in map or library could be used when the value is set to true, or a user specified one if provided (either directly or at a higher level of the configuration hierarchy). However the default behavior should be to let the SDK decide (i.e. AWS or GCP).

Not sure if this is being considered, but I think adding a new argument to publishDir would be more confusing than a new contentType (or similar) directive.

birnbera avatar Apr 26 '22 19:04 birnbera

@birnbera I'm guessing the issue with that is the SDK isn't being used to transfer files, but rather a pseudo network drive like s3fs

zdk123 avatar Apr 26 '22 19:04 zdk123

Tentative implementation 0d29df0d

pditommaso avatar May 08 '22 20:05 pditommaso

awesome @pditommaso. I would be happy to give it a test drive - what option arguments did you end up going with?

zdk123 avatar May 09 '22 15:05 zdk123

I tried this out today, setting contentType: true to the publishDir directive of an html file, and it worked great. Would be happy to see it appear in a future release! Many thanks.

odoublewen avatar Jun 20 '22 22:06 odoublewen

@pditommaso This is a desirable feature and seems to work great in my testing. Just curious, are you considering it on a roadmap for future release? Many thanks.

odoublewen avatar Jul 19 '22 18:07 odoublewen

we'll try to include in the next release

pditommaso avatar Jul 21 '22 15:07 pditommaso

Merged finally. 02afa332

pditommaso avatar Oct 02 '22 15:10 pditommaso

Should contentType: true work recursively on path outputs (meaning all the files under a subdirectory)?

zdk123 avatar Feb 15 '23 22:02 zdk123

Does this work for GCS?

dshinzie avatar Apr 14 '23 20:04 dshinzie

No, only S3 at the moment

bentsherman avatar Apr 14 '23 20:04 bentsherman