nextflow
nextflow copied to clipboard
Set Content-Type to text/html for HTML file upload to S3 in publishDir directive
New feature
Currently, when using publishDir
to send results to an s3
destination, all of the files are getting the Content-Type
set to application/octet-stream
. This is fine for data results, but with, e.g. MultiQC HTML reports it is the wrong type and should be text/html
, instead.
Usage scenario
This behavior makes it difficult to serve the results directly from s3
, i.e. displaying the html report in the browser using the object URL. Since the files have the Content-Type
set to application/octet-stream
, most browsers will respect this and force you to download the file instead of rendering it.
Suggest implementation
I couldn't find the Nextflow code where s3
upload from a publishDir
directive is actually taking place. I suspect it is related to the code for moving files between different "providers" with a foreign "target". However, the simplest implementation would be to infer the mime-type based on the file extension and use that for the Content-Type
property. A more complicated implementation would expose some of these s3
upload knobs in the publishDir
directive.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Can this issue be reopened?
Here are a few options to infer content type in Groovy:
And I believe this is the class that needs to be updated: https://github.com/nextflow-io/nextflow/blob/master/buildSrc/src/main/groovy/io/nextflow/gradle/tasks/S3Upload.groovy
I agree, it would be nice to support this. It could be added a contentType
attribute as show below
process foo {
publishDir 's3://some/data', contentType: 'text/html'
'''
some_command
'''
}
An alternative name could be mimeType
to make it even more explicit.
@pditommaso it would be great to have a mechanism for content-type inference based on file extension. For instance, if the published directory contains html, JavaScript, css, etc, setting one content type will obviously fail.
My preference would be to add an optional argument to publishDir
, such as contentType
or mimeType
, which is only used by storage plugins, such as S3 or GCP, for which content types are relevant. I'd rather not have default content type inference implemented within Nextflow, however it would be fine to allow default inference by the underlying, e.g. AWS, SDK when no content type is specified, since this is standard behavior and is storage-backend specific.
I'm inclined to think it would be better to explicitly set it. However, I can also see use cases in which it may be useful to have it automatically assigned, for example when publishing a dir containing both HTML, and images files. In this case, it can be overkilling to specify the content type file-by-file.
In this case, it could be possible to have contentType: true
and nextflow automatically infers it from the extension. When is explicitly set as a string e.g. contentType: 'text/html'
just use it.
For inferring the content type, after looking around a bit I'm thinking it would be easiest to just have a map of common file extensions -> mime types. If people have extra cases they can always submit a PR.
I think there's a popular library for that, tho don't remember the name @jorgeaguileraseqera do you remember it?
Found Apache Tika https://www.baeldung.com/java-file-mime-type#apache-tika
Actually, it looks there's a Java native solution, which should be our fav choice
https://www.baeldung.com/java-file-mime-type#java-7
I agree, it would be nice to have a new process directive where you could provide a filename (or pattern/extension) to content-type map that can be configured globally or for individual processes/labels/etc.
A nice implementation could be that passing a boolean turns auto-inference on/off, a string sets the value directly, or a map would be used for lookup. A built-in map or library could be used when the value is set to true
, or a user specified one if provided (either directly or at a higher level of the configuration hierarchy). However the default behavior should be to let the SDK decide (i.e. AWS or GCP).
Not sure if this is being considered, but I think adding a new argument to publishDir
would be more confusing than a new contentType
(or similar) directive.
@birnbera I'm guessing the issue with that is the SDK isn't being used to transfer files, but rather a pseudo network drive like s3fs
Tentative implementation 0d29df0d
awesome @pditommaso. I would be happy to give it a test drive - what option arguments did you end up going with?
I tried this out today, setting contentType: true
to the publishDir directive of an html file, and it worked great. Would be happy to see it appear in a future release! Many thanks.
@pditommaso This is a desirable feature and seems to work great in my testing. Just curious, are you considering it on a roadmap for future release? Many thanks.
we'll try to include in the next release
Merged finally. 02afa332
Should contentType: true
work recursively on path outputs (meaning all the files under a subdirectory)?
Does this work for GCS?
No, only S3 at the moment