Unable to set `charset` when mime-types are guessed (S3)

Open gf3 opened this issue 10 years ago • 28 comments

We are syncing a directory of various file types to an S3 bucket and aws-cli is correctly guessing the mime-types, however in our case it's important that it also append the charset. For example the guessed content-type for index.html might look like this:

Content-Type: text/html

But we'd like a way to be able to tell aws-cli that the charset for all the synced files is UTF8 for instance:

Content-Type: text/html; charset=utf-8

Version: aws-cli/1.7.26 Python/2.7.6 Darwin/14.4.0

gf3 avatar May 27 '15 16:05 gf3

+1

jdjkelly avatar May 27 '15 16:05 jdjkelly

:+1:

chainlink avatar May 27 '15 16:05 chainlink

:+1:

mathiasbynens avatar May 27 '15 16:05 mathiasbynens

:panda_face:

kieran avatar May 27 '15 16:05 kieran

:+1:

adammeghji avatar May 27 '15 16:05 adammeghji

:+1:

darcyclarke avatar May 27 '15 17:05 darcyclarke

Marking as feature request. Any suggestions on how you would like to see it exposed in the CLI would be appreciated.

kyleknap avatar May 27 '15 17:05 kyleknap

@kyleknap perhaps via --charset option? which would be appended to the guessed mime-type.
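
To make the proposal concrete, here is a minimal Python sketch of what such an option might do. It is an illustration only: `--charset` is not an existing aws-cli flag, and `with_charset` is a hypothetical helper. The only real API used is the standard library's `mimetypes.guess_type`.

```python
import mimetypes


def with_charset(filename, charset="utf-8"):
    """Guess a Content-Type from the filename and append a charset to text types.

    Hypothetical helper for illustration; not part of aws-cli.
    """
    guessed_type, _encoding = mimetypes.guess_type(filename)
    if guessed_type is None:
        # Unknown extension: nothing to append a charset to.
        return None
    if guessed_type.startswith("text/"):
        return f"{guessed_type}; charset={charset}"
    # Non-text types (images, archives, ...) keep the guessed value unchanged.
    return guessed_type
```

Under this sketch, `with_charset("index.html")` yields `text/html; charset=utf-8`, while binary types such as `image/png` are left alone.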

gf3 avatar May 27 '15 17:05 gf3

You can explicitly set content-type for s3 cp/sync and s3api put-object APIs.

For s3 cp/sync, use --content-type option.

$ aws s3 cp --content-type 'text/plain; charset=utf-8' index.html s3://BUCKET/index.html
upload: ./index.html to s3://BUCKET/index.html
$ aws s3api head-object --bucket BUCKET --key index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/plain; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:18:42 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

$ aws s3 sync foo s3://BUCKET/foo  --content-type 'text/html; charset=utf-8'
upload: foo/index.html to s3://BUCKET/foo/index.html
$ aws s3api head-object --bucket BUCKET --key foo/index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:30:54 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

For s3api put-object, use --content-type option.

$ aws s3api put-object --content-type 'text/html; charset=latin-1' --bucket BUCKET --key index2.html --body index.html
$ aws s3api head-object --bucket BUCKET --key index2.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=latin-1",
    "LastModified": "Thu, 28 May 2015 14:26:03 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

Is this different from what you want?

quiver avatar May 28 '15 14:05 quiver

@quiver yes it's a bit different; we have a client-side app that we're syncing, with a multitude of different file types. we'd really like to continue to take advantage of the mime-type guessing feature, which saves us from having to batch upload files based on type.

gf3 avatar May 28 '15 14:05 gf3

@gf3 Got it. thanks for your reply.

quiver avatar May 28 '15 15:05 quiver

@kyleknap just checking in here, anything i can do to help this along?

gf3 avatar Jun 18 '15 15:06 gf3

+1

mbystryantsev avatar Jan 04 '17 17:01 mbystryantsev

As a stop-gap, could the default "guessed" MIME type for HTML be changed to text/html; charset=utf-8 somehow?

ezzatron avatar Feb 05 '17 01:02 ezzatron

Any updates on this perhaps?

When we use --content-type "text/html; charset=utf-8" the files actually default to text/plain, which in turn causes index.html to be downloaded instead of served. How do I address this? I have the same scenario as @gf3, where I'm trying to sync up a client-side app.

Thanks!

tihomir-kit avatar Jun 19 '17 11:06 tihomir-kit

Running into this problem myself now, trying to sync a bunch of static website files to a bucket and s3cmd is not setting the correct charset=utf8 content-type when uploading the files which contain utf8 characters.

I'd like to keep the deployment job simple by just syncing the directory up the pipe instead of having to define the content-type on a per file-type or file basis. Any way this is now possible?

dmahlow avatar Aug 28 '17 15:08 dmahlow

As @dmahlow mentioned, you can define the content-type on a per-file-type basis. Just to illustrate what that might look like:

aws s3 sync --exclude "*" --include "*.html" --content-type "text/html; charset=utf-8" --delete ./public s3://www.example.com
aws s3 sync --include "*" --exclude "*.html" --delete ./public  s3://www.example.com
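
The same two-pass idea extends to more file types. The following is a rough Python sketch that only builds the argument lists and does not run anything. The pattern-to-type table, the local path, and the bucket name are placeholder assumptions, and `build_sync_commands` is a hypothetical helper; the only CLI flags used are the ones already shown in this thread.

```python
# Placeholder mapping of filename patterns to explicit Content-Type values.
TEXT_TYPES = {
    "*.html": "text/html; charset=utf-8",
    "*.css": "text/css; charset=utf-8",
    "*.js": "application/javascript; charset=utf-8",
}


def build_sync_commands(src, dest, text_types=TEXT_TYPES):
    """Return one `aws s3 sync` argument list per pattern, plus a final catch-all pass."""
    commands = []
    for pattern, content_type in text_types.items():
        # One pass per pattern: exclude everything, re-include only this pattern.
        commands.append([
            "aws", "s3", "sync",
            "--exclude", "*", "--include", pattern,
            "--content-type", content_type,
            "--delete", src, dest,
        ])
    # Final pass: everything not handled above, with the guessed Content-Type.
    rest = ["aws", "s3", "sync", "--include", "*"]
    for pattern in text_types:
        rest += ["--exclude", pattern]
    rest += ["--delete", src, dest]
    commands.append(rest)
    return commands
```

Each returned list could then be handed to `subprocess.run(cmd, check=True)`, for example by looping over `build_sync_commands("./public", "s3://www.example.com")`.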

perennialmind avatar Sep 30 '17 00:09 perennialmind

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

ASayre avatar Feb 06 '18 10:02 ASayre

Based on community feedback, we have decided to return feature requests to GitHub issues.

jamesls avatar Apr 06 '18 21:04 jamesls

Thanks to @perennialmind's comment for a reasonable work around. Would be nice to be able to specify mappings of some kind, though, to avoid finicky configs like this.
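
One way to picture such mappings is the Python sketch below. It is not an aws-cli feature: `CONTENT_TYPE_OVERRIDES`, `resolve_content_type`, and `upload_directory` are hypothetical names, and the upload step assumes the third-party boto3 package and its `upload_file` call with `ExtraArgs`. Unlike `aws s3 sync`, it re-uploads every file and never deletes anything.

```python
import fnmatch
import mimetypes
from pathlib import Path

# Hypothetical mapping of filename patterns to explicit Content-Type values.
CONTENT_TYPE_OVERRIDES = {
    "*.html": "text/html; charset=utf-8",
    "*.xml": "application/xml; charset=utf-8",
}


def resolve_content_type(filename, overrides=CONTENT_TYPE_OVERRIDES):
    """Use an explicit override when a pattern matches, else fall back to guessing."""
    name = Path(filename).name
    for pattern, content_type in overrides.items():
        if fnmatch.fnmatch(name, pattern):
            return content_type
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed


def upload_directory(local_dir, bucket):
    """Upload every file under local_dir in one call (assumes boto3 and credentials)."""
    import boto3  # imported here so resolve_content_type works without boto3

    s3 = boto3.client("s3")
    root = Path(local_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        content_type = resolve_content_type(path)
        extra_args = {"ContentType": content_type} if content_type else {}
        key = path.relative_to(root).as_posix()
        s3.upload_file(str(path), bucket, key, ExtraArgs=extra_args)
```

The mapping is consulted first, so guessing still covers every file type that has no explicit entry.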

theory avatar May 22 '18 02:05 theory

@gf3 First of all, the NERVE of appearing in my ONLINE EXPERIENCE. How dare you.

Second: the mimetypes module happily returns the guessed encoding (same as using file --mime-encoding [filename] on the command line) as the second element of its return tuple, but it's currently getting thrown away here:

https://github.com/aws/aws-cli/blob/072688cc07578144060aead8b75556fd986e0f2f/awscli/customizations/s3/utils.py#L294

My Python environment is hosed, but I'll take a run at a patch unless somebody beats me to it.

There's an argument to be made that it isn't aws-cli's responsibility to include the charset= portion of Content-Type for text/html files, but it's such a common use case (and the resulting mojibake so terrifying when it's omitted) that it seems worthwhile to me.

pnc avatar Jun 17 '19 13:06 pnc

Alright, so guess_type doesn't actually use libmagic under the hood, and only understands/guesses compression encodings, not text encodings. The following commit "works" to set a charset automatically on uploaded files:

https://github.com/aws/aws-cli/compare/develop...pnc:libmagic?expand=1

However, it:

  1. Probably doesn't work on Windows (at least without Cygwin)
  2. Needs to be tweaked so it doesn't cause the S3 copy unit tests to fail (I think they rely on it guessing based on filename alone?)
  3. Adds another dependency

Leaving it for posterity in case someone wants to pick up the torch, but this doesn't seem super viable unless someone from the core team encourages it.
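
The point about guess_type can be checked with the standard library alone. In this short illustration the filenames are arbitrary examples, and nothing is read from disk because the guess is based on the name only:

```python
import mimetypes

# A plain HTML file: the type is guessed, but the encoding slot is None.
html_guess = mimetypes.guess_type("index.html")
print(html_guess)  # ('text/html', None)

# A gzipped tarball: the second element names the compression, not a charset.
tarball_guess = mimetypes.guess_type("archive.tar.gz")
print(tarball_guess)  # ('application/x-tar', 'gzip')
```

So the second tuple element cannot supply a charset for text files; that would need content inspection or an explicit setting.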

pnc avatar Jun 17 '19 15:06 pnc

+1 for getting this solved correctly, please! My s3 copy commands are littered with include and exclude statements now 👎... looking very similar to justatheory's

JessicaSachs avatar Oct 09 '19 19:10 JessicaSachs

Still an issue, I've updated my blog publish script from the broken link above to this script. Sure wish I could specify mappings explicitly and call it once!

theory avatar Jul 05 '20 20:07 theory

This is still an issue in the v2 cli, it's a real pain!

I don't think the current default is sensible. I understand the compatibility impact in updating this, but please put this behind a feature flag at least.

ewan-realitymine avatar Nov 08 '21 17:11 ewan-realitymine

Discovered this issue today in our codebase. We are using aws CLI to upload files via S3. It would be lovely to have a way to ensure that the text/html charset is always UTF-8 by default.

linorabolini avatar Apr 06 '22 09:04 linorabolini

+1

andcip avatar May 13 '22 08:05 andcip

Hi guys, this works for my python script.

import sys
from io import StringIO

from awscli.clidriver import create_clidriver
from awscli.customizations.s3 import utils


def on_queued_charset(self, future, **kwargs):
    # Replacement for the subscriber hook that sets ContentType before an upload.
    guessed_type = utils.guess_content_type(self._get_filename(future))
    if not guessed_type:
        return

    # Append a charset to text-like types. Note that this matches every
    # application/* type, including binary ones.
    if "text/" in guessed_type or "application/" in guessed_type or guessed_type == "image/svg+xml":
        guessed_type += ";charset=UTF-8"
    future.meta.call_args.extra_args["ContentType"] = guessed_type


# Monkey-patch the CLI's internal subscriber class (not a public API).
utils.BaseProvideContentTypeSubscriber.on_queued = on_queued_charset

driver = create_clidriver()
args = [
    "s3",
    "sync",
    "local_dir",
    "s3://bucket",
    "--region=us-east-2",
    "--delete",
]

# Capture the CLI's output while the command runs, then restore the real streams.
old_stdout, old_stderr = sys.stdout, sys.stderr
sys.stdout = cli_stdout = StringIO()
sys.stderr = cli_stderr = StringIO()
try:
    cli_status = driver.main(args)
finally:
    sys.stdout, sys.stderr = old_stdout, old_stderr
print(cli_stdout.getvalue(), cli_stderr.getvalue())

maaiika avatar Jun 14 '23 02:06 maaiika