packages-http

Feature request: Charset by extension

Open triska opened this issue 7 years ago • 9 comments

Use case: I would like to serve UTF-8 encoded *.txt files.

When I use the following server:

:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_unix_daemon)).
:- use_module(library(http/http_files)).

:- http_handler(root(.), http_reply_from_files('.', []), [prefix]).

then I can fetch *.txt files. However, the content-type in responses is:

Content-Type: text/plain

whereas I would like to get responses such as:

Content-Type: text/plain; charset=UTF-8

Is there an easy way to configure the SWI infrastructure to specify charset=UTF-8 when serving *.txt files? Alternatively, would you please consider adding such a feature? Thank you!

triska (Mar 04 '18 11:03)

@triska There is a hook http:mime_type_encoding/2 in library http/http_header that can be used to associate encodings with Media Types. http_reply_file/3 already uses hook mime:mime_extension/2 to associate Media Types with the file extension of served files. Maybe http_reply_file/3 could be taught to not only use the Media Type associations, but also their encodings as per http:mime_type_encoding/2?
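For readers unfamiliar with the extension hook mentioned above, here is a minimal sketch of how mime:mime_extension/2 can associate a file extension with a Media Type. The `log` extension and the choice of `text/plain` are hypothetical and only serve as illustration.

```prolog
:- use_module(library(http/mimetype)).   % file_mime_type/2

% The hook lives in module `mime` and is consulted before the built-in table.
:- multifile mime:mime_extension/2.

% Hypothetical example: treat *.log files as text/plain.
mime:mime_extension(log, text/plain).
```

With this clause loaded, a query such as `?- file_mime_type('server.log', Type).` should bind `Type` to `text/plain`. Note that the hook only covers the Media Type; it says nothing about the charset, which is the gap this issue is about.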

wouterbeek (Mar 04 '18 12:03)

Complicated issue. The good news is that more and more tools seem to encourage people to add encoding/charset declarations, so we can reduce the guessing we need. As Wouter knows better than me, the claim is often wrong, so we still have a long way to go ...

The hook in http_header.pl looks promising, but serves a different purpose. It is there to ensure that the HTTP streams have the correct Prolog encoding, so it faces inwards rather than outwards and is targeted at documents you generate in Prolog rather than at static files. In my experience, using something that is meant for A in the context of B is asking for trouble.

Probably the best way is to extend the mimetype library. I'm not entirely sure how, though. There is no sensible default encoding for text files in general. There may be one for a particular deployment, probably defined by a combination of the current locale, the content of the files and the intended audience. For example, if you have a pure Russian website, using the 'KOI8-R' charset makes perfect sense, as ISO Latin-1 does for a group of languages. If you have a heterogeneous set of files, you probably use UTF-8.

Could we derive the default from the Prolog encoding/locale? Probably hard as locale names differ by OS and are not standardized AFAIK. On the other hand, if we have a text file and the Prolog encoding is utf8 we could assume that adding charset=UTF-8 as a default makes sense. This would indicate we need an indirection, first from media type to text/binary and then to charset for the text files.

Does this make sense?

JanWielemaker (Mar 04 '18 17:03)

In my opinion, at least for *.txt files, almost any concrete (and fixed) default charset that can represent a reasonable set of languages is better than nothing.

If there were, in addition, a way to map file extensions to charsets (in analogy to the already existing extension → content-type mapping), I would already consider it a huge improvement, because it would at least let users specify the encoding they are using for their files, whether it is ISO Latin-1, Shift JIS, UTF-8 or KOI8-R etc.

For comparison, please see the Apache directive AddCharset:

AddCharset EUC-JP .euc
AddCharset ISO-2022-JP .jis
AddCharset SHIFT_JIS .sjis

This directive lets you specify charsets by extension and even for particular files individually:

<Files "example.html">
AddCharset UTF-8 .html
</Files>

One or two extensible predicates for corresponding settings in SWI-Prolog would be a very welcome addition!
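To make the analogy with AddCharset concrete, the following is a rough sketch of what an extension → charset table could look like in Prolog. Both `ext_charset/2` and `file_charset/2` are hypothetical names invented for this illustration; they are not part of any SWI-Prolog library.

```prolog
% Hypothetical user-extensible table, analogous to Apache's AddCharset.
:- multifile ext_charset/2.

ext_charset(euc,  'EUC-JP').
ext_charset(jis,  'ISO-2022-JP').
ext_charset(sjis, 'SHIFT_JIS').

% file_charset(+File, -Charset) is semidet.
% Succeeds if the extension of File has an entry in ext_charset/2.
file_charset(File, Charset) :-
    file_name_extension(_, Ext, File),
    ext_charset(Ext, Charset).
```

Because `ext_charset/2` is declared multifile, other files could add clauses for their own extensions. A file without a listed extension simply makes `file_charset/2` fail, which a server could interpret as "emit no charset".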

triska (Mar 04 '18 20:03)

I agree this needs a solution. Just, how? Based on Wouter's comment I considered associating a media type with a charset. The problem then, of course, is that you cannot address individual files or sets of files that share the same media type. This suggests attaching it to the file name/extension, which is what mimetype.pl also does for media types. It seems rather impractical to replicate this for charsets while in 99% of the cases you just want to say all text files are UTF-8 (or something else).

I'm thinking of something like this:

  • Have a mapping from media type to text/binary (hooked using a multifile predicate)

  • Have a Prolog flag or similar that defines the charset of text files.

  • Have a hookable rule that operates on the filename. The default computes the media type, uses the above mapping to determine whether it is a text file and then uses the flag to determine the charset. You can hook it to do whatever you like. We should probably make the full filename accessible from the hook, so you can do things such as checking for a BOM or looking for a config file in the directory of the file. I would consider something like

    charset(+FileName, +Mediatype, -Charset) is semidet.
    

Add this stuff to mimetype.pl.

Does that cover what we want?
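The three parts of this proposal could be sketched as follows. This is only an illustration of the idea under stated assumptions, not the code that was eventually committed: the hook names `sketch_text_type/1` and `sketch_charset_hook/3` and the flag `sketch_text_charset` are all hypothetical.

```prolog
% All predicate and flag names below are hypothetical.
:- multifile sketch_text_type/1, sketch_charset_hook/3.
:- dynamic   sketch_text_type/1, sketch_charset_hook/3.

% Part 2: a flag that defines the charset of text files.
:- create_prolog_flag(sketch_text_charset, 'UTF-8',
                      [type(atom), keep(true)]).

% Part 1: mapping from media type to "is text"; hook first, then a default.
is_text_type(Type) :- sketch_text_type(Type), !.
is_text_type(text/_).

% Part 3: charset(+FileName, +MediaType, -Charset) is semidet.
% The hook may inspect the full file name; otherwise fall back to the flag
% for media types that are considered text.
charset(File, Type, Charset) :-
    sketch_charset_hook(File, Type, Charset),
    !.
charset(_File, Type, Charset) :-
    is_text_type(Type),
    current_prolog_flag(sketch_text_charset, Charset).
```

With no hooks defined and the flag left at the value assumed here, `?- charset('notes.txt', text/plain, C).` yields `C = 'UTF-8'`, while `?- charset('logo.png', image/png, C).` fails, so no charset would be emitted for binary files.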

JanWielemaker (Mar 05 '18 01:03)

I think http:charset/3 as you show it in your last bullet point would be a good solution.

The other approaches you mention (in particular a tight coupling between content types and encodings) seem somewhat dubious and too inflexible to me.

The following thread contains some interesting settings that people find useful in practice:

https://stackoverflow.com/questions/913869/how-to-change-the-default-encoding-to-utf-8-for-apache

Please consider in particular the following:

<Files ~ "\.html?$">
     Header set Content-Type "text/html; charset=utf-8"
</Files>

That's clearly much more flexibility than is needed for this concrete issue. However, I still find this very interesting: You can configure Apache to emit particular header fields based on file names. This flexibility could be useful in SWI-Prolog too.

The http:charset/3 hook is of course a special case of such a more general mechanism, and would be great for the particular case at hand.

triska (Mar 05 '18 18:03)

I pushed cd44ae46d4d2dffe1af015b03a69bcf10800ba09, which I think provides both a fair default and the option to take it all into your own hands. Please have a look and close the issue if it solves your problems.

JanWielemaker (Mar 08 '18 01:03)

Thank you! The description of file_content_type/[2,3] is unclear to me: In particular, the following sequence seems not to capture what is actually implemented:

1. Determine the media type using file_mime_type/2
2. Determine it is a text file using text_mimetype/1
3. Use the charset from the Prolog flag `default_charset`

I do not know whether the description or code is (in)correct, but it seems what is meant is:

  1. Unless it is already specified, determine MediaType using file_mime_type/2.
  2. If the media type indicates a text file, derive and indicate the associated charset.
  3. For other media types, use the media type but do not indicate a charset.
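The suggested three-step reading can be written down as a small sketch. It is not the library's implementation: `sketch_content_type/3`, `sketch_is_text/1` and `sketch_charset/3` are hypothetical stand-ins, and the fixed `'UTF-8'` default is an assumption made for the example. Only file_mime_type/2 is taken from library(http/mimetype).

```prolog
:- use_module(library(http/mimetype)).   % file_mime_type/2

% sketch_content_type(+File, ?MediaType, -ContentType) is det.
sketch_content_type(File, MediaType, ContentType) :-
    % Step 1: determine the media type unless the caller specified it.
    (   var(MediaType)
    ->  file_mime_type(File, MediaType)
    ;   true
    ),
    % Step 2: for text media types, derive and indicate the charset.
    (   sketch_is_text(MediaType),
        sketch_charset(File, MediaType, Charset)
    ->  format(atom(ContentType), '~w; charset=~w', [MediaType, Charset])
    % Step 3: otherwise use the media type without a charset.
    ;   format(atom(ContentType), '~w', [MediaType])
    ).

% Hypothetical stand-ins for the text test and the charset lookup.
sketch_is_text(text/_).
sketch_charset(_File, _MediaType, 'UTF-8').   % assumed fixed default
```

For example, `?- sketch_content_type('x', text/plain, CT).` gives `CT = 'text/plain; charset=UTF-8'`, and `?- sketch_content_type('x', image/png, CT).` gives `CT = 'image/png'`.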

The description of the hooks and flags seems also not to capture what is actually happening. In particular, I suggest:

  • mime:charset/3 derives the charset for a file with a given media type, if the media type is text according to mime:text_mimetype/1

To me personally, it is also a bit surprising that the charset indication now hinges on text MIME types. Certainly one can think of MIME types such as (hypothetical) model/autocad-xml where you would also like to indicate a charset?

triska (Mar 09 '18 18:03)

Thanks for the suggestions. Applied. The idea is to use mime:text_mimetype/1 to define those mime types that you have as files in your current locale on disk. That should suffice for most setups. In case you have text files using different encodings, still use mime:text_mimetype/1 and then use the charset hook to do whatever is needed (check BOM, check extension, look for meta-data in the dir, ...)
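As a hedged illustration of this advice, the two hooks could be combined as below. The argument order of mime:charset/3 is assumed to follow the signature proposed earlier in this thread (file name, media type, charset); the `model/'autocad-xml'` type and the `.sjis` rule are hypothetical examples, not settings shipped with the library.

```prolog
:- use_module(library(http/mimetype)).

:- multifile mime:text_mimetype/1, mime:charset/3.

% Declare an additional media type as text, so that a charset is indicated
% for it. The subtype is quoted because it contains a hyphen.
mime:text_mimetype(model/'autocad-xml').

% Hypothetical per-extension override: *.sjis files are Shift JIS.
% Assumed argument order: mime:charset(+File, +MediaType, -Charset).
mime:charset(File, _MediaType, 'SHIFT_JIS') :-
    file_name_extension(_, Ext, File),
    Ext == sjis.
```

Files that match neither clause keep the default behaviour, that is, the charset comes from the `default_charset` flag if the media type counts as text. The result can be inspected with `?- file_content_type(File, ContentType).`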

JanWielemaker (Mar 10 '18 01:03)

Thank you, it works!

At least the charset parameter that is mentioned in the following TBD seems now fully implemented:

https://github.com/SWI-Prolog/packages-http/blob/master/mimetype.pl#L49

If applicable, please consider removing the TBD or citing a different example.

triska (Mar 11 '18 19:03)