
MIME type for CSV SPARQL query results should declare the correct encoding as defined in the specification

Open pajoma opened this issue 7 months ago • 0 comments

Current Behavior

The query results are encoded in UTF-8:

public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"),
     StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR);

The specification says:

Systems providing these formats should note that the content types for CSV is text/csv and for TSV text/tab-separated-values. Being text/*, the default character set is US-ASCII. The charset parameter should be used in conjunction with SPARQL Results; UTF-8 is recommended: text/csv; charset=utf-8 and text/tab-separated-values; charset=utf-8.

But the MIME type exposed by RDF4J is "text/csv", without a charset parameter (in SparqlMimeTypes):

public static final String CSV_VALUE = "text/csv";

UTF-8 is obviously the correct choice, but standard clients such as the Python requests library assume "ISO-8859-1" for the Content-Type "text/csv" when no charset parameter is present.
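To illustrate the mismatch, here is a minimal sketch using only the Python standard library. The helper `effective_charset` is hypothetical: it is a simplified model of the fallback described above (use the charset parameter if present, otherwise assume ISO-8859-1 for text/* types), not the actual code of any client library.

```python
from email.message import Message
from typing import Optional


def effective_charset(content_type: str, text_default: str = "ISO-8859-1") -> Optional[str]:
    """Hypothetical helper: the charset a client would assume for a Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return str(charset)
    # No charset parameter: fall back to the legacy default for text/* types.
    if msg.get_content_maintype() == "text":
        return text_default
    return None


# A CSV result with one non-ASCII character (u-umlaut), encoded as UTF-8 like the server does.
payload = "name\nZ\u00fcrich\n".encode("utf-8")

# Decoded with the charset assumed for a bare "text/csv": the umlaut is garbled.
misread = payload.decode(effective_charset("text/csv"))

# Decoded with the charset announced by "text/csv; charset=utf-8": round-trips cleanly.
correct = payload.decode(effective_charset("text/csv; charset=utf-8"))
```

Under these assumptions, `misread` contains two wrong characters in place of the umlaut, while `correct` equals the original string.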

I can modify the REST controllers to not use the standard RDF4J MIME types, e.g.

    @PostMapping(value = "/query", consumes = {MediaType.TEXT_PLAIN_VALUE, SparqlMimeTypes.SPARQL_QUERY_VALUE},
            produces = { SparqlMimeTypes.JSON_VALUE, SparqlMimeTypes.CSV_VALUE+ ";charset=UTF-8"}
    )
    @ResponseStatus(HttpStatus.OK)
    Flux<BindingSet> queryBindingsPost(@RequestBody String query) {...}

but then I have to map from "text/csv;charset=UTF-8" back to "text/csv" everywhere else to get the correct ResultWriters.
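The mapping itself is small. As a language-neutral illustration (written in Python to match the reproduction below, not RDF4J code), stripping the parameters before the lookup could look like this. `base_media_type`, `WRITERS`, and `writer_for` are hypothetical names; the dictionary merely stands in for whatever registry resolves a MIME type to a result writer.

```python
def base_media_type(content_type: str) -> str:
    """Drop parameters such as ';charset=UTF-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


# Hypothetical stand-in for a writer registry keyed by the bare MIME type.
WRITERS = {
    "text/csv": "csv-writer",
    "application/sparql-results+json": "json-writer",
}


def writer_for(content_type: str) -> str:
    """Look up a writer using the parameter-free media type."""
    return WRITERS[base_media_type(content_type)]
```

The inconvenience is not the helper, but that it has to be applied at every place where the negotiated content type is matched against the RDF4J formats.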

Expected Behavior

public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"), StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR); 

should declare its MIME type as text/csv;charset=utf-8

If "text/csv" remains included, the SPARQLResultsCSVWriter should use "ISO-8859-1" as the encoding (perhaps with a warning?).

Steps To Reproduce

  1. Expose a SPARQL endpoint using the standard MIME types defined in RDF4J
  2. Call it with the Python requests library and see that it decodes the result as "ISO-8859-1"
            response = requests.post(
                url=f"...",
                data=query.encode("utf-8"),
                headers={
                    "X-API-KEY": api_key,
                    "Content-Type": "text/plain",
                    "Accept": "text/csv",
                    "X-Application": scope,
                },
            )
   
            enc = response.encoding  # is "ISO-8859-1", but in reality it is "UTF-8"
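Until the server announces the charset, a client-side workaround is to ignore the guessed encoding and decode the raw bytes as UTF-8. The sketch below uses only the standard library; `parse_sparql_csv` is a hypothetical helper, and with requests the `raw` argument would be `response.content`. Assigning `response.encoding = "utf-8"` before reading `response.text` should have a similar effect.

```python
import csv
import io
from typing import List


def parse_sparql_csv(raw: bytes, encoding: str = "utf-8") -> List[dict]:
    """Decode a SPARQL CSV result body explicitly and parse it into rows."""
    text = raw.decode(encoding)
    # newline="" lets the csv module handle the line endings itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Example body with a non-ASCII literal (u-umlaut); the IRI is made up for illustration.
raw = "city,label\r\nhttp://example.org/zurich,Z\u00fcrich\r\n".encode("utf-8")
rows = parse_sparql_csv(raw)
```

This only hides the problem on one client; every other consumer of the endpoint still sees a bare "text/csv".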

Version

4.3.8

Are you interested in contributing a solution yourself?

Perhaps?

Anything else?

No response

pajoma, Dec 13 '23 12:12