rdf4j
MIME type for CSV SPARQL query results should declare the encoding as defined in the specification
Current Behavior
The query results are encoded in UTF-8:
public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"),
StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR);
The specification says:
Systems providing these formats should note that the content types for CSV is text/csv and for TSV text/tab-separated-values. Being text/*, the default character set is US-ASCII. The charset parameter should be used in conjunction with SPARQL Results; UTF-8 is recommended: text/csv; charset=utf-8 and text/tab-separated-values; charset=utf-8.
But the MIME type exposed by RDF4J is "text/csv", without a charset parameter (in SparqlMimeTypes):
public static final String CSV_VALUE = "text/csv";
UTF-8 is obviously the correct choice, but standard clients such as the Python requests library assume "ISO-8859-1" for the Content-Type "text/csv" when no charset parameter is present.
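To illustrate the effect, here is a minimal stdlib-only sketch of how a client might choose a decoding charset from the Content-Type header. The helper `pick_encoding` is hypothetical; it only approximates the fallback behavior described above and is not the actual code of the requests library.

```python
from email.message import Message
from typing import Optional


def pick_encoding(content_type: str) -> Optional[str]:
    """Hypothetical helper: choose a text encoding from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # An explicit charset parameter always wins (returned in lower case).
    charset = msg.get_content_charset()
    if charset is not None:
        return charset
    # Assumed client fallback for text/* without a charset parameter.
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    return None


# UTF-8 bytes as a CSV writer would emit them for the value "Zürich".
body = "Z\u00fcrich".encode("utf-8")

# Without a charset parameter the client falls back and garbles the text.
garbled = body.decode(pick_encoding("text/csv"))

# With the charset parameter the text round-trips correctly.
correct = body.decode(pick_encoding("text/csv; charset=utf-8"))

print(repr(garbled), repr(correct))
```

With the bare "text/csv" header the two-byte UTF-8 sequence for "ü" is decoded as two separate Latin-1 characters, which is the kind of corruption a client sees when the server omits the charset.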
I can modify the REST controllers so that they do not use the standard RDF4J MIME types, e.g.:
@PostMapping(value = "/query",
        consumes = { MediaType.TEXT_PLAIN_VALUE, SparqlMimeTypes.SPARQL_QUERY_VALUE },
        produces = { SparqlMimeTypes.JSON_VALUE, SparqlMimeTypes.CSV_VALUE + ";charset=UTF-8" })
@ResponseStatus(HttpStatus.OK)
Flux<BindingSet> queryBindingsPost(@RequestBody String query) {...}
but then I have to map from "text/csv;charset=UTF-8" back to "text/csv" everywhere else to get the correct ResultWriters.
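The extra mapping step amounts to stripping the parameters from the media type before looking up a writer. The sketch below shows the idea in Python for brevity; the same logic would apply in the Java controllers. The `WRITERS` registry and its values are hypothetical placeholders, not RDF4J APIs.

```python
from typing import Optional

# Hypothetical registry keyed by the bare media type (no parameters).
WRITERS = {
    "text/csv": "csv-writer",
    "application/sparql-results+json": "json-writer",
}


def bare_media_type(content_type: str) -> str:
    """Drop parameters such as ';charset=UTF-8' and normalize the case."""
    return content_type.split(";", 1)[0].strip().lower()


def writer_for(content_type: str) -> Optional[str]:
    """Look up a writer by the parameter-free media type."""
    return WRITERS.get(bare_media_type(content_type))


print(writer_for("text/csv;charset=UTF-8"))
```

Every place that resolves a writer from the negotiated content type would need this normalization, which is the overhead the workaround introduces.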
Expected Behavior
public static final TupleQueryResultFormat CSV = new TupleQueryResultFormat("SPARQL/CSV", List.of("text/csv"), StandardCharsets.UTF_8, List.of("csv"), SPARQL_RESULTS_CSV_URI, NO_RDF_STAR);
The MIME type in this definition should be "text/csv;charset=utf-8".
If "text/csv" remains included, the SPARQLResultsCSVWriter should use "ISO-8859-1" as its encoding (perhaps with a warning).
Steps To Reproduce
- Expose a SPARQL endpoint using the standard MIME types defined in RDF4J
- Call it with the Python requests library and observe that it decodes the result as "ISO-8859-1":
import requests

response = requests.post(
    url=f"...",
    data=query.encode("utf-8"),
    headers={
        "X-API-KEY": api_key,
        "Content-Type": "text/plain",
        "Accept": "text/csv",
        "X-Application": scope,
    },
)
enc = response.encoding  # reports "ISO-8859-1", although the body is actually UTF-8
Version
4.3.8
Are you interested in contributing a solution yourself?
Perhaps?
Anything else?
No response