Proposal for improving remote data format inference
Context
A related issue has already been submitted: #1646.
I've taken the liberty of opening a new issue more specifically linked to the format inference problem, for the sake of clarity. If you'd prefer me to continue with the above issue instead and close this one, let me know and I'll take care of it.
Issue
The file format is guessed from the file extension (if I am not mistaken, in this function). As far as I understand file storage, this is a good strategy for guessing the format of a local file, but it falls short for many remote csv resources (at least those served over http(s)).
Indeed, many APIs do not have an explicit extension when offering csv files.
Issue reproduction
With frictionless v4.40.11:
$ frictionless describe https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv
[...]
format: ''
[...]
(The problem remains in v5, tested with v5.16.1, but I could not find how to reproduce this exact output there.)
Workaround
As mentioned in this comment, the workaround is to explicitly provide the format.
Proposal
The format of an http(s) response is usually declared in the response's Content-Type header.
It would seem appropriate to use this information to infer the file format.
e.g. looking at the response headers for the above url (curl -v https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv) indeed shows:
content-type: text/csv; charset=utf-8
Some additional improvements could be made using the response headers: the encoding is also mentioned there, and a more relevant filename can be found, e.g. in the Content-Disposition header:
content-disposition: attachment; filename="244400610_subventions_liste.csv"
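To make the proposal concrete, here is a minimal sketch of how the format, encoding and filename could be derived from those two headers. It uses only the Python standard library and is not frictionless code: the function name `infer_from_headers` and the `MEDIA_TYPE_TO_FORMAT` mapping are hypothetical and chosen purely for illustration, and the mapping would need to be extended to the formats frictionless actually supports.

```python
from email.message import Message
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny mapping from media type to format name.
MEDIA_TYPE_TO_FORMAT = {
    "text/csv": "csv",
    "application/json": "json",
}


def infer_from_headers(headers):
    """Guess (format, encoding, filename) from HTTP response headers.

    `headers` is a plain mapping of header name to header value.
    Any element that cannot be determined is returned as None.
    """
    # email.message.Message parses parameterised headers and looks up
    # header names case-insensitively.
    msg = Message()
    for name, value in headers.items():
        msg[name] = value

    # get_content_type() falls back to "text/plain" when the header is
    # absent, so only trust it when Content-Type was actually sent.
    media_type = msg.get_content_type() if "content-type" in msg else None
    encoding = msg.get_content_charset()  # from "charset=...", else None
    filename = msg.get_filename()  # from Content-Disposition, else None

    fmt = MEDIA_TYPE_TO_FORMAT.get(media_type)
    if fmt is None and filename:
        # Fall back to the extension of the advertised filename.
        fmt = PurePosixPath(filename).suffix.lstrip(".").lower() or None

    return fmt, encoding, filename


if __name__ == "__main__":
    # Header values copied from the curl output above.
    print(infer_from_headers({
        "content-type": "text/csv; charset=utf-8",
        "content-disposition":
            'attachment; filename="244400610_subventions_liste.csv"',
    }))
```

The headers mapping can come from any HTTP client. Giving Content-Type priority over the filename extension is a design choice of this sketch, not something the issue prescribes; the opposite order may be preferable for servers that send a generic media type.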