Proposal for improving remote data format inference

Open pierrecamilleri opened this issue 10 months ago • 0 comments

Context

A related issue has already been submitted: #1646.

I've taken the liberty of opening a new issue more specifically linked to the format inference problem, for the sake of clarity. If you'd prefer me to continue with the above issue instead and close this one, let me know and I'll take care of it.

Issue

The file format is guessed from the file extension (if I am not mistaken, in this function). As far as my knowledge of file storage goes, this seems a good strategy for guessing the format of a local file, but it falls short for many remote CSV resources (at least those served over http(s)).

Indeed, many APIs serve CSV files from URLs that have no explicit extension.

Issue reproduction

With frictionless v4.40.11:

$ frictionless describe https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv

[...]
format: ''
[...]

(The problem remains in v5, tested with v5.16.1, but I could not work out how to reproduce this output there.)

Workaround

As mentioned in this comment, the workaround is to explicitly provide the format.

Proposal

The format of an http(s) response is usually declared in the response's Content-Type header.

It would seem appropriate to use this information to infer the file format.

For example, looking at the response headers for the above URL (curl -v https://data.capatlantique.fr/api/explore/v2.1/catalog/datasets/244400610_subventions_liste/exports/csv) indeed shows:

content-type: text/csv; charset=utf-8
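To make the idea concrete, here is a minimal sketch of mapping a Content-Type value to a format name, using only the Python standard library. The `MEDIA_TYPE_TO_FORMAT` table and the `format_from_content_type` helper are hypothetical names for illustration; they are not part of frictionless, and the real integration point would be wherever the format is currently guessed.

```python
from email.message import Message
from typing import Optional

# Hypothetical lookup table (an assumption, not taken from frictionless):
# media type -> format name.
MEDIA_TYPE_TO_FORMAT = {
    "text/csv": "csv",
    "application/json": "json",
    "application/vnd.ms-excel": "xls",
}


def format_from_content_type(header_value: str) -> Optional[str]:
    """Return a format name for a Content-Type header value, or None if unknown."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_type() drops parameters such as charset and
    # lowercases the media type.
    return MEDIA_TYPE_TO_FORMAT.get(message.get_content_type())


print(format_from_content_type("text/csv; charset=utf-8"))  # csv
```

An unknown media type returns `None`, so the existing extension-based guess could remain as a fallback.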

Some additional improvements could be made using the response headers: the encoding is also mentioned in Content-Type, and a more relevant filename can be found, for example, in the Content-Disposition header:

content-disposition: attachment; filename="244400610_subventions_liste.csv"
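As a rough illustration of these extra hints, the sketch below pulls the charset and the filename extension out of the two headers with the standard library. `hints_from_headers` is a hypothetical helper name, not an existing frictionless function.

```python
from email.message import Message
from pathlib import PurePosixPath
from typing import Optional, Tuple


def hints_from_headers(
    content_type: str, content_disposition: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return (encoding, extension) hints; each is None when not available."""
    message = Message()
    message["Content-Type"] = content_type
    message["Content-Disposition"] = content_disposition
    # Charset parameter of Content-Type, lowercased, or None if absent.
    encoding = message.get_content_charset()
    # Filename parameter of Content-Disposition, unquoted, or None if absent.
    filename = message.get_filename()
    extension = None
    if filename:
        # An empty suffix (no extension in the filename) becomes None.
        extension = PurePosixPath(filename).suffix.lstrip(".").lower() or None
    return encoding, extension


print(
    hints_from_headers(
        "text/csv; charset=utf-8",
        'attachment; filename="244400610_subventions_liste.csv"',
    )
)  # ('utf-8', 'csv')
```

The recovered extension could then go through the same extension-based inference that is already used for local files.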

Apr 17 '24 10:04 pierrecamilleri
