Future possibility for delimiter-separated list for arrays (instead of JSON array)?

Open thbar opened this issue 3 years ago • 6 comments

I'm a colleague of @geoffreyaldebert, working on a French national CSV schema for "bikes counting" at the moment.

My understanding is that https://github.com/frictionlessdata/specs/issues/712 and https://github.com/frictionlessdata/frictionless-py/issues/627 introduced a way to restrict the allowed values in an array, which is neat.

In our case (based on input from future data producers and reusers), we would like to avoid JSON arrays for those values and use delimiter-separated values instead, which less technical users can write and decode more easily.

The rationale is that we are creating a CSV schema precisely to avoid JSON, which some users find confusing at their current technical level, in order to drive adoption.

Our current solution (WIP, the schema is not published yet) is to use a regex pattern:

https://github.com/etalab/schema-comptage-velo/blob/15096e6145b4926530a6fc5126db8cd25e35c803/schema.json#L175-L184

This trick has commonly been used before for the same case (e.g. https://schema.data.gouv.fr/etalab/schema-inclusion-numerique/latest/documentation.html#propriété-public_cible).
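
To make the regex trick concrete, here is a minimal sketch of how such a pattern can be built and checked. The allowed values and the helper names (`build_list_pattern`, `is_valid_cell`) are assumptions for illustration; they are not taken from the linked schema.

```python
import re

# Hypothetical allowed values, for illustration only
# (not the values used in the linked schema).
ALLOWED = ["velo", "trottinette", "pieton"]


def build_list_pattern(allowed, sep=","):
    """Build a regex matching one or more allowed values joined by `sep`."""
    # One alternative per allowed value, escaped so it is matched literally.
    item = "(?:" + "|".join(re.escape(value) for value in allowed) + ")"
    # A first item, then any number of "separator + item" repetitions.
    return re.compile(rf"{item}(?:{re.escape(sep)}{item})*")


PATTERN = build_list_pattern(ALLOWED)


def is_valid_cell(cell):
    """Return True if the whole cell is a valid delimiter-separated list."""
    return PATTERN.fullmatch(cell) is not None
```

With this pattern, `velo,pieton` is accepted, while `velo,voiture`, an empty cell, or a trailing separator are rejected. The drawback is that the list of allowed values is buried in the regex rather than declared as a constraint.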

So my question is: is there room to consider future evolutions to add a "CSV-array" column type, with restrictions on actual values to be in an allowed range?

Thanks!

thbar • May 10 '21 07:05

We discussed it on Discord, and I think that type: array; format: separator, to allow something like:

id,array
1,"A,B,C"

might make sense for the specs
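
As a rough sketch of what such a format could mean for a reader, the snippet below parses the sample above with the standard library and splits the array column on a separator. The function name `read_rows` and its parameters are hypothetical; this is not the frictionless API.

```python
import csv
import io

# The sample from the comment above.
CSV_TEXT = 'id,array\n1,"A,B,C"\n'


def read_rows(text, array_fields, separator=","):
    """Parse CSV text and split the named columns into lists of strings."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        for name in array_fields:
            cell = row[name]
            # An empty cell is treated as an empty list (an assumption).
            row[name] = cell.split(separator) if cell else []
        rows.append(row)
    return rows


rows = read_rows(CSV_TEXT, array_fields=["array"])
```

Here `rows[0]["array"]` is the list `["A", "B", "C"]`, while `id` stays a plain string.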

roll • May 10 '21 08:05

FWIW, I have had some feedback from users who might appreciate a non-comma separator (e.g. |), which is a "lower tech" way to achieve this and requires less escaping. I am not sure I want to encourage that, though. Ideally, a plain , as the separator would be consistent with the regular case.
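
The escaping difference can be seen with Python's standard `csv` writer, which by default quotes a field only when it contains the column delimiter, the quote character, or a line break. The helper name `to_csv_line` is an assumption for illustration.

```python
import csv
import io


def to_csv_line(values, separator):
    """Write a two-column row (id, joined list) and return the raw CSV line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["1", separator.join(values)])
    return buffer.getvalue().rstrip("\n")


# The list separator collides with the column delimiter, so the cell is quoted.
comma_line = to_csv_line(["A", "B", "C"], ",")  # 1,"A,B,C"

# No collision with the column delimiter, so no quoting is needed.
pipe_line = to_csv_line(["A", "B", "C"], "|")   # 1,A|B|C
```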

Thanks for considering this, it would be great to have and would let us clean up a few schemas!

thbar • May 27 '21 16:05

I'm currently working on a PR to integrate this. I've made the relevant changes in array.py but now need to integrate it elsewhere and add tests.

Currently I'm getting this error (below), which I can get rid of if I remove the format entry for the field.

FrictionlessException: [field-error] Field is not valid: "{'name': 'sett_bmu_id', 'type': 'array', 'format': ', ', 'array_item': {'type': 'string'}, 'description': 'The Balancing Mechanism Unit identifier used for settlement purposes by Elexon', 'title': 'Settlement BMU ID'} is not valid under any of the given schemas" at "" in metadata and at "anyOf" in profile

Where should I be looking to add this to the schema?

AyrtonB • Aug 12 '21 08:08

@AyrtonB It must be a JSONSchema rule in frictionless/assets/profiles/schema/general.json. We need to update the format definition there for array types

roll • Aug 12 '21 09:08

That makes sense, I'll do that

AyrtonB • Aug 12 '21 09:08

I'll continue discussion around specifics of this implementation in the PR linked above

AyrtonB • Aug 12 '21 11:08

Is it already possible to specify arrays without the square brackets? I would say it is the normal case for CSV files. You have a value like 594866,594868,608288 and each number references a primary key in another CSV file.
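
For illustration, a cell of that shape can be split and checked against the other file's keys by hand. This is a minimal sketch; the function name `missing_references` and the sample key set are assumptions, not part of any frictionless API.

```python
def missing_references(cell, known_keys, separator=","):
    """Return the ids in `cell` that are not present in `known_keys`."""
    # Split the cell, trimming whitespace and dropping empty parts.
    ids = [part.strip() for part in cell.split(separator) if part.strip()]
    return [identifier for identifier in ids if identifier not in known_keys]


# Hypothetical primary keys loaded from the other CSV file.
known = {"594866", "594868"}
missing = missing_references("594866,594868,608288", known)
```

In this example `missing` contains only `608288`, the one id with no matching primary key.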

jze • Feb 20 '23 07:02

Hi, I've created a feature request for the framework to pilot the feature:

  • https://github.com/frictionlessdata/framework/issues/1434

roll • Feb 20 '23 09:02