Future possibility for delimiter-separated list for arrays (instead of JSON array)?
I'm a colleague of @geoffreyaldebert, working on a French national CSV schema for "bikes counting" at the moment.
My understanding is that https://github.com/frictionlessdata/specs/issues/712 and https://github.com/frictionlessdata/frictionless-py/issues/627 introduced a way to restrict the allowed values in an array, which is neat.
In our case (based on input from future data producers and reusers), we would like to avoid JSON arrays for those values and use delimiter-separated values instead, which less technical users can write and decode more easily.
The rationale is that, to drive adoption, we are creating a CSV schema precisely to avoid JSON, which some users find confusing at their current technical level.
Our current solution (WIP, the schema is not published yet) is to use a regex pattern:
https://github.com/etalab/schema-comptage-velo/blob/15096e6145b4926530a6fc5126db8cd25e35c803/schema.json#L175-L184
This trick has commonly been used for this case before (e.g. https://schema.data.gouv.fr/etalab/schema-inclusion-numerique/latest/documentation.html#propriété-public_cible).
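To make the regex trick concrete, here is a minimal sketch of how such a pattern can be built and checked. The allowed values and the helper names (`ALLOWED`, `build_pattern`, `is_valid`) are made up for illustration and are not taken from the linked schema.

```python
import re

# Hypothetical allowed values; the real schema defines its own list.
ALLOWED = ["velo", "trottinette", "pieton"]


def build_pattern(allowed, sep=","):
    """Build a regex accepting one or more allowed values joined by `sep`."""
    alternatives = "|".join(re.escape(value) for value in allowed)
    return f"^(?:{alternatives})(?:{re.escape(sep)}(?:{alternatives}))*$"


PATTERN = build_pattern(ALLOWED)


def is_valid(cell):
    """Return True if the cell is a separator-joined list of allowed values."""
    return re.fullmatch(PATTERN, cell) is not None


print(is_valid("velo,pieton"))  # accepted
print(is_valid("velo,car"))     # rejected: "car" is not an allowed value
```

Note that a pattern like this cannot easily forbid duplicates (e.g. `velo,velo`), which is one reason a dedicated column type would be cleaner.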
So my question is: is there room to consider future evolutions to add a "CSV-array" column type, with restrictions on actual values to be in an allowed range?
Thanks!
We discussed it in Discord and I think that type: array; format: separator, to have something like:
id,array
1,"A,B,C"
might make sense for the specs.
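As a rough sketch of what reading such a column could look like on the consumer side, the snippet below uses only Python's standard `csv` module. It is not the frictionless implementation; `read_rows` and its `separator` parameter are hypothetical names.

```python
import csv
import io

# The sample from the comment above: the array cell is quoted because it
# contains the CSV delimiter.
CSV_TEXT = 'id,array\n1,"A,B,C"\n'


def read_rows(text, separator=","):
    """Parse the CSV, then split the `array` cell on the given separator."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        # An empty cell is treated as an empty list (an assumption).
        items = row["array"].split(separator) if row["array"] else []
        rows.append({"id": int(row["id"]), "array": items})
    return rows


print(read_rows(CSV_TEXT))
# [{'id': 1, 'array': ['A', 'B', 'C']}]
```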
FWIW, I have had some feedback from users who would appreciate a non-comma separator (e.g. |), which is a "lower tech" way to achieve this and requires less escaping. I am not sure I want to encourage that, though. Ideally, just using , as the separator would be quite consistent with the regular case.
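The escaping difference can be seen by writing the same list with both separators. This is only an illustration using Python's standard `csv` writer with its default minimal quoting; `to_csv_line` is a hypothetical helper.

```python
import csv
import io


def to_csv_line(values, separator):
    """Write one row (id, joined values) and return the raw CSV line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([1, separator.join(values)])
    return buffer.getvalue()


comma_line = to_csv_line(["A", "B", "C"], ",")
pipe_line = to_csv_line(["A", "B", "C"], "|")

print(comma_line, end="")  # 1,"A,B,C"   (cell must be quoted)
print(pipe_line, end="")   # 1,A|B|C     (no quoting needed)
```

With a comma, anyone writing the file by hand has to remember the quotes; with a pipe they do not, as long as the CSV delimiter itself stays a comma.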
Thanks for considering this, it would be great to have and would let us clean a few schemas!
I'm currently working on a PR to integrate this. I've made the relevant changes in array.py but now need to integrate it elsewhere and add tests.
Currently I'm getting the error below, which goes away if I remove the format entry for the field.
FrictionlessException: [field-error] Field is not valid: "{'name': 'sett_bmu_id', 'type': 'array', 'format': ', ', 'array_item': {'type': 'string'}, 'description': 'The Balancing Mechanism Unit identifier used for settlement purposes by Elexon', 'title': 'Settlement BMU ID'} is not valid under any of the given schemas" at "" in metadata and at "anyOf" in profile
Where should I be looking to add this to the schema?
@AyrtonB It must be a JSONSchema rule in frictionless/assets/profiles/schema/general.json. We need to update the format definition there for array types.
That makes sense, I'll do that
I'll continue discussion around specifics of this implementation in the PR linked above
Is it already possible to specify arrays without the square brackets? I would say that is the normal case for CSV files: you have a value like 594866,594868,608288 and each number refers to a primary key in another CSV file.
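For illustration, here is a minimal sketch of that use case in plain Python, independent of any frictionless feature: split the cell and check each id against the primary keys of the other file. The function names and the `known_keys` set are hypothetical.

```python
def parse_id_list(cell, separator=","):
    """Split a bracket-less list such as '594866,594868,608288' into ints."""
    return [int(part) for part in cell.split(separator) if part.strip()]


def missing_references(cell, primary_keys, separator=","):
    """Return the ids in the cell that are absent from the referenced table."""
    return [i for i in parse_id_list(cell, separator) if i not in primary_keys]


# Hypothetical primary keys loaded from the other CSV file.
known_keys = {594866, 594868}

print(parse_id_list("594866,594868,608288"))
# [594866, 594868, 608288]
print(missing_references("594866,594868,608288", known_keys))
# [608288]
```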
Hi, I've created a feature request for the framework to pilot the feature:
- https://github.com/frictionlessdata/framework/issues/1434