Introduce a codeList property to the field descriptor

Open djvanderlaan opened this issue 11 months ago • 10 comments

We work a lot with survey data and administrative data. In both cases, files often contain fields whose values should come from a limited list of possible values, each with a specific meaning. Some examples:

Questions from a questionnaire coded as 1-5 and 9, where 1 = "Strongly disagree", 2 = "Disagree", ... 5 = "Strongly agree" and 9 = "Did not answer".
Administrative data where one of the columns is the economic classification of a company (e.g. using NACE).

Properties of these codes:

In the examples, the codes are either integer or string values.
These values should be from a limited list of valid codes.
Some values indicate missing values.
The values usually have labels and/or descriptions.
Sometimes the codes contain a hierarchy. This is for example often the case with the NACE codes.
The lists can be small (2-3 codes) or very large (70,000 in the case of ICD10).

We are aware of the suggestion in issue #875 for supporting categories, which addresses the same problem. However, a few of our 'wishes' are not covered by that suggestion, and we believe the suggestion below is also easier to implement.

What we would like/need:

Possibility to indicate that a given field should use values/codes from a given list.
It should be possible to store this list either in the datapackage metadata (datapackage.json) itself or in a separate file, because large lists of codes bloat the metadata and make maintenance more difficult. This file could be part of the datapackage or could be hosted externally.
It should preferably be possible to define hierarchies in the codes.

What we suggest:

Add a property codeList to the FieldDescriptor. This MUST be a string with the name of a DataResource in the DataPackage (if there is a syntax for referencing a DataResource in an external DataPackage, this would also be valid).

This has a number of advantages:

DataResources allow for inline data in the data property, for files in the DataPackage using the path property with a relative path, and for external data using the path property with a URL.
Furthermore, we can reuse properties like name, description, sources, licenses and schema to describe the code list.
We can use all of the properties and tooling that are available already for DataResources. This makes implementation less work. Indicating a codelist for a field can already be useful without specific tooling for codelists. The user can see that the field uses a codelist and can manually read the corresponding data resource using currently existing tooling.
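The lookup implied by the proposal can be sketched in a few lines of Python working directly on the parsed datapackage.json. This is only an illustration under assumptions: the function name find_code_list and the small survey descriptor are hypothetical, and codeList is the proposed property, not part of the current specification.

```python
def find_code_list(package, resource_name, field_name):
    """Return the resource descriptor named by a field's codeList, or None.

    `package` is a parsed datapackage.json (a plain dict). The codeList
    property is the one proposed in this issue, not an existing spec feature.
    """
    resources = {r["name"]: r for r in package.get("resources", [])}
    resource = resources[resource_name]
    for field in resource.get("schema", {}).get("fields", []):
        if field["name"] != field_name:
            continue
        target = field.get("codeList")
        if target is None:
            return None  # the field does not use a code list
        if target not in resources:
            raise KeyError(f"codeList {target!r} is not a resource in this package")
        return resources[target]
    raise KeyError(f"no field named {field_name!r} in resource {resource_name!r}")


# Hypothetical descriptor, for illustration only.
package = {
    "name": "survey",
    "resources": [
        {
            "name": "answers",
            "path": "answers.csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "q1", "type": "integer", "codeList": "codelist-q1"},
                ]
            },
        },
        {
            "name": "codelist-q1",
            "data": [
                {"code": 1, "name": "Strongly disagree"},
                {"code": 9, "name": "Did not answer"},
            ],
        },
    ],
}

code_list = find_code_list(package, "answers", "q1")
print(code_list["name"])
```

Because the result is an ordinary resource descriptor, it can be handed to whatever existing tooling already reads data resources.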

Code List Resource

We don't yet have a concrete suggestion as to what should be in the dataset containing the code list and what format this dataset should have. We currently have an implementation that assumes that the first column in the dataset contains the codes and the second the labels of the codes. This is, however, minimal functionality. Some thoughts:

We could add fields to the DataResource that describe what fields are the code, labels etc.
It would be nice if the data resource containing the codelist were not required to be a TabularDataResource. That would also allow the data resource to point to, for example, an SDMX codelist (common for many official statistical datasets) or ClaML (used for some medical classifications). The type can be indicated using the format and/or mediatype properties of the data resource. However, the default (and only supported format) would be regular Tabular Data Resources.
We can take inspiration from how codelists are handled in SDMX. For example, SDMX2 allows for codes, labels, descriptions, parents (for simple hierarchies), multilingual labels etc. The functionality in SDMX3 is much more extensive (codes can have periods of validity; codelists can be part of multiple hierarchies; codelists can be selections or extensions of existing codelists). If we follow the basic SDMX codelist, then a basic code list would have the columns code (or id) and name, with optional columns description, locale and parent (a way to indicate missing values seems to be lacking; I can check with SDMX experts how this is handled, probably using custom annotations).
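As a rough sketch of what a tabular code list with the columns code, name and parent would allow, the Python below reads such a table and walks the parent column to recover a code's ancestors. The CSV content and the region codes are invented for illustration, and the helper names read_code_list and ancestors are hypothetical.

```python
import csv
import io

# Invented code list for illustration; an empty parent marks a top-level code.
CODELIST_CSV = """code,name,parent
N,North,
N1,North district 1,N
N11,North town 11,N1
S,South,
"""


def read_code_list(text):
    """Return (labels, parents) dicts built from a code,name,parent table."""
    rows = list(csv.DictReader(io.StringIO(text)))
    labels = {row["code"]: row["name"] for row in rows}
    parents = {row["code"]: row["parent"] for row in rows if row.get("parent")}
    return labels, parents


def ancestors(code, parents):
    """Follow the parent column upwards; return ancestors from nearest to root."""
    chain = []
    seen = {code}
    while code in parents:
        code = parents[code]
        if code in seen:
            raise ValueError("cycle in parent column")
        seen.add(code)
        chain.append(code)
    return chain


labels, parents = read_code_list(CODELIST_CSV)
print([labels[c] for c in ancestors("N11", parents)])
```

A single parent column only covers simple hierarchies; the richer SDMX3 features mentioned above (multiple hierarchies, validity periods) would need more than this.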

Example

A possible example with both a codelist in a file and inline data:

{ 
  "name": "highest_education",
  "resources": [
    { 
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": { 
        "fields": [
          { 
            "name": "id",
            "type": "integer"
          }, {
            "name": "place_of_residence",
            "type": "string",
            "codeList": "codelist-regions"
          }, { 
            "name": "edu_level",
            "type": "integer",
            "codeList": "codelist-edu_level"
          }
        ]
      }
    }, {
      "name": "codelist-regions",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "string"
          }, { 
            "name": "name",
            "type": "string"
          }, {
            "name": "parent",
            "type": "string"
          } 
        ]
      },
      "path": "codelist-regions.csv"
    }, {
      "name": "codelist-edu_level",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "integer"
          }, { 
            "name": "name",
            "type": "string"
          }
        ]
      },
      "data": [
        {"code": 1, "name": "Low education"},
        {"code": 2, "name": "Medium education"},
        {"code": 3, "name": "High education"}
      ]
    }
  ]
}
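To illustrate how a consumer might use the inline codelist-edu_level resource from the example, here is a minimal Python sketch that labels values and reports codes missing from the list. The function name check_and_label and the observed values are made up for illustration; only the three codes and labels come from the example above.

```python
# Inline data of the "codelist-edu_level" resource from the example.
edu_level_codes = [
    {"code": 1, "name": "Low education"},
    {"code": 2, "name": "Medium education"},
    {"code": 3, "name": "High education"},
]


def check_and_label(values, code_list):
    """Return (labels, invalid): a label per value (None if the code is
    unknown) and the sorted distinct values not present in the code list."""
    labels = {entry["code"]: entry["name"] for entry in code_list}
    labelled = [labels.get(value) for value in values]
    invalid = sorted({value for value in values if value not in labels})
    return labelled, invalid


# Hypothetical edu_level column; 7 is not a valid code.
observed = [1, 3, 2, 7, 3]
labelled, invalid = check_and_label(observed, edu_level_codes)
print(labelled)
print(invalid)
```

This treats the code list purely as a lookup table; how missing-value codes should be flagged is left open, as in the proposal.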

@fomcl

Mar 02 '24 15:03 djvanderlaan
