
Document a `Spec` example of a proposed data quality tag

Open padraic-shafer opened this issue 1 year ago • 1 comments

Disclaimer: this idea does not require any changes to Tiled modules. It describes a proposed Spec that might be useful to include in the documented examples.


Researchers generally capture lots of data, but not all of it is of suitable quality for further analysis or publication. It would be valuable to record a quality assessment as soon as it is made and to store that assessment with the dataset. Adding the assessment to a dataset's metadata in Tiled would make it possible to filter datasets by quality, and the filtered results could readily be shared with colleagues or with agents for algorithm training.

It would seem prudent to keep the schema as simple as possible, adding to it in an extensible yet documented manner when necessary. For example, a minimal JSON schema might look like this:

{
  "$id": "https://example.com/data-quality-tag.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Tag",
  "type": "object",
  "properties": {
    "quality": {
      "description": "Quality assessment.",
      "type": "object",
      "properties": {
        "label": {
          "description": "Level of quality.",
          "type": "string",
          "enum": ["Exemplary", "Good", "Bad", "Alignment", "Review Later"]
        },
        "reason": {
          "description": "Why this quality level was selected.",
          "type": "string"
        }
      },
      "required": ["label"]
    },
    "reporter": {
      "description": "Who created this tag.",
      "type": "object",
      "$ref": "https://example.com/person.schema.json"
    },
    "time": {
      "description": "When this tag was created.",
      "type": "string",
      "format": "date-time"
    }
  },
  "required": [ "quality", "reporter", "time" ]
}
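To make the constraints concrete, here is a minimal Python sketch that checks a tag against the rules in the schema above using only the standard library. The function name `validate_tag` and the constant `QUALITY_LABELS` are hypothetical and not part of Tiled. In practice a JSON Schema validator library would likely be a better fit. Note also that `datetime.fromisoformat` is only an approximation of the `date-time` format.

```python
from datetime import datetime

# Labels copied from the schema's "enum"; the names below are illustrative only.
QUALITY_LABELS = {"Exemplary", "Good", "Bad", "Alignment", "Review Later"}


def validate_tag(tag: dict) -> list:
    """Return a list of problems; an empty list means the tag passed these checks."""
    problems = []

    # Top-level "required": quality, reporter, time.
    for field in ("quality", "reporter", "time"):
        if field not in tag:
            problems.append(f"missing required field: {field}")

    quality = tag.get("quality")
    if quality is not None:
        if not isinstance(quality, dict):
            problems.append("quality must be an object")
        else:
            label = quality.get("label")
            if label is None:
                problems.append("quality.label is required")
            elif not isinstance(label, str) or label not in QUALITY_LABELS:
                problems.append(f"unknown quality label: {label!r}")
            # "reason" is optional, but must be a string when present.
            if "reason" in quality and not isinstance(quality["reason"], str):
                problems.append("quality.reason must be a string")

    time_value = tag.get("time")
    if time_value is not None:
        try:
            datetime.fromisoformat(time_value)
        except (TypeError, ValueError):
            problems.append("time must be an ISO 8601 date-time string")

    return problems
```

This sketch does not check the contents of `reporter`, because the referenced person schema is not defined in this issue.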

One goal is to keep the enumerated quality levels as a small and well-defined set. Here are the envisioned meanings.

| Quality Label | Meaning |
| --- | --- |
| Exemplary | This is a prototypical example of the system under study. Publish this. Keep this dataset forever. |
| Good | This is a good dataset for further analysis. |
| Bad | This data is flawed and should be discarded, or could be used as a counter-example for future training. |
| Alignment | This data was captured as part of an alignment step in an experiment. It does not represent the system under study, but could be useful for debugging. |
| Review Later | There is something interesting or unusual about this dataset, but the impact is unclear. Revisit this at some point. |

Example

It is envisioned that multiple tags could be appended to the metadata field of a Tiled node (container or a nested field/subset).

metadata = {
  ...,
  tags: [
    {
      quality: {
        label: "Bad", 
        reason: "Could not locate peak."}, 
      reporter: {name: "qualbot", uid: "..."}, 
      time: "2023-12-14T01:52:13-05:00",
    },
    {
      quality: {
        label: "Good", 
        reason: "Background confused the peak detection routine."}, 
      reporter: {name: "Padraic", uid: "..."}, 
      time: "2023-12-16T08:17:06-08:00",
    },
    {
      quality: {
        label: "Good"},
      reporter: {name: "Pete", uid: "..."},
      time: "2024-01-10T13:42:52-06:00",
    },
  ],
  ...
}
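As a rough illustration of how such a tag list could be consumed, the Python sketch below picks the most recent tag and filters datasets by its label. It assumes that the newest tag should win when tags disagree, which this proposal does not specify. The helper names `latest_tag` and `select_by_label` are hypothetical, and the code works on plain dictionaries rather than on any Tiled API. The `uid` values are omitted because they are elided above.

```python
from datetime import datetime
from typing import Optional


def latest_tag(metadata: dict) -> Optional[dict]:
    """Return the tag with the most recent "time", or None if there are no tags."""
    tags = metadata.get("tags", [])
    if not tags:
        return None
    # Timezone-aware datetimes compare correctly across different UTC offsets.
    return max(tags, key=lambda tag: datetime.fromisoformat(tag["time"]))


def select_by_label(datasets: dict, label: str) -> list:
    """Return the keys of datasets whose most recent tag carries the given label."""
    selected = []
    for key, metadata in datasets.items():
        tag = latest_tag(metadata)
        if tag is not None and tag["quality"]["label"] == label:
            selected.append(key)
    return selected


# The three tags from the example above, with the elided uid fields left out.
metadata = {
    "tags": [
        {"quality": {"label": "Bad", "reason": "Could not locate peak."},
         "reporter": {"name": "qualbot"},
         "time": "2023-12-14T01:52:13-05:00"},
        {"quality": {"label": "Good",
                     "reason": "Background confused the peak detection routine."},
         "reporter": {"name": "Padraic"},
         "time": "2023-12-16T08:17:06-08:00"},
        {"quality": {"label": "Good"},
         "reporter": {"name": "Pete"},
         "time": "2024-01-10T13:42:52-06:00"},
    ]
}
```

Other policies are equally plausible, such as preferring human reporters over automated ones, so the "newest wins" rule here should be read as one option among several.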

padraic-shafer · Dec 16 '23 16:12

At some point, we’ll need to consider what’s needed to facilitate schema evolution for forward/backward compatibility. It might be worthwhile to already add a version field to the tag definition.
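One possible shape for this, sketched in Python under stated assumptions: the field is named `version`, it holds an integer, and tags written before the field existed are treated as version 1. None of this is settled in the proposal, and `CURRENT_TAG_VERSION` and `upgrade_tag` are hypothetical names.

```python
# Hypothetical: the version of the tag schema that this code understands.
CURRENT_TAG_VERSION = 1


def upgrade_tag(tag: dict) -> dict:
    """Return a copy of the tag that carries an explicit "version" field."""
    upgraded = dict(tag)
    # Assumption: tags without a version predate the field and are version 1.
    version = upgraded.setdefault("version", 1)
    if version > CURRENT_TAG_VERSION:
        # A newer writer produced this tag; refuse rather than misread it.
        raise ValueError(f"unsupported tag version: {version}")
    return upgraded
```

Later schema revisions could add per-version migration steps inside this function so that older tags remain readable.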

padraic-shafer · Dec 24 '23 22:12