
Advanced Data Dictionary

Open jqnatividad opened this issue 2 years ago • 12 comments

Currently, CKAN's Data Dictionary is limited to a column's data type, label, and description.

With qsv stats, we collect descriptive statistics while inferring each column's data type during the Analysis phase of a DP+ job.

Currently: sum, min/max, min/max length, mean, stddev, variance, quartiles, median, modes, cardinality, and nullcount.

When cardinality = rowcount and nullcount = 0, we can infer that a column is a candidate primary key / unique index, and annotate its data dictionary accordingly (going further, we could even create a unique index on it after the Copy phase).

When nullcount = rowcount, we can infer that a column is empty, and note it in the data dictionary as well.
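As a minimal sketch, the two inferences above could be driven directly off the stats output. This assumes (hypothetically) that the stats arrive as a CSV with `field`, `cardinality`, and `nullcount` columns, one row per column of the analyzed resource; the actual column names qsv emits may differ.

```python
import csv
import io

def infer_annotations(stats_csv: str, rowcount: int) -> dict:
    """Infer data-dictionary annotations from a stats CSV.

    Hypothetical input format: a CSV with `field`, `cardinality`,
    and `nullcount` columns, one row per column of the resource.
    """
    annotations = {}
    for row in csv.DictReader(io.StringIO(stats_csv)):
        cardinality = int(row["cardinality"])
        nullcount = int(row["nullcount"])
        notes = []
        # Every value distinct and none null: unique-index candidate.
        if cardinality == rowcount and nullcount == 0:
            notes.append("candidate primary key / unique index")
        # Every value null: the column is empty.
        if nullcount == rowcount:
            notes.append("empty column")
        if notes:
            annotations[row["field"]] = notes
    return annotations
```

For a 3-row resource, a column with cardinality 3 and no nulls would be flagged as a unique-index candidate, and a column with 3 nulls as empty.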

And with qsv frequency, we can also compile frequency tables for the top N values of each column; if a column's cardinality is below a given threshold N, we can even infer the column's domain as a set of enumerated values.
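The enumerated-domain inference might look like the following sketch, assuming (hypothetically) a frequency table CSV with `field`, `value`, and `count` columns:

```python
import csv
import io
from collections import defaultdict

def infer_enum_domains(freq_csv: str, n_threshold: int = 10) -> dict:
    """Infer enumerated-value domains from a frequency-table CSV.

    Hypothetical input format: a CSV with `field`, `value`, and
    `count` columns. A field whose distinct-value count is at or
    below n_threshold is treated as an enumeration.
    """
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(freq_csv)):
        values[row["field"]].append(row["value"])
    return {
        field: sorted(vals)
        for field, vals in values.items()
        if len(vals) <= n_threshold
    }
```

One caveat: since the frequency table only holds the top N values, the cardinality figure from the stats pass should be the authority on whether a field is fully enumerated, with the frequency table supplying the actual domain values.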

Since we already paid the cost of compiling these statistics while inferring the column data types, we can store them in the data dictionary as "schemata" (a term I coined for schema metadata) for "free", or nearly free: running qsv frequency is not currently done by DP+, but even against a large file like the 1M-row, 500 MB NYC 311 benchmark data, it takes only 2.8 seconds.

We have several options:

  1. Add additional properties to the existing Data Dictionary JSON, which is stored as a table comment.
  2. Keep the existing Data Dictionary JSON as-is, and store RESOURCE_NAME-stats.csv and RESOURCE_NAME-freq.csv as "system resources" that can be downloaded and queried with the CKAN API.
  3. Extend the Data Dictionary with additional properties, and also store the -stats and -freq CSVs as system resources.
  4. Alternatively, instead of using the original Data Dictionary JSON, insert the jsonschema file produced by the qsv schema command. The added benefit is that we can then use the jsonschema file with qsv validate to check whether an external file conforms to the schema. And since qsv validate accepts a jsonschema URL, you can even validate an external file against the CKAN-hosted jsonschema.
  5. Do 4, and add the "system resources" like 3.
  6. Store all this schemata data in a "schemata catalog" in the datastore database, or in a dedicated schemata database, as native PostgreSQL objects. Perhaps by taking a resource's ID and adding a special prefix and/or suffix to it (e.g. RESOURCEID_datadict, RESOURCEID_stats, RESOURCEID_freq). Doing so has the added benefit of making all the data dictionaries queryable: e.g. finding columns with the same name, inferring related resources, suggesting joins, and other "Linked Data"-like queries and inferences, all in a performant manner.
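For option 6, here is a small sketch of the naming convention and the kind of cross-dictionary query it would enable. The table names follow the suffix scheme from the option above; the `schemata_catalog` table and its columns are entirely hypothetical, just to illustrate the "suggest joins" idea:

```python
def schemata_table_names(resource_id: str) -> dict:
    """Derive per-resource schemata table names using the
    RESOURCEID_suffix convention sketched in option 6."""
    return {
        kind: f"{resource_id}_{kind}"
        for kind in ("datadict", "stats", "freq")
    }

# Hypothetical query over a consolidated schemata catalog:
# column names shared by more than one resource could seed
# related-resource inference and join suggestions.
SHARED_COLUMNS_SQL = """
SELECT column_name, array_agg(resource_id) AS resources
FROM schemata_catalog
GROUP BY column_name
HAVING count(DISTINCT resource_id) > 1;
"""
```

Keeping these as native PostgreSQL objects means such queries run inside the datastore, rather than requiring each resource's JSON comment to be fetched and parsed client-side.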

@wardi, since you originally implemented the Data Dictionary, I would be keen to get your opinion.

jqnatividad · Apr 28 '22