summary statistics (mean, max, min, etc.)
As mentioned during the 2024-04-10 task force call, there is interest in providing summary statistics (or more broadly, descriptive statistics) in Croissant format. I'm focusing on summary statistics (mean, max, min, median, mode, standard deviation, etc.) because they are already well defined in the Data Documentation Initiative (DDI) format.
For example, for a dataset with a variable called "stars" that indicates the number of stars on GitHub, the summary statistics in DDI can be represented like this:
<var ID="v30256083" name="stars" intrvl="discrete">
<location fileid="f6867331"/>
<labl level="variable">stars</labl>
<sumStat type="medn">4.0</sumStat>
<sumStat type="mean">38.71014492753635</sumStat>
<sumStat type="mode">.</sumStat>
<sumStat type="vald">138.0</sumStat>
<sumStat type="max">732.0</sumStat>
<sumStat type="invd">0.0</sumStat>
<sumStat type="min">0.0</sumStat>
<sumStat type="stdev">110.13079171235681</sumStat>
<varFormat type="numeric"/>
</var>
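To make the mapping discussion concrete, here is a minimal Python sketch that reads the `sumStat` elements from the snippet above into a plain dictionary. It assumes the `<var>` element has no XML namespace (as in the snippet; a full DDI codebook would normally declare one), and the function name `parse_sum_stats` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The DDI <var> element from the example above, reproduced verbatim.
DDI_VAR = """<var ID="v30256083" name="stars" intrvl="discrete">
<location fileid="f6867331"/>
<labl level="variable">stars</labl>
<sumStat type="medn">4.0</sumStat>
<sumStat type="mean">38.71014492753635</sumStat>
<sumStat type="mode">.</sumStat>
<sumStat type="vald">138.0</sumStat>
<sumStat type="max">732.0</sumStat>
<sumStat type="invd">0.0</sumStat>
<sumStat type="min">0.0</sumStat>
<sumStat type="stdev">110.13079171235681</sumStat>
<varFormat type="numeric"/>
</var>"""


def parse_sum_stats(var_xml):
    """Return (variable name, {sumStat type: float or None}).

    Hypothetical helper: assumes no XML namespace on the elements.
    """
    var = ET.fromstring(var_xml)
    stats = {}
    for el in var.findall("sumStat"):
        text = (el.text or "").strip()
        try:
            stats[el.get("type")] = float(text)
        except ValueError:
            # Non-numeric placeholder, e.g. "." for the mode in this example.
            stats[el.get("type")] = None
    return var.get("name"), stats


name, stats = parse_sum_stats(DDI_VAR)
print(name, stats)
```

The resulting dictionary keeps the DDI type codes (`medn`, `vald`, `invd`, ...) as keys, so whichever vocabulary Croissant ends up referencing, the values can be re-keyed without re-parsing.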
The question is, where can I put summary statistics in Croissant?
Update: This issue seems related (mentions statistics):
- #737
I agree with the use of DDI formats, they seem pretty well thought out and designed. Based on the DDI Controlled Vocabularies, there are a few "Controlled Vocabulary" (CV) definitions that may be useful:
While it would only be referenced, the full vocabulary may be a lot for implementations and tooling to support. Instead of referencing the vocabulary, a partial list of fields could be added to keep things simpler. On the other hand, it's probably better to be specific and thorough to prevent users from creating non-spec'd standards, especially with the plan to add annotations (#737, #739). Non-standard annotations à la XML attributes would be detrimental to tool compatibility.
It's probably easiest for the 1.1 spec to reference the DDI CV version numbers. There are RDF representations available, so it might be valuable to use them to generate spec and lib definitions.
Another option raised during the workgroup meeting is the quality measurement from MLDCAT-AP: https://semiceu.github.io/MLDCAT-AP/releases/2.0.0/
We're currently working on inferring metadata from raw data while populating CKAN catalogs, mapping it to standards like Croissant and DCAT3 leveraging the work done in ckanext-dcat and ckanext-scheming.
https://ckan.org/blog/bridging-ckan-and-machine-learning-introducing-support-for-the-croissant-standard
We're able to do so with a data-wrangling tool we maintain called qsv.
https://github.com/dathere/qsv
It can infer summary stats from a tabular file (CSV, TSV, SSV, Excel, Parquet, etc.) very quickly (e.g. for a 1M-row, 520 MB, 41-column sample of NYC's 311 data, it can infer data types and 43 summary stats in 1.5 seconds).
https://github.com/dathere/qsv/blob/master/scripts/NYC_311_SR_2010-2020-sample-1M.stats.csv
https://github.com/dathere/qsv/wiki/Supplemental#stats-command-output-explanation
Further, for the same file, it can compile a frequency table in 0.9 seconds. https://github.com/dathere/qsv/blob/master/scripts/NYC_311_SR_2010-2020-sample-1M-frequency.csv
For DCAT-US v3, we're implementing an expanded, "validating" data dictionary with summary statistics and frequency table for its describedBy property. For this, we're pointing it to a JSON Schema inferred using the qsv schema command, making use of JSON Schema's additionalProperties to add non-standard properties.
The benefit of using JSON Schema for the Data Dictionary instead of other machine-readable standards like Data Package or CSVW is that it's "machine-executable" as well (i.e. it can be used for validation, taking 3.4 seconds to validate the same 1M-row 311 sample file).
https://github.com/dathere/qsv/blob/master/resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json
It would be great if we could use the same technique to add summary statistics and, optionally, frequency tables to Croissant.
cc @wardi @amercader @samibaig @rzmk @minhajuddin2510
Thanks all for moving this discussion forward!
I don't have a strong feeling about which vocabulary we should use or recommend. I think the spec should allow both the DDI vocabularies and the MLDCAT one.
Regarding the mechanism to add these external properties to Croissant, I am leaning towards something very similar to JSON Schema's additionalProperties... based on the equivalent construct schema.org/additionalProperty. I will send out a proposal shortly on #885 .
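As a rough illustration of what an `additionalProperty`-style mechanism could look like, the Python sketch below attaches the "stars" statistics from the first comment to a field as schema.org `PropertyValue` entries. Everything about the shape is an assumption for discussion: the `ddi:` prefix in `propertyID`, the choice to hang the properties on the field, and the helper name `stats_to_additional_properties` are all hypothetical and not part of the Croissant spec; the proposal on #885 may look different.

```python
import json


def stats_to_additional_properties(stats, vocabulary="ddi"):
    """Turn {stat type: value} into a list of schema.org PropertyValue dicts.

    Hypothetical helper; the "ddi" prefix is a placeholder, not a
    registered JSON-LD context term.
    """
    props = []
    for stat_type, value in stats.items():
        if value is None:
            continue  # skip statistics that are not available
        props.append({
            "@type": "sc:PropertyValue",
            "propertyID": f"{vocabulary}:{stat_type}",
            "name": stat_type,
            "value": value,
        })
    return props


# Values taken from the DDI example at the top of this issue.
stars_stats = {
    "medn": 4.0,
    "mean": 38.71014492753635,
    "mode": None,
    "max": 732.0,
    "min": 0.0,
    "stdev": 110.13079171235681,
}

# Assumed placement: directly on the field that describes the column.
field = {
    "@type": "cr:Field",
    "name": "stars",
    "dataType": "sc:Integer",
    "sc:additionalProperty": stats_to_additional_properties(stars_stats),
}

print(json.dumps(field, indent=2))
```

Using `propertyID` to carry the vocabulary term would let the same structure accept either DDI or MLDCAT-AP identifiers, which fits the idea of allowing both.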