
Sharing distinct schema types where possible

Open · RickMoynihan opened this issue on Dec 1, 2017 · 0 comments

Dimensions and other types aren't shared, leading to a proliferation of schema types, e.g. for gender:

[Screenshot: duplicated per-dataset gender schema types, 2017-12-01]

Whilst for gender it's not a huge problem, for areas/time periods etc. it will lead to lots of highly duplicated data in the schema, i.e. millions of items.

If we could share the "distinct schema types" across datasets, things would be manageable and orders of magnitude smaller.

"distinct schema types" here means "distinct dim/dim-val sets", i.e. I think on scotland there would currently only need to be three distinct gender dimension sets:

  • #{all male female} (e.g. http://statistics.gov.scot/data/reconvictions)
  • #{all male female unknown} (e.g. http://statistics.gov.scot/data/child-benefit)
  • #{male female} (e.g. http://statistics.gov.scot/data/life-expectancy)
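
As a concrete illustration, here is a minimal Clojure sketch (not CubiQL's actual schema-generation code) of what sharing would mean: key the enum type on the distinct value set rather than on the dataset, so the three sets above become three shared types instead of one type per dataset. The dataset keys and enum names are invented for the example.

```clojure
;; Minimal sketch: derive one enum type per *distinct* set of dimension
;; values, then point each dataset at the shared type instead of minting
;; a new enum per dataset.

(def dataset->gender-values
  {:reconvictions   #{:all :male :female}
   :child-benefit   #{:all :male :female :unknown}
   :life-expectancy #{:male :female}})

(defn shared-enum-types
  "Returns [value-set->enum-name dataset->enum-name]. The enum names are
  invented for illustration; a real implementation would need a stable,
  deterministic naming scheme."
  [ds->values]
  (let [distinct-sets (distinct (vals ds->values))
        set->name     (into {}
                            (map-indexed (fn [i s] [s (keyword (str "gender_enum_" i))])
                                         distinct-sets))]
    [set->name
     (into {} (map (fn [[ds vs]] [ds (set->name vs)]) ds->values))]))

;; (shared-enum-types dataset->gender-values)
;; => [{#{:all :male :female}          :gender_enum_0
;;      #{:all :male :female :unknown} :gender_enum_1
;;      #{:male :female}               :gender_enum_2}
;;     {:reconvictions   :gender_enum_0
;;      :child-benefit   :gender_enum_1
;;      :life-expectancy :gender_enum_2}]
```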

It's worth noting that the job of identifying distinct codelists would be easier if we could pass the buck and model the data that way in the first place. For example, Scotland currently duplicates codelists per dataset, e.g. this codelist is unique but could be reused by most of the gender datasets on Scotland (assuming the data management practices did the right thing).

Managing code lists as distinct value sets would also make identifying comparable datasets easier, as they would literally re-use the same URI - but at the expense of extra complexity in handling dataset changes.
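
As a rough sketch of what identifying distinct codelists could look like (again, not existing CubiQL code, and the codelist URIs below are hypothetical): normalise each codelist to its set of members and group codelists whose member sets are identical; each such group could be collapsed into a single shared codelist and URI.

```clojure
;; Sketch of finding duplicate codelists by their member sets, assuming we
;; already have codelist URI -> member values in hand (URIs are hypothetical).
;; Codelists with identical member sets are candidates for sharing.

(def codelist->members
  {"http://statistics.gov.scot/def/code-list/ds1/gender" #{"male" "female" "all"}
   "http://statistics.gov.scot/def/code-list/ds2/gender" #{"male" "female" "all"}
   "http://statistics.gov.scot/def/code-list/ds3/gender" #{"male" "female"}})

(defn duplicate-codelists
  "Groups codelist URIs by member set; any group containing more than one
  URI is a set of interchangeable codelists."
  [cl->members]
  (->> cl->members
       (group-by val)
       (map (fn [[members entries]] [members (mapv key entries)]))
       (filter (fn [[_ uris]] (> (count uris) 1)))))

;; (duplicate-codelists codelist->members)
;; => ([#{"male" "female" "all"}
;;      ["http://statistics.gov.scot/def/code-list/ds1/gender"
;;       "http://statistics.gov.scot/def/code-list/ds2/gender"]])
```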

For areas the savings would obviously be much greater, as on Scotland stats are published with full coverage every time, so it could be quite easily managed. For other areas with a more ad hoc approach to coverage we'd need more intelligence in the data management to avoid duplicated types, though I suspect duplication at the small scale, e.g. within Trafford / GM, is not a problem as there will be so much less data.

If we managed codelists in this way we could solve #55 more easily: having ~7000 datazones within a single enum isn't a major problem, but having 300 datasets × ~7000 values (over 2 million) is.
