Add script to find terms without an example
Pull Request
From DPV 2.3, new concepts should have an example (see 13 Aug 2025 meeting).
This 295_find_terms_without_example.py script will find terms without an example in the examples directory.
Update: The script will also find terms used in an example but undefined in vocab files (a potential typo).
Run without any option at the command line, it shows the number of terms that have an example (or that have a parent with an example):
Namespace                   Class w/ Examples  Prop. w/ Examples
--------------------------  -----------------  -----------------
ai                          4 / 183            1 / 10
dpv                         442 / 962          113 / 144
eu-aiact                    7 / 105            2 / 2
eu-dga                      20 / 62            5 / 5
eu-ehds                     17 / 61            0 / 0
eu-gdpr                     87 / 217           6 / 6
eu-nis2                     0 / 12             0 / 0
eu-rights                   1 / 137            0 / 0
justifications              20 / 66            0 / 0
legal-eu                    8 / 21             0 / 0
loc                         146 / 5270         0 / 4
p7012                       38 / 136           4 / 21
pd                          70 / 221           0 / 0
risk                        62 / 491           13 / 43
sector-education            7 / 49             0 / 0
sector-finance              10 / 43            0 / 0
sector-health               3 / 59             0 / 0
sector-infra                1 / 47             0 / 0
sector-law                  2 / 94             0 / 0
sector-publicservices       3 / 15             0 / 0
tech                        14 / 127           5 / 52
Top parents among classes without examples (excluding 'loc:'):
49 risk:RiskMatrix7x7
25 risk:RiskMatrix5x5
25 risk:ServiceRelatedConsequence
24 justifications:LegalProcessImpaired
20 tech:Actor
17 dpv:CryptographicMethods
16 dpv:SecurityMethod
16 risk:Discrimination
16 dpv:PublicBenefit
16 dpv:DataTransferLegalBasis
Top parents among properties without examples:
18 tech:hasActor
6 risk:controls
5 dpv:hasData
4 ai:hasAI
4 skos:altLabel
3 tech:hasInputData
3 ai:hasData
3 risk:resolves
2 risk:reduces
2 tech:hasInput
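The counts and "top parents" lists above can be reproduced in principle with a small amount of bookkeeping. The following is only a sketch under assumptions, not the code of the script itself: the `parents` mapping, the `used` set, and the function names are hypothetical, the toy hierarchy is made up, and "parent" is taken to mean direct parents only.

```python
from collections import Counter

def is_covered(term, parents, used):
    """A term counts as covered if it, or one of its direct parents, is used in an example."""
    return term in used or any(p in used for p in parents.get(term, ()))

def coverage_by_namespace(parents, used):
    """Return {namespace: (covered, total)} for every term listed in parents."""
    totals, covered = Counter(), Counter()
    for term in parents:
        namespace = term.split(":", 1)[0]
        totals[namespace] += 1
        if is_covered(term, parents, used):
            covered[namespace] += 1
    return {ns: (covered[ns], totals[ns]) for ns in totals}

def top_parents_without_examples(parents, used, n=10):
    """Count how often each parent appears among the uncovered terms."""
    counts = Counter()
    for term, term_parents in parents.items():
        if not is_covered(term, parents, used):
            counts.update(term_parents)
    return counts.most_common(n)

# Toy data (hypothetical, not the real DPV hierarchy).
parents = {
    "dpv:Purpose": [],
    "dpv:Marketing": ["dpv:Purpose"],
    "risk:RiskMatrix5x5": [],
    "risk:CellA": ["risk:RiskMatrix5x5"],
    "risk:CellB": ["risk:RiskMatrix5x5"],
}
used = {"dpv:Purpose"}  # terms found in the example files

coverage = coverage_by_namespace(parents, used)
top = top_parents_without_examples(parents, used)
```

Whether the script walks further up than the direct parents is not stated in this thread; a transitive check would need a loop or recursion over `parents`.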
Run with the -v option, it prints all the terms without an example:
==== Properties without examples ====
ai:hasAISystem ⊂ ai:hasAI
ai:hasCapability ⊂ ai:hasAI
ai:hasData ⊂ dpv:hasData
ai:hasGPAIModel ⊂ ai:hasModel
ai:hasModel ⊂ ai:hasAI
ai:hasTechnique ⊂ ai:hasAI
ai:hasTestingData ⊂ ai:hasData, tech:hasInputData
ai:hasTrainingData ⊂ ai:hasData, tech:hasInputData
ai:hasValidationData ⊂ ai:hasData, tech:hasInputData
dpv:hasConformanceStatus
dpv:hasData
dpv:hasDataSubjectScale ⊂ dpv:hasScale
...
Not exactly useful yet, since it can't distinguish the new terms (from one version to another). Eventually, once we have "sinceVersion" information (see #359), we may be able to show only new terms without an example.
This is cool, I'll run it on 2.2 later. I think this is taking all the RDF outputs and checking which terms occur (anywhere) in examples?
We also have the Examples CSV/RDF, which contains dct:subject for what the example is about, e.g. https://github.com/w3c/dpv/blob/5f2f7e9aaf06c602b43d53cff583d2ad456c36a6/2.1/examples/dex.ttl#L51. It's not enough that the term is mentioned in the example, because the description for the example will be explaining something about the concept as well. So would it be easier to maintain if we check the dct:subject of examples and then ensure that there is an example for the concept (ideal) or its parent (rdf:type)?
We can use #12 for the general discussion on what to use for example / use-case and how to script tests around it.
This is cool, I'll run it on 2.2 later. I think this is taking all the RDF outputs and checking which terms occur (anywhere) in examples?
Yes. Since the TTLs in examples/ directory are not in full form, I just match terms with a regular expression (without actual Turtle/RDF parsing). Will put this in code comment.
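For illustration, regex matching of this kind could look roughly like the sketch below. It is not the code of the script: the pattern, the function names, and the sample example texts are assumptions. It treats anything shaped like `prefix:local` as a term, without any Turtle/RDF parsing.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a prefixed name such as dpv:Purpose.
TERM_RE = re.compile(r"\b([A-Za-z][\w-]*):([A-Za-z][\w-]*)")

def find_terms(text):
    """Return the set of prefix:local names that appear anywhere in text."""
    return {f"{prefix}:{local}" for prefix, local in TERM_RE.findall(text)}

def index_examples(examples):
    """Map each term to the sorted list of example files that mention it."""
    index = defaultdict(set)
    for filename, text in examples.items():
        for term in find_terms(text):
            index[term].add(filename)
    return {term: sorted(files) for term, files in index.items()}

# Made-up example fragments, keyed by a made-up file name.
examples = {
    "E0001.ttl": "ex:Process a dpv:PersonalDataHandling ;\n    dpv:hasPurpose dpv:Marketing .",
    "E0002.ttl": "ex:Policy dpv:hasPurpose dpv:Marketing .",
}
index = index_examples(examples)
```

Because nothing is parsed, text inside string literals and IRIs is matched as well, so a sketch like this will report some false positives.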
I will check the dex. I have looked at it before but only saw a description and a link to an actual TTL example file, so I used the TTLs in /examples/ (at root) instead. Will look at the code again.
What do you mean by full form? They should be valid as Turtle, except for the namespaces, which are taken from the CSVs.
Sorry. I should use another word.
It is valid, but since we don't declare ex namespace anywhere in the TTL, rdflib can't parse it. I got an error at that point.
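One possible workaround, sketched here under assumptions, is to prepend the missing `@prefix` declarations before handing a fragment to a parser. The `NAMESPACES` table, the `ex` IRI, and the function name are hypothetical; in the repository the namespaces come from the CSVs.

```python
import re

# Hypothetical namespace table for illustration only.
NAMESPACES = {
    "dpv": "https://w3id.org/dpv#",
    "ex": "https://example.com/",
}

PREFIX_USE_RE = re.compile(r"(?<![\w:-])([A-Za-z][\w-]*):")
DECLARED_RE = re.compile(r"^\s*@prefix\s+([A-Za-z][\w-]*):", re.MULTILINE)

def with_prefix_header(fragment, namespaces):
    """Prepend @prefix lines for known prefixes that are used but not declared."""
    declared = set(DECLARED_RE.findall(fragment))
    used = set(PREFIX_USE_RE.findall(fragment))
    # Unknown "prefixes" (e.g. the "https" of an IRI) are simply skipped.
    missing = sorted(p for p in used - declared if p in namespaces)
    header = "".join(f"@prefix {p}: <{namespaces[p]}> .\n" for p in missing)
    return header + fragment

fragment = "ex:Process a dpv:PersonalDataHandling ."
full = with_prefix_header(fragment, NAMESPACES)
```

The resulting text could then be passed to a Turtle parser, for example rdflib's `Graph().parse(data=full, format="turtle")`.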
I see. It's possible to make them fully conformant turtle files. My worry was that this might take up too much space in the html, but I can truncate the namespaces there via code. Please open an issue for this and I'll implement it later. Though do we need this for v2.2 or can it be done for v2.3? I prefer later as this might break stuff.
I prefer 2.3. No rush since it will take more time to actually have more examples anyway.
The code is updated to cover a case like in #371.
This is what it reports from 2.2 draft:
==== Terms used in examples but NOT defined in vocabulary files ====
2021-05-28T12:24 in: E0023.ttl
2022-09-06T15:36 in: E0016.ttl
dpv-gdpr:SCCsByCommission in: E0025.ttl
dpv-juris:Ireland in: E0019.ttl
dpv:CompanyA in: E0035.ttl
dpv:CompanyB in: E0035.ttl
dpv:ContractAccepted in: E0077.ttl
dpv:ContractOfferReceived in: E0077.ttl
dpv:ContractPartiallyAccepted in: E0077.ttl
dpv:ContractUnfulfilled in: E0077.ttl
dpv:Controller in: E0032.ttl
dpv:Email in: E0015.ttl
dpv:FraudPreventionDetection in: E0031.ttl, E0034.ttl, E0041.ttl, E0065.ttl
dpv:Harm in: E0027.ttl
dpv:IE in: E0049.ttl
dpv:Identifier in: E0044.ttl
dpv:Incident in: E0069.ttl
dpv:JointControllerAgreement in: E0034.ttl
dpv:LossOfData in: E0068.ttl, E0069.ttl
dpv:SomeContract in: E0078.ttl
dpv:Subsidiary in: E0038.ttl
dpv:SystemicMonitoring in: E0013.ttl
dpv:TransferStatistics in: E0024.ttl
dpv:hasConsequenceOfFailure in: E0061.ttl
dpv:hasProcessingContext in: E0013.ttl
dpv:hasStorageLocation in: E0072.ttl
dpv:hasThirdPartyRecipient in: E0034.ttl
dpv:isImplementedByUsingTechnology in: E0060.ttl, E0064.ttl
exA:TVServiceOptimisation in: E0004.ttl
exB:TVServiceOptimisation in: E0004.ttl
exB:TVSignalOptimisation in: E0004.ttl
is:dpv in: E0004.ttl
iso:IE in: E0019.ttl
legal-eu:GDPR in: E0036.ttl, E0055.ttl, E0067.ttl
loc:USA in: E0060.ttl
nace:M72 in: E0008.ttl
new-profile:Anonymise in: E0030.ttl
new-profile:Use in: E0030.ttl
pd:Email in: E0006.ttl, E0022.ttl, E0023.ttl, E0026.ttl, E0072.ttl
policy:1 in: E0030.ttl
risk:DataBreachReport in: E0063.ttl
risk:MisuseBreachedInformation in: E0068.ttl, E0069.ttl, E0071.ttl
risk:halts in: E0086.ttl
Note that not all of them are actually undefined. A few may be just typos, some may be intentional, and some simply come from another vocabulary (e.g., odrl). For example:
- The first two in the list are legit literals (datetime)
- is:dpv in E0004.ttl, from text "the common ancestor is:dpv:OptimisationForConsumer" (a space is needed after is:dpv)
- iso:IE in E0019.ttl, this looks intentional
- exA/B:TV* in E0004.ttl, also looks intentional.
Thanks @bact -- looking super helpful. For undefined terms in examples https://github.com/w3c/dpv/pull/358#issuecomment-3201352113 -- these should be fixed, yes? Same issue as #371?
I think so. Mind that some of them are false positives, due to limitations of regex matching. For example, "policy:1" is from odrl:uid <https://example.com/policy:1> which I think should be fine.
Thanks, I'll change what I can find from these. There is code somewhere in the existing setup that distinguishes between "DPV concepts" and others based on namespaces in order to generate RDF or HTML. Later when looking at this, I'll see how to use that here as then we won't have to rely on regex and broken/invalid RDF will also be flagged automatically.
I can take some of these. I have already looked at legal-eu:GDPR, pd:Email, and a few more in PR #373. The rest look more complicated.
Thanks @bact -- added the changes in https://github.com/w3c/dpv/commit/22222befed48eb5cd9f8df60eef795cbc366a27c Can you please mark the unresolved ones, or ideally move them to #372 and I'll take a look.
I will move the unresolved ones to #372 to better track that.
Update: Done. All remaining undefined terms are here: https://github.com/w3c/dpv/issues/372#issuecomment-3205734700
Tested this for fixing typos, brilliant stuff @bact - very helpful! Some minor issues which are easily recognised and ignored:
2021-05-28T12:24 in: E0023.ttl <--- ignore strings i.e. starting with ""
2022-09-06T15:36 in: E0016.ttl <--- ignore strings i.e. starting with ""
dpv:DataLoss in: E0068.ttl, E0069.ttl
eu-gdpr:A6 in: E0072.ttl <--- terms can contain dashes e.g. A6-xyz
eu-gdpr:A7 in: E0072.ttl <--- terms can contain dashes e.g. A7-xyz
exA:TVServiceOptimisation in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVServiceOptimisation in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVSignalOptimisation in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
legal-eu:law in: E0036.ttl, E0055.ttl, E0067.ttl, E0072.ttl, E0076.ttl <--- terms can contain dashes
legal-ie:DPA in: E0036.ttl <--- terms can contain dashes
legal-ie:law in: E0036.ttl <--- terms can contain dashes
nace:72 in: E0008.ttl <--- to add nace prefix in our list
policy:1 in: E0030.ttl <--- syntax error in IRI, fixed
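The adjustments suggested in this list could be sketched roughly as follows. This is an illustration under assumptions, not the regex used in the script: the patterns, the function name, and the sample text are made up. It drops IRIs and string literals before matching, allows dashes in prefixes and local names, and skips example prefixes of the form ex, exA, ... exF.

```python
import re

IRI_RE = re.compile(r"<[^<>\s]*>")               # e.g. <https://example.com/policy:1>
STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"')     # e.g. "2021-05-28T12:24"
# Prefix and local name may contain dashes; local names may start with a digit.
TERM_RE = re.compile(r"(?<![\w:-])([A-Za-z][\w-]*):([A-Za-z0-9][\w-]*)")
EXAMPLE_PREFIX_RE = re.compile(r"ex[A-F]?$")     # ex, exA ... exF

def find_candidate_terms(text):
    """Return prefix:local names, ignoring IRIs, string literals and example prefixes."""
    text = IRI_RE.sub(" ", text)
    text = STRING_RE.sub(" ", text)
    return {
        f"{prefix}:{local}"
        for prefix, local in TERM_RE.findall(text)
        if not EXAMPLE_PREFIX_RE.match(prefix)
    }

# Made-up fragment that combines the cases from the list above.
sample = '''
exA:TVServiceOptimisation a dpv:Purpose .
ex:Policy odrl:uid <https://example.com/policy:1> ;
    dct:created "2021-05-28T12:24"^^xsd:dateTime ;
    dpv:hasLegalBasis eu-gdpr:A6-1-a ;
    dct:subject legal-ie:DPA, nace:72 .
'''
terms = find_candidate_terms(sample)
```

Triple-quoted Turtle strings that contain quote characters are not handled by `STRING_RE`, so this remains a heuristic rather than a replacement for real parsing.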
The numbering of the script (2xx) is a bit confusing though since there is one part (the check of possible undefined terms in HTML) that should be run after the HTML generation script (300).
Can think more about this for 2.3.
@bact How about -- we make all 4xx numbered scripts be for testing -- so any tests go there, including the current 290 for SHACL and future ones like a fork of OOPS!/FOOPS! specifically for DPV that I'm planning to write. The current logic is that 1xx is data retrieval from GSheets, 2xx is RDF output, and 3xx is HTML output, then 9xx is for releases. In the future, we will need further numbers for tools/implementations e.g. if we host something interactive or provide a library or something -- these can take up 5xx -- 8xx.
What about
- 1xx data retrieval from source
- 2xx RDF generation (for machine)
- 3xx HTML generation (for human)
- 4xx Test
- 41x Test retrieved data
- 42x Test generated RDF
- 43x Test generated HTML
But this will make the possible numbers for each category limited to only 9 and difficult to allocate in a forward-compatible way (keep same numbers in the future).
Isn't the issue here that the script tests both RDF and HTML -- and then whether it should be 42x or 43x? I think the 4xx should not follow the numbering e.g. what if we have more than 10 RDF test files. There very likely won't be a lot of code here and we can put the workflow in the README or wiki if it isn't simple/intuitive. So the primary focus should be of usability for people working with/on it (e.g. you and me).
You're right. Keep it 4xx simple. Thanks.
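As a quick reference, the numbering convention agreed on above can be written down as a small lookup. This is only a sketch: the `STAGES` table and `stage_of` function are hypothetical helpers, and the second file name in the usage line is made up.

```python
import re

# First digit of the script number -> stage, as described in this thread.
STAGES = {
    "1": "data retrieval",
    "2": "RDF generation",
    "3": "HTML generation",
    "4": "tests",
    "9": "releases",
}

def stage_of(script_name):
    """Return the stage for a script named like '295_find_terms_without_example.py'."""
    match = re.match(r"(\d)\d\d_", script_name)
    if not match:
        return "unnumbered"
    # 5xx to 8xx are reserved for future tools/implementations.
    return STAGES.get(match.group(1), "unassigned")

current = stage_of("295_find_terms_without_example.py")
proposed = stage_of("400_hypothetical_test.py")  # made-up file name
```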
Fixed the regex as suggested.
Run against the latest code in main repo:
Terms used in examples but NOT defined in vocabulary files (2)
----------------------------------------------------------
dpv:DataLoss in: E0068.ttl, E0069.ttl
nace:72 in: E0008.ttl
dpv:DataLoss should be risk:DataLoss; I will fix this in another PR (PR #383).