Add script to find terms without an example

Open bact opened this issue 3 months ago • 21 comments

Pull Request

From DPV 2.3, new concepts should have an example (see 13 Aug 2025 meeting).

This 295_find_terms_without_example.py script will find terms without an example in the examples directory.

Update: The script will also find terms used in an example but undefined in vocab files (a potential typo).

When run without options on the command line, it shows the number of terms that have an example (or that have a parent with an example):

Namespace                  Class w/ Examples Prop. w/ Examples
-------------------------- ----------------- -----------------
ai                               4 / 183           1 / 10     
dpv                            442 / 962         113 / 144    
eu-aiact                         7 / 105           2 / 2      
eu-dga                          20 / 62            5 / 5      
eu-ehds                         17 / 61            0 / 0      
eu-gdpr                         87 / 217           6 / 6      
eu-nis2                          0 / 12            0 / 0      
eu-rights                        1 / 137           0 / 0      
justifications                  20 / 66            0 / 0      
legal-eu                         8 / 21            0 / 0      
loc                            146 / 5270          0 / 4      
p7012                           38 / 136           4 / 21     
pd                              70 / 221           0 / 0      
risk                            62 / 491          13 / 43     
sector-education                 7 / 49            0 / 0      
sector-finance                  10 / 43            0 / 0      
sector-health                    3 / 59            0 / 0      
sector-infra                     1 / 47            0 / 0      
sector-law                       2 / 94            0 / 0      
sector-publicservices            3 / 15            0 / 0      
tech                            14 / 127           5 / 52     

Top parents among classes without examples (excluding 'loc:'):
     49  risk:RiskMatrix7x7
     25  risk:RiskMatrix5x5
     25  risk:ServiceRelatedConsequence
     24  justifications:LegalProcessImpaired
     20  tech:Actor
     17  dpv:CryptographicMethods
     16  dpv:SecurityMethod
     16  risk:Discrimination
     16  dpv:PublicBenefit
     16  dpv:DataTransferLegalBasis

Top parents among properties without examples:
     18  tech:hasActor
      6  risk:controls
      5  dpv:hasData
      4  ai:hasAI
      4  skos:altLabel
      3  tech:hasInputData
      3  ai:hasData
      3  risk:resolves
      2  risk:reduces
      2  tech:hasInput  
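
The "top parents" tally above could be computed roughly as follows. This is a minimal sketch, not the script's actual code: the `parents` mapping and the `with_example` set are made-up inputs, and a term is treated as covered when it, or one of its direct parents, has an example.

```python
from collections import Counter

# Hypothetical inputs: direct parents per term, and terms seen in examples.
parents = {
    "ai:hasAISystem": ["ai:hasAI"],
    "ai:hasModel": ["ai:hasAI"],
    "ai:hasGPAIModel": ["ai:hasModel"],
    "ai:hasTestingData": ["ai:hasData", "tech:hasInputData"],
    "dpv:hasDataSubjectScale": ["dpv:hasScale"],
}
with_example = {"dpv:hasScale"}


def is_covered(term, parents, with_example):
    """True if the term, or one of its direct parents, has an example."""
    if term in with_example:
        return True
    return any(p in with_example for p in parents.get(term, []))


def top_parents(parents, with_example, exclude_prefixes=("loc:",)):
    """Count how often each parent appears among terms without coverage."""
    counts = Counter()
    for term, term_parents in parents.items():
        if term.startswith(exclude_prefixes):
            continue
        if is_covered(term, parents, with_example):
            continue
        counts.update(term_parents)
    return counts


for parent, n in top_parents(parents, with_example).most_common(3):
    print(f"{n:7d}  {parent}")
```

A term with two parents contributes to both counts, which matches how `ai:hasTestingData` is listed under both `ai:hasData` and `tech:hasInputData` in the verbose output.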

When run with the -v option, it prints all the terms without an example:

==== Properties without examples ====

ai:hasAISystem                           ⊂ ai:hasAI
ai:hasCapability                         ⊂ ai:hasAI
ai:hasData                               ⊂ dpv:hasData
ai:hasGPAIModel                          ⊂ ai:hasModel
ai:hasModel                              ⊂ ai:hasAI
ai:hasTechnique                          ⊂ ai:hasAI
ai:hasTestingData                        ⊂ ai:hasData, tech:hasInputData
ai:hasTrainingData                       ⊂ ai:hasData, tech:hasInputData
ai:hasValidationData                     ⊂ ai:hasData, tech:hasInputData
dpv:hasConformanceStatus
dpv:hasData
dpv:hasDataSubjectScale                  ⊂ dpv:hasScale
...

Not exactly useful yet since it can't distinguish new terms (from one version to another). Eventually, once we have "sinceVersion" information (see #359), we may be able to show only new terms without an example.

bact avatar Aug 14 '25 08:08 bact

This is cool, I'll run it on 2.2 later. I think this is taking all the RDF outputs and checking which terms occur (anywhere) in examples?

We also have the Examples CSV/RDF which contains dct:subject for what the example is about, e.g. https://github.com/w3c/dpv/blob/5f2f7e9aaf06c602b43d53cff583d2ad456c36a6/2.1/examples/dex.ttl#L51. It's not enough that the term is mentioned in the example, because the description for the example will be explaining something about the concept as well. So would it be easier to maintain if we check the dct:subject of examples and then ensure that there is an example for the concept (ideal) or its parent (rdf:type)?
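
A rough sketch of the dct:subject check suggested here, assuming the subjects and parents have already been read into plain dictionaries. The example IDs and the small concept table are illustrative only, and loading them from dex.ttl or the CSVs is left out.

```python
# Hypothetical data: dct:subject values per example, and direct parents per concept.
example_subjects = {
    "E0001": {"dpv:Purpose"},
    "E0002": {"dpv:hasData"},
}
concept_parents = {
    "dpv:Purpose": [],
    "dpv:Marketing": ["dpv:Purpose"],
    "dpv:hasData": [],
    "dpv:Technology": [],
}


def coverage(concept_parents, example_subjects):
    """Classify each concept as 'direct', 'parent', or 'none'."""
    # All concepts that are the declared subject of at least one example.
    subjects = set().union(*example_subjects.values())
    result = {}
    for concept, parents in concept_parents.items():
        if concept in subjects:
            result[concept] = "direct"   # ideal: an example is about the concept
        elif any(p in subjects for p in parents):
            result[concept] = "parent"   # fallback: an example is about a parent
        else:
            result[concept] = "none"
    return result


for concept, status in coverage(concept_parents, example_subjects).items():
    print(f"{concept:<20} {status}")
```

Keeping "direct" and "parent" apart makes it possible to report the ideal case separately from the fallback.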

coolharsh55 avatar Aug 14 '25 08:08 coolharsh55

We can use #12 for the general discussion on what to use for example / use-case and how to script tests around it.

coolharsh55 avatar Aug 14 '25 08:08 coolharsh55

This is cool, I'll run it on 2.2 later. I think this is taking all the RDF outputs and checking which terms occur (anywhere) in examples?

Yes. Since the TTLs in the examples/ directory are not in full form, I just match terms with a regular expression (without actual Turtle/RDF parsing). Will put this in a code comment.
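
For illustration, regex-based matching of this kind might look like the sketch below. The pattern, the in-memory `examples` dictionary, and the `defined` set are assumptions made for the example, not the script's real code or data.

```python
import re
from collections import defaultdict

# Naive prefixed-name pattern: prefix, colon, local name. Not a Turtle parser.
TERM_RE = re.compile(r"\b([A-Za-z][\w-]*):([A-Za-z]\w*)")


def terms_by_file(examples):
    """Map each matched 'prefix:name' to the sorted example files it occurs in."""
    found = defaultdict(set)
    for filename, text in examples.items():
        for prefix, name in TERM_RE.findall(text):
            found[f"{prefix}:{name}"].add(filename)
    return {term: sorted(files) for term, files in found.items()}


# Hypothetical example text and vocabulary; real input would be read from disk.
# 'dpv:hasPurpos' is a deliberate typo for 'dpv:hasPurpose'.
examples = {
    "E0001.ttl": "ex:Policy a dpv:Process ; dpv:hasPurpos dpv:Marketing .",
}
defined = {"dpv:Process", "dpv:hasPurpose", "dpv:Marketing"}

# Terms that appear in examples but not in the vocabulary (ignoring ex:).
undefined = {
    term: files
    for term, files in terms_by_file(examples).items()
    if term not in defined and not term.startswith("ex:")
}
for term, files in sorted(undefined.items()):
    print(f"{term:<40} in: {', '.join(files)}")
```

Because nothing is parsed, text inside string literals and IRIs is matched as well, so a list produced this way needs a manual look.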

I will check the dex. I have looked at it before but only saw a description and a link to an actual TTL example file, so I used the TTLs in /examples/ (at root) instead. Will look at the code again.

bact avatar Aug 14 '25 10:08 bact

What do you mean by full form? They should be valid as Turtle, except for the namespaces, which are taken from the CSVs.

coolharsh55 avatar Aug 14 '25 10:08 coolharsh55

What do you mean by full form? They should be valid as Turtle, except for the namespaces, which are taken from the CSVs.

Sorry, I should have used another word. It is valid, but since we don't declare the ex namespace anywhere in the TTL, rdflib can't parse it. I got an error at that point.

bact avatar Aug 14 '25 12:08 bact

I see. It's possible to make them fully conformant turtle files. My worry was that this might take up too much space in the html, but I can truncate the namespaces there via code. Please open an issue for this and I'll implement it later. Though do we need this for v2.2 or can it be done for v2.3? I prefer later as this might break stuff.
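
One way to make an example file self-contained is to prepend the missing `@prefix` lines before handing it to a parser. The sketch below assumes a hypothetical `NAMESPACES` table (in the repository the namespaces come from the CSVs; the `ex` IRI here is invented) and uses a simple regex to spot used prefixes, so it can be fooled by colons inside string literals.

```python
import re

# Hypothetical prefix table; the ex: IRI is made up for this sketch.
NAMESPACES = {
    "dpv": "https://w3id.org/dpv#",
    "ex": "https://example.com/",
}

DECLARED_RE = re.compile(r"@prefix\s+([\w-]*):")
# A prefix in use: not preceded by a word/IRI character, followed by a letter.
USED_RE = re.compile(r"(?<![\w<:/-])([A-Za-z][\w-]*):[A-Za-z]")


def add_missing_prefixes(ttl, namespaces):
    """Prepend @prefix lines for known prefixes that are used but not declared."""
    declared = set(DECLARED_RE.findall(ttl))
    used = set(USED_RE.findall(ttl))
    missing = sorted((used - declared) & set(namespaces))
    header = "".join(f"@prefix {p}: <{namespaces[p]}> .\n" for p in missing)
    return header + ttl


snippet = "ex:PolicyA a dpv:Process .\n"
print(add_missing_prefixes(snippet, NAMESPACES))
```

The function leaves a file that already declares its prefixes unchanged, so it can be applied to every example regardless of how complete it is. Truncating the declarations for the HTML view would be a separate step.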

coolharsh55 avatar Aug 14 '25 12:08 coolharsh55

I prefer 2.3. No rush since it will take more time to actually have more examples anyway.

bact avatar Aug 14 '25 12:08 bact

The code is updated to cover a case like in #371.

This is what it reports from 2.2 draft:

==== Terms used in examples but NOT defined in vocabulary files ====
2021-05-28T12:24                         in: E0023.ttl
2022-09-06T15:36                         in: E0016.ttl
dpv-gdpr:SCCsByCommission                in: E0025.ttl
dpv-juris:Ireland                        in: E0019.ttl
dpv:CompanyA                             in: E0035.ttl
dpv:CompanyB                             in: E0035.ttl
dpv:ContractAccepted                     in: E0077.ttl
dpv:ContractOfferReceived                in: E0077.ttl
dpv:ContractPartiallyAccepted            in: E0077.ttl
dpv:ContractUnfulfilled                  in: E0077.ttl
dpv:Controller                           in: E0032.ttl
dpv:Email                                in: E0015.ttl
dpv:FraudPreventionDetection             in: E0031.ttl, E0034.ttl, E0041.ttl, E0065.ttl
dpv:Harm                                 in: E0027.ttl
dpv:IE                                   in: E0049.ttl
dpv:Identifier                           in: E0044.ttl
dpv:Incident                             in: E0069.ttl
dpv:JointControllerAgreement             in: E0034.ttl
dpv:LossOfData                           in: E0068.ttl, E0069.ttl
dpv:SomeContract                         in: E0078.ttl
dpv:Subsidiary                           in: E0038.ttl
dpv:SystemicMonitoring                   in: E0013.ttl
dpv:TransferStatistics                   in: E0024.ttl
dpv:hasConsequenceOfFailure              in: E0061.ttl
dpv:hasProcessingContext                 in: E0013.ttl
dpv:hasStorageLocation                   in: E0072.ttl
dpv:hasThirdPartyRecipient               in: E0034.ttl
dpv:isImplementedByUsingTechnology       in: E0060.ttl, E0064.ttl
exA:TVServiceOptimisation                in: E0004.ttl
exB:TVServiceOptimisation                in: E0004.ttl
exB:TVSignalOptimisation                 in: E0004.ttl
is:dpv                                   in: E0004.ttl
iso:IE                                   in: E0019.ttl
legal-eu:GDPR                            in: E0036.ttl, E0055.ttl, E0067.ttl
loc:USA                                  in: E0060.ttl
nace:M72                                 in: E0008.ttl
new-profile:Anonymise                    in: E0030.ttl
new-profile:Use                          in: E0030.ttl
pd:Email                                 in: E0006.ttl, E0022.ttl, E0023.ttl, E0026.ttl, E0072.ttl
policy:1                                 in: E0030.ttl
risk:DataBreachReport                    in: E0063.ttl
risk:MisuseBreachedInformation           in: E0068.ttl, E0069.ttl, E0071.ttl
risk:halts                               in: E0086.ttl

Note that not all of them are actually undefined. A few may be just typos, some may be intentional, and some simply come from another vocabulary (e.g., odrl). For example:

  • The first two in the list are legit literals (datetime)
  • is:dpv in E0004.ttl, from text "the common ancestor is:dpv:OptimisationForConsumer" (a space is needed after is:dpv)
  • iso:IE in: E0019.ttl, this looks intentional
  • exA/B:TV* in E0004.ttl, also looks intentional.

bact avatar Aug 19 '25 16:08 bact

Thanks @bact -- looking super helpful. For undefined terms in examples https://github.com/w3c/dpv/pull/358#issuecomment-3201352113 -- these should be fixed, yes? Same issue as #371?

coolharsh55 avatar Aug 20 '25 05:08 coolharsh55

I think so. Mind that some of them are false positives, due to limitations of regex matching. For example, "policy:1" is from odrl:uid <https://example.com/policy:1> which I think should be fine.
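
The policy:1 false positive can be reproduced, and avoided, with a small sketch like this. Both patterns are assumptions for illustration rather than the script's actual regex: the idea is simply to blank out `<...>` IRIs before looking for prefixed names.

```python
import re

TERM_RE = re.compile(r"\b([A-Za-z][\w-]*):(\w+)")
# An IRI reference in angle brackets, without spaces or nested brackets.
IRI_RE = re.compile(r"<[^<>\s]*>")


def find_terms(text, skip_iris=True):
    """Return prefixed names found in text, optionally ignoring <...> IRIs."""
    if skip_iris:
        text = IRI_RE.sub(" ", text)
    return {f"{prefix}:{name}" for prefix, name in TERM_RE.findall(text)}


line = "ex:Policy odrl:uid <https://example.com/policy:1> ."

print(sorted(find_terms(line, skip_iris=False)))  # includes 'policy:1'
print(sorted(find_terms(line)))                   # IRI content is ignored
```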

bact avatar Aug 20 '25 07:08 bact

Thanks, I'll change what I can find from these. There is code somewhere in the existing setup that distinguishes between "DPV concepts" and others based on namespaces in order to generate RDF or HTML. Later when looking at this, I'll see how to use that here as then we won't have to rely on regex and broken/invalid RDF will also be flagged automatically.

coolharsh55 avatar Aug 20 '25 07:08 coolharsh55

I can take some of these. I have already looked at legal-eu:GDPR, pd:Email, and a few more in PR #373. The rest look more complicated.

bact avatar Aug 20 '25 07:08 bact

Thanks @bact -- added the changes in https://github.com/w3c/dpv/commit/22222befed48eb5cd9f8df60eef795cbc366a27c Can you please mark the unresolved ones, or ideally move them to #372 and I'll take a look.

coolharsh55 avatar Aug 20 '25 09:08 coolharsh55

I will move the unresolved ones to #372 to better track that.

Update: Done. All remaining undefined terms are here: https://github.com/w3c/dpv/issues/372#issuecomment-3205734700

bact avatar Aug 20 '25 11:08 bact

Tested this for fixing typos, brilliant stuff @bact - very helpful! Some minor issues which are easily recognised and ignored:

2021-05-28T12:24                         in: E0023.ttl <--- ignore strings i.e. starting with ""
2022-09-06T15:36                         in: E0016.ttl <--- ignore strings i.e. starting with ""
dpv:DataLoss                             in: E0068.ttl, E0069.ttl
eu-gdpr:A6                               in: E0072.ttl <--- terms can contain dashes e.g. A6-xyz
eu-gdpr:A7                               in: E0072.ttl <--- terms can contain dashes e.g. A7-xyz
exA:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVSignalOptimisation                 in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
legal-eu:law                             in: E0036.ttl, E0055.ttl, E0067.ttl, E0072.ttl, E0076.ttl <--- terms can contain dashes
legal-ie:DPA                             in: E0036.ttl <--- terms can contain dashes
legal-ie:law                             in: E0036.ttl <--- terms can contain dashes
nace:72                                  in: E0008.ttl <--- to add nace prefix in our list
policy:1                                 in: E0030.ttl <--- syntax error in IRI, fixed
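
Three of the notes above (skip string literals, allow dashes in terms, treat ex, exA ... exF as example prefixes) could be folded into the matching step along these lines. This is a sketch under assumptions: the patterns are not the script's own, only double-quoted literals are handled, and the sample line is invented.

```python
import re

# Double-quoted literals (triple-quoted first), removed before term matching.
STRING_RE = re.compile(r'"""[\s\S]*?"""|"(?:[^"\\\n]|\\.)*"')
# Prefix and local name may both contain dashes, e.g. eu-gdpr:A6-1-a.
TERM_RE = re.compile(r"(?<![\w:/-])([A-Za-z][\w-]*):([A-Za-z0-9][\w-]*)")
# Example prefixes: ex, exA ... exF.
EXAMPLE_PREFIX_RE = re.compile(r"ex[A-F]?")


def candidate_terms(text):
    """Return prefixed names outside string literals, skipping example prefixes."""
    text = STRING_RE.sub('""', text)
    terms = set()
    for prefix, name in TERM_RE.findall(text):
        if EXAMPLE_PREFIX_RE.fullmatch(prefix):
            continue
        terms.add(f"{prefix}:{name}")
    return terms


sample = (
    'exA:TVServiceOptimisation dpv:hasLegalBasis eu-gdpr:A6-1-a ; '
    'dct:created "2021-05-28T12:24:00"^^xsd:dateTime .'
)
print(sorted(candidate_terms(sample)))
```

Prefixes from other vocabularies, such as nace, would still need to be added to whatever list of known prefixes the check uses.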

coolharsh55 avatar Aug 24 '25 19:08 coolharsh55

The numbering of the script (2xx) is a bit confusing though since there is one part (the check of possible undefined terms in HTML) that should be run after the HTML generation script (300).

Can think more about this for 2.3.

bact avatar Aug 24 '25 20:08 bact

@bact How about -- we make all 4xx numbered scripts be for testing -- so any tests go there, including the current 290 for SHACL and future ones like a fork of OOPS!/FOOPS! specifically for DPV that I'm planning to write. The current logic is that 1xx is data retrieval from GSheets, 2xx is RDF output, and 3xx is HTML output, then 9xx is for releases. In the future, we will need further numbers for tools/implementations e.g. if we host something interactive or provide a library or something -- these can take up 5xx -- 8xx.
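
The numbering described here could be written down as a small lookup, for instance to sanity-check script names. This is only a sketch of the proposal: `STAGES` and `stage_of` are hypothetical names, and the 5xx to 8xx range is shown as reserved because its use is still open.

```python
# Ranges as proposed in this comment; 5xx-8xx are only tentatively reserved.
STAGES = [
    (100, 199, "data retrieval"),
    (200, 299, "RDF output"),
    (300, 399, "HTML output"),
    (400, 499, "tests"),
    (500, 899, "tools/implementations (reserved)"),
    (900, 999, "releases"),
]


def stage_of(script_name):
    """Return the stage for a script named like '295_find_terms_without_example.py'."""
    number = int(script_name.split("_", 1)[0])
    for low, high, stage in STAGES:
        if low <= number <= high:
            return stage
    raise ValueError(f"no stage defined for {script_name!r}")


print(stage_of("295_find_terms_without_example.py"))
```

Under this scheme the script from this PR, currently numbered 295, falls in the RDF output range and would move into the 4xx range.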

coolharsh55 avatar Aug 24 '25 20:08 coolharsh55

What about

  • 1xx data retrieval from source
  • 2xx RDF generation (for machine)
  • 3xx HTML generation (for human)
  • 4xx Test
    • 41x Test retrieved data
    • 42x Test generated RDF
    • 43x Test generated HTML

But this will make the possible numbers for each category limited to only 9 and difficult to allocate in a forward-compatible way (keep same numbers in the future).

bact avatar Aug 24 '25 20:08 bact

Isn't the issue here that the script tests both RDF and HTML -- and then whether it should be 42x or 43x? I think 4xx should not mirror that numbering, e.g. what if we have more than 10 RDF test files? There very likely won't be a lot of code here and we can put the workflow in the README or wiki if it isn't simple/intuitive. So the primary focus should be on usability for people working with/on it (e.g. you and me).

coolharsh55 avatar Aug 24 '25 20:08 coolharsh55

You're right. Keep it 4xx simple. Thanks.

bact avatar Aug 25 '25 11:08 bact

Tested this for fixing typos, brilliant stuff @bact - very helpful! Some minor issues which are easily recognised and ignored:

2021-05-28T12:24                         in: E0023.ttl <--- ignore strings i.e. starting with ""
2022-09-06T15:36                         in: E0016.ttl <--- ignore strings i.e. starting with ""
dpv:DataLoss                             in: E0068.ttl, E0069.ttl
eu-gdpr:A6                               in: E0072.ttl <--- terms can contain dashes e.g. A6-xyz
eu-gdpr:A7                               in: E0072.ttl <--- terms can contain dashes e.g. A7-xyz
exA:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVSignalOptimisation                 in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
legal-eu:law                             in: E0036.ttl, E0055.ttl, E0067.ttl, E0072.ttl, E0076.ttl <--- terms can contain dashes
legal-ie:DPA                             in: E0036.ttl <--- terms can contain dashes
legal-ie:law                             in: E0036.ttl <--- terms can contain dashes
nace:72                                  in: E0008.ttl <--- to add nace prefix in our list
policy:1                                 in: E0030.ttl <--- syntax error in IRI, fixed

Fixed the regex as suggested.

Run against the latest code in main repo:

Terms used in examples but NOT defined in vocabulary files (2)
----------------------------------------------------------
dpv:DataLoss                             in: E0068.ttl, E0069.ttl
nace:72                                  in: E0008.ttl

dpv:DataLoss should be risk:DataLoss; will fix in another PR. -- PR #383

bact avatar Sep 09 '25 22:09 bact