data-standard Identifier.id should be a required field

First up - I've checked this against master and 0.2-dev and looked through existing issues to no avail, but please correct me if I'm wrong in reporting this :) (either if this is the wrong place, it's already known, or it's not a bug).

In the Identifiers component of Entity statements, three configurations of data are listed as valid: scheme, scheme name, or both scheme and scheme name. Isn't this missing the obvious point that id, i.e. the actual value of the identifier, should be required too?

It seems like the validation originally came from: https://github.com/openownership/data-standard/commit/8a93fe20b0ac9884d9ae655d6f1bb22003a1027b, but I can't see any particular rationale for excluding id there.
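To make the question concrete, here is a minimal Python sketch of the rule as I've described it above. It is not the actual schema: the function name is hypothetical, the id values are made up, and I'm assuming the check is on presence of the `scheme` / `schemeName` keys.

```python
# Sketch of the rule as described in this issue (not the real BODS schema):
# an identifier is accepted if it carries scheme, schemeName, or both.
# The id key is never checked.

def passes_current_rule(identifier: dict) -> bool:
    """Return True if the identifier has a scheme or a schemeName key."""
    return "scheme" in identifier or "schemeName" in identifier


examples = [
    {"scheme": "GB-COH", "id": "01234567"},  # full identifier (made-up id)
    {"scheme": "GB-COH"},                     # no id, still accepted
    {"schemeName": "Companies House"},        # no id, still accepted
    {"id": "01234567"},                       # id only, rejected
]

for example in examples:
    print(example, passes_current_rule(example))
```

The second and third examples are the cases I'm asking about: they pass even though they don't identify anything.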

Apr 15 '19 11:04 stevenday

Thanks Steve. Ideally we would require identifiers, but the thinking here came from work on loose validation (#45 and #80): we would be dealing with a lot of legacy and user-inputted data where identifiers were either missing or not collected at all. I think this still holds. Checking Companies House bulk downloads with jq (although using a 2018 file!), around 15% of companies declared as RLEs don't have a registration number (although this will include some unregistered entities etc. that have been declared as RLEs but won't have identifiers).

So we are trying to do two things in the identifier block:

1. have a structure for good identifiers that is consistent with the identifier block in other standards.
2. retain the information that will help with identifying an entity even when the identifier itself is missing ("this is where the identifier is most likely to be").

Having read your thoughts on this, I'm coming round to the idea that (2) has created a confusing structure because we are overloading that identifier block to give unvalidated and partial information, as well as good identifiers.

We could, therefore, go back to one of the original suggestions, which was to have incorporatedInCompanyRegister as a top-level property for entities where no identifier exists. This could be either a scheme from org-id or a free text field, allowing us to represent partial information at different levels of certainty:

```json
"incorporatedInCompanyRegister": {
  "name": "Jebel Ali Free Zone"
}
```

or

```json
"incorporatedInCompanyRegister": {
  "scheme": "GB-COH"
}
```

The identifier block itself could then require an id, with one or both of scheme or schemeName.
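As a rough illustration of that stricter rule, here is a Python sketch. It assumes the suggestion above were adopted as written; `identifier_problems`, the entity name, and the id value are hypothetical, and none of this is the real schema.

```python
# Sketch of the suggested rule: every identifier needs an id plus
# scheme and/or schemeName. Partial information moves to the
# top-level incorporatedInCompanyRegister property instead.

def identifier_problems(identifier: dict) -> list:
    """List the ways an identifier fails the suggested rule."""
    problems = []
    if "id" not in identifier:
        problems.append("missing id")
    if "scheme" not in identifier and "schemeName" not in identifier:
        problems.append("missing scheme or schemeName")
    return problems


# An entity with no known registration number: nothing goes in identifiers.
entity_without_number = {
    "name": "Example Holdings",  # made-up name
    "identifiers": [],
    "incorporatedInCompanyRegister": {"scheme": "GB-COH"},
}

# An entity with a full identifier (made-up id value).
entity_with_number = {
    "name": "Example Trading",
    "identifiers": [{"scheme": "GB-COH", "id": "01234567"}],
}

for entity in (entity_without_number, entity_with_number):
    for identifier in entity["identifiers"]:
        print(entity["name"], identifier_problems(identifier))
```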

@siwhitehouse Here is the working for seeing how many companies declared as RLEs don't have registration numbers. (If you group by date, these are pretty much all from 2016, so this may have changed. Or it could be that the snapshot is ordered by date; I haven't checked that!)

jq -s '[.[]] | map(select((.data.kind=="corporate-entity-person-with-significant-control") and (.data.identification.registration_number == null))) | length' psc-snapshot-2018-01-14_1of11.txt
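For anyone without jq to hand, the same count can be sketched in Python. This assumes the snapshot holds one JSON record per line; the function name and the sample records are made up for illustration.

```python
import json

RLE_KIND = "corporate-entity-person-with-significant-control"


def count_rles_without_number(lines) -> int:
    """Count corporate-entity PSC records whose registration_number is missing or null."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        data = record.get("data") or {}
        if data.get("kind") != RLE_KIND:
            continue
        identification = data.get("identification") or {}
        if identification.get("registration_number") is None:
            count += 1
    return count


# Made-up sample records standing in for lines of the snapshot file.
sample_lines = [
    '{"data": {"kind": "corporate-entity-person-with-significant-control", "identification": {"registration_number": "01234567"}}}',
    '{"data": {"kind": "corporate-entity-person-with-significant-control", "identification": {"legal_form": "Limited"}}}',
    '{"data": {"kind": "corporate-entity-person-with-significant-control"}}',
    '{"data": {"kind": "individual-person-with-significant-control"}}',
]

print(count_rles_without_number(sample_lines))
```

To run it against the real file, pass an open file handle instead of `sample_lines`, since iterating a file yields its lines.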

May 03 '19 08:05 ScatteredInk

Thanks for taking the time to explain the history and rationale @ScatteredInk. I absolutely understand that there will be entities without identifiers, but it was the 'overloading' you identify that confused me.

I suppose this could also be alleviated by documentation, however I'd tend to agree that overloading Identifiers to mean two different things isn't ideal. Mapping this to the Register's DB, I'd have to bail out of creating our equivalent of an 'identifier' when the id is missing and put the information somewhere else, which would be messy and complicated code.
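A rough sketch of the kind of branching I mean, in Python. This is not the Register's actual code; `split_identifiers` and the sample values are hypothetical.

```python
# Sketch of what a consumer has to do while the identifier block is overloaded:
# separate real identifiers from "where the identifier would be" hints.

def split_identifiers(identifiers: list) -> tuple:
    """Return (complete, partial): entries with a non-empty id, and the rest."""
    complete, partial = [], []
    for identifier in identifiers:
        if identifier.get("id"):
            complete.append(identifier)
        else:
            partial.append(identifier)
    return complete, partial


identifiers = [
    {"scheme": "GB-COH", "id": "01234567"},   # made-up id value
    {"schemeName": "Jebel Ali Free Zone"},     # hint only, no id
]

complete, partial = split_identifiers(identifiers)
print("store as identifiers:", complete)
print("store somewhere else:", partial)
```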

Regarding the alternative suggestion, how is incorporatedInCompanyRegister much different from incorporatedInJurisdiction? I'm just wondering if, in practice, it's any more likely to give me (as a consumer of the data) a better chance of uniquely identifying a company?

May 03 '19 10:05 stevenday

Just noting that the (not ideal) overloading will now be better documented in v0.2, since #159 will be fixed.

May 22 '19 15:05 kd-ods

Leaving this open for consideration in next upgrade.

Mar 23 '20 17:03 timgdavies

Reviewing this issue, my proposal is that for the next release of BODS:

- identifier.id remains unrequired
- The identifier.id description should be edited to read: "The identifier for this person or entity as provided in the declared scheme. An identifier SHOULD be published if known."
- We update the Entity identifiers guidance to say that where an entity's id is unknown and the name of its registering body is not known (and therefore schemeName cannot be supplied) then jurisdiction SHOULD represent where it is incorporated, if known.

@siwhitehouse @ScatteredInk - can you see what you think of the above proposal, please? If viable, it can go on the backlog for 0.3.

Nov 01 '21 13:11 kd-ods

> "The identifier for this person or entity as provided in the declared scheme. An identifier SHOULD be published if known."

This hasn't been added to the description yet.

> We update the Entity identifiers guidance to say that where an entity's id is unknown and the name of its registering body is not known (and therefore schemeName cannot be supplied) then jurisdiction SHOULD represent where it is incorporated, if known.

We haven't added this either.

Is this still relevant?

May 13 '24 09:05 kathryn-ods

At this point, with the 0.4 release, I think the field descriptions are all ok.

I'm in favour of making id required for the next release. Making it required doesn't mean an empty string can't be supplied. I've tagged this issue so that it can be considered as part of a general review of required fields.
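To illustrate the empty-string point, here is a small Python sketch. It assumes "required" means what a JSON Schema `required` entry means, i.e. the property must be present; both function names are hypothetical, and the stricter check is only shown for contrast, not proposed here.

```python
# A presence-only "required" check accepts an empty id.

def has_required_id(identifier: dict) -> bool:
    """Presence-only check, comparable to a JSON Schema `required` entry."""
    return "id" in identifier


def has_non_empty_id(identifier: dict) -> bool:
    """Stricter check, comparable to also adding `minLength: 1`."""
    value = identifier.get("id")
    return isinstance(value, str) and len(value) > 0


unknown_id = {"scheme": "GB-COH", "id": ""}

print(has_required_id(unknown_id))   # the empty string satisfies "required"
print(has_non_empty_id(unknown_id))  # but not a minimum-length rule
```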

May 20 '24 13:05 kd-ods