
Elasticsearch schema BODS data nested type

Open tiredpixel opened this issue 2 years ago • 0 comments

Similar to https://github.com/openownership/register/issues/225 , indexes such as bods_v2_psc_prod100 use the nested field type for publicationDetails. This makes the data harder to explore and complicates queries, since it prevents inner object flattening. I can't see a reason for using a nested field type in this case; this would need a little investigation.

I only just realised that publicationDetails.publicationDate gets set when statements are republished. This is admittedly in accordance with BODS 0.2. Perhaps I should have spotted this sooner, but I didn't, because publicationDate is buried within publicationDetails as a nested object, and even though I now know it's there, it's still hard to use it for data exploration or debugging, because of the field type.
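To illustrate the extra query complexity, here is a minimal sketch of the two request bodies needed to filter on publicationDate, depending on how publicationDetails is mapped. The helper names and the example date are made up for illustration; only the field path comes from this issue.

```python
def nested_date_query(since):
    """Search body needed while publicationDetails is mapped as `nested`.

    The range clause has to be wrapped in a `nested` query naming the path.
    """
    return {
        "query": {
            "nested": {
                "path": "publicationDetails",
                "query": {
                    "range": {
                        "publicationDetails.publicationDate": {"gte": since}
                    }
                },
            }
        }
    }


def object_date_query(since):
    """Search body that would suffice if publicationDetails were a plain object."""
    return {
        "query": {
            "range": {"publicationDetails.publicationDate": {"gte": since}}
        }
    }


# Hypothetical cut-off date, purely for illustration.
nested_body = nested_date_query("2023-12-01")
object_body = object_date_query("2023-12-01")
```

With a nested mapping the plain form returns no hits, because nested objects are indexed as separate hidden documents; the same wrapping is needed for aggregations.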

I suggest evaluating the use of all nested field types within BODS indexes in Elasticsearch, to see whether each one is in fact necessary or desirable. There might well be good reasons for some of them: identifiers comes to mind, for which the use of a nested field type is not only desirable but critical to correct results being returned. But some others, particularly those not modelled as arrays of objects (arrays of objects require special treatment in Elasticsearch, since there is no dedicated array field type), would benefit from being re-evaluated.
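As a rough illustration of why identifiers probably does need the nested type, the following pure-Python sketch mimics how an array of objects is flattened when mapped as a plain object, and how that loses the link between values of the same object. It only simulates the behaviour and does not talk to Elasticsearch; the function names and identifier values are made up.

```python
def flatten_objects(field, objects):
    """Mimic object-type flattening: each leaf becomes one multi-valued field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flattened(flat, conditions):
    """Every condition may be satisfied by a value from *any* object."""
    return all(value in flat.get(path, []) for path, value in conditions.items())


def matches_nested(objects, field, conditions):
    """All conditions must be satisfied by the *same* object."""
    prefix = field + "."
    return any(
        all(obj.get(path[len(prefix):]) == value for path, value in conditions.items())
        for obj in objects
    )


# Made-up identifiers, for illustration only.
identifiers = [
    {"scheme": "GB-COH", "id": "01234567"},
    {"scheme": "DK-CVR", "id": "89012345"},
]
# Scheme from the first identifier combined with the id from the second.
wanted = {"identifiers.scheme": "GB-COH", "identifiers.id": "89012345"}

flat = flatten_objects("identifiers", identifiers)
print(matches_flattened(flat, wanted))                     # True (false positive)
print(matches_nested(identifiers, "identifiers", wanted))  # False
```

Fields that only ever hold a single object, which publicationDetails appears to be, cannot produce this kind of cross-object match, so the nested type buys nothing there.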

Fields to check

  • [ ] addresses
  • [ ] annotations
  • [ ] identifiers (almost certainly correct, as noted above)
  • [ ] incorporatedInJurisdiction
  • [ ] interestedParty
  • [ ] interestedParty.unspecified
  • [ ] interests
  • [ ] interests.share
  • [ ] names
  • [ ] nationalities
  • [ ] pepStatusDetails
  • [ ] pepStatusDetails.source
  • [ ] pepStatusDetails.source.assertedBy
  • [ ] placeOfBirth
  • [ ] placeOfResidence
  • [ ] publicationDetails (likely incorrect, as noted above)
  • [ ] publicationDetails.publisher
  • [ ] source
  • [ ] source.assertedBy
  • [ ] subject
  • [ ] taxResidencies
  • [ ] unspecifiedEntityDetails
  • [ ] unspecifiedPersonDetails
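A checklist like the one above could be regenerated from the mappings themselves. The sketch below walks a mapping dictionary (the shape found under "mappings" in a get-mapping response) and collects every path typed as nested. The example mapping is a small made-up fragment, not the real BODS mapping.

```python
def nested_paths(mapping, prefix=""):
    """Return the dotted paths of all fields whose type is `nested`."""
    paths = []
    for name, spec in mapping.get("properties", {}).items():
        path = f"{prefix}{name}"
        if spec.get("type") == "nested":
            paths.append(path)
        # Recurse into sub-properties of both object and nested fields.
        paths.extend(nested_paths(spec, prefix=f"{path}."))
    return sorted(paths)


# Made-up fragment for illustration; the real mappings are much larger.
example_mapping = {
    "properties": {
        "statementID": {"type": "keyword"},
        "identifiers": {
            "type": "nested",
            "properties": {
                "scheme": {"type": "keyword"},
                "id": {"type": "keyword"},
            },
        },
        "publicationDetails": {
            "type": "nested",
            "properties": {
                "publicationDate": {"type": "date"},
                "publisher": {
                    "type": "nested",
                    "properties": {"name": {"type": "text"}},
                },
            },
        },
    }
}

print(nested_paths(example_mapping))
# ['identifiers', 'publicationDetails', 'publicationDetails.publisher']
```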

Indexes to migrate

  • [ ] bods_v2_psc_prod100
  • [ ] bods_v2_dk_prod100
  • [ ] bods_v2_sk_prod100
  • [ ] bods_v2_am_prod100
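Since the type of an existing field cannot be changed in place, migrating would likely mean creating new indexes with the revised mappings and copying the documents across with the reindex API. A minimal sketch of the request bodies, assuming a hypothetical `_object` suffix for the destination index names:

```python
SOURCE_INDEXES = [
    "bods_v2_psc_prod100",
    "bods_v2_dk_prod100",
    "bods_v2_sk_prod100",
    "bods_v2_am_prod100",
]


def reindex_body(source_index, dest_index):
    """Body for `POST _reindex`, copying documents from source to destination."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# The "_object" suffix is a made-up naming scheme, purely for illustration.
plans = [reindex_body(name, f"{name}_object") for name in SOURCE_INDEXES]

for plan in plans:
    print(plan["source"]["index"], "->", plan["dest"]["index"])
```

The destination indexes would need to exist with the new mappings before reindexing, otherwise the field types would be inferred dynamically.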

Index templates

Given the number of affected indexes, which all contain the same mappings, this is likely a good time to consider using Elasticsearch index templates instead of defining the mappings per index. This would enable mappings to be maintained centrally and applied automatically to every matching index when it is created. Doing so would also eliminate the need to run multiple 'create indexes' steps within the various transformers.
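A minimal sketch of what such a template body might look like, assuming composable index templates (Elasticsearch 7.8 or later) and a `bods_v2_*` index pattern inferred from the index names above. The priority value and the trimmed mapping are placeholders, not the real BODS mapping.

```python
def bods_index_template(mappings, pattern="bods_v2_*", priority=100):
    """Body for `PUT _index_template/<name>` sharing one mapping across indexes."""
    return {
        "index_patterns": [pattern],
        "priority": priority,  # placeholder value
        "template": {"mappings": mappings},
    }


# Trimmed, made-up mapping: publicationDetails as a plain object,
# identifiers kept as nested.
example_mappings = {
    "properties": {
        "identifiers": {
            "type": "nested",
            "properties": {
                "scheme": {"type": "keyword"},
                "id": {"type": "keyword"},
            },
        },
        "publicationDetails": {
            "properties": {"publicationDate": {"type": "date"}}
        },
    }
}

template_body = bods_index_template(example_mappings)
```

Templates are applied only when an index is created, so existing indexes would still need to be recreated or reindexed to pick up a changed mapping.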

References https://github.com/openownership/register/issues/189 , during which this was re-discovered.

tiredpixel, Dec 06 '23 14:12