
PICA: subject indexing analysis

Open pkiraly opened this issue 2 years ago • 3 comments

According to Maquis plan, PICA contains subject indexing information in the following fields: 045A, 045B, 045F, 045R.

pkiraly avatar Aug 16 '22 19:08 pkiraly

@nichtich Could you adjust this list?

pkiraly avatar Aug 16 '22 19:08 pkiraly

The K10plus subset limited to subject information, published as lists the following fields, related to content of a publication in its broadest sense:

003@ with internal record identifier “PPN” in subfield $0
013D type of content
013F target audience
041A keywords
044. all subject indexing fields starting with 044
045. all subject indexing fields starting with 045
144Z local library keywords
145S local library classification
145Z local library classification

Meanwhile we have further identified:

010@ language
013E musical type of document
013H additional type of document
017G and 017H URL for catalog enrichment (e.g. table of contents)
047I abstract

The definition of subject fields for MQA is likely more strict, so I'd say:

041A keywords
044. all subject indexing fields starting with 044
045. all subject indexing fields starting with 045
144Z local library keywords
145S local library classification

The current list of 044. and 045. fields can be obtained via:

curl -s 'https://format.k10plus.de/avram.pl?profile=k10plus-title' | jq -r '.fields|keys[]' | grep '^04[45]'
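
For use inside a script, the same filter could be sketched in Python. This is only an illustration: it assumes the Avram schema has already been downloaded and parsed into a dict with a top-level `fields` object (as the jq expression above implies), and `subject_field_keys` and the tiny `sample_schema` are hypothetical.

```python
def subject_field_keys(schema):
    """Return the sorted field keys that start with 044 or 045."""
    return sorted(
        key for key in schema.get("fields", {})
        if key.startswith(("044", "045"))
    )

# Made-up miniature schema, only to show the shape of the input.
sample_schema = {
    "fields": {
        "003@": {},
        "041A": {},
        "045A": {},
        "044K": {},
    }
}

subject_keys = subject_field_keys(sample_schema)
print(subject_keys)
```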

nichtich avatar Aug 17 '22 06:08 nichtich

@nichtich wrote:

Attached a file that could be used as configuration file for MQA for subject headings. Each entry has

  • PICA : which PICA field(s) (e.g. 045B)
  • URI : which vocabulary (BARTOC entry, more details about vocabulary can be found there)
  • prefLabel: name of vocabulary
  • ID: where to take local identifier/notation from
  • notationPattern : regular expression to check local identifier/notation
  • namespace : URI namespace (for some vocabularies)
  • SRC : not relevant to this question
  • VOC : not relevant to this question

The ID is given as regular expression including subfield code and with capturing group, e.g.

"PICA": "044L/00-99",
"ID": "^7gnd/(.+)"

means that all fields 044L (any occurrence) are GND and the GND identifier is in subfield $7 preceded by "gnd/".
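
A minimal Python sketch of how such an ID pattern might be applied. It assumes the pattern is matched against the subfield code concatenated with the subfield value, which is one reading of the description above. The function `extract_id` and the sample subfield values are hypothetical and not taken from MQA or from a real record.

```python
import re

def extract_id(id_pattern, subfields):
    """Return the first captured identifier from a field's subfields.

    subfields: list of (code, value) tuples for one PICA field.
    The pattern is tested against code + value, e.g. "7gnd/...".
    """
    regex = re.compile(id_pattern)
    for code, value in subfields:
        match = regex.match(code + value)
        if match:
            return match.group(1)
    return None

# Made-up subfields of a single 044L field, for illustration only.
field_044L = [("9", "123456789"), ("7", "gnd/4033447-8"), ("a", "Kunst")]

gnd_id = extract_id("^7gnd/(.+)", field_044L)
print(gnd_id)
```

Subfields that do not start with the expected code and prefix simply do not match, so a field without a `$7gnd/...` subfield yields `None`.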

The regular expression to check for a valid GND identifier could be part of the Avram schema, but that is a separate task.

[
  {
    "ID": "^a(.+)",
    "PICA": "045A",
    "SRC": "^A(.+)",
    "VOC": "lcc",
    "notationPattern": "[A-Z]{1,3}([0-9]+(\\.[0-9]+)?( *\\.?[A-Z]{0,3}[0-9]*([ -]\\.?[A-Z]{0,3}[0-9]+)?)?( *[0-9]+[a-z]*)?)?",
    "namespace": "http://id.loc.gov/authorities/classification/",
    "prefLabel": { "en": "Library of Congress Classification" },
    "uri": "http://bartoc.org/en/node/486"
  },
  ...
]
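
As a rough illustration of how `notationPattern` could be used, the following Python sketch loads a trimmed copy of the entry above and checks a few notations. It assumes the pattern is meant to match the whole notation (hence `re.fullmatch`), which the configuration itself does not state. `is_valid_notation` and the test notations are hypothetical.

```python
import json
import re

# Trimmed copy of the LCC entry shown above; a raw string keeps the
# JSON escapes (\\.) intact until json.loads decodes them.
CONFIG_JSON = r'''
[
  {
    "ID": "^a(.+)",
    "PICA": "045A",
    "notationPattern": "[A-Z]{1,3}([0-9]+(\\.[0-9]+)?( *\\.?[A-Z]{0,3}[0-9]*([ -]\\.?[A-Z]{0,3}[0-9]+)?)?( *[0-9]+[a-z]*)?)?"
  }
]
'''

def is_valid_notation(entry, notation):
    """Check a notation against the entry's notationPattern (full match assumed)."""
    return re.fullmatch(entry["notationPattern"], notation) is not None

entry = json.loads(CONFIG_JSON)[0]
results = {n: is_valid_notation(entry, n) for n in ["QA76.73", "Z699", "123"]}
print(results)
```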

pkiraly avatar Sep 23 '22 16:09 pkiraly