qa-catalogue
Detecting harmful words in subject indexing
This suggestion comes from KBR:
More and more institutions are talking about the ‘harmful words’ in their catalogues, and about how, when and where to clean them up.
In fact there are two things:
- Harmful words in titles of publications (for example a book with the title ‘the dancing negro’): here we will not change the title, but maybe add a disclaimer (for what it’s worth).
- Harmful words in subject indexing (6XX): for example, we recently changed the subject term ‘negro art’ to ‘ethnic art’ (or something like that).
For the latter, it would be useful if we could ‘upload’ a CSV with harmful words (negro, gypsi, roma, Indians, Eskimo, ‘zwarte piet’ (Dutch), etc.) and the tool then analysed the whole catalogue and gave back a CSV of the IDNs where those harmful words appear (in the title or in subject indexing (6XX)), maybe together with the harmful words detected. Or a list of the harmful words detected (like the list of errors we now get from MARC 21 validation), each with a CSV of the IDNs. We could then use that list to correct our records.
It is maybe more complicated than that, because some words are not harmful in context A but are harmful in context B.
This sounds like a specific use case of the more generic issue of full-text search in selected fields. It requires:
- definition of an index covering selected subfields (e.g. title covers several subfields). This could be based on existing indexes in OPAC databases (e.g. in PICA CBS indexes are also defined from subfields).
- extended user interface for search (https://github.com/pkiraly/qa-catalogue-web/issues/2)
- allow submitting a list of words from a file in addition to the simple form field
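To make the first requirement more concrete, here is a minimal Python sketch of an index defined over selected subfields. It is not taken from qa-catalogue: the `INDEX_DEFINITION` mapping, the `index_values` helper, the simplified record shape and the choice of MARC 21 tags and subfields are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical index definition: index name -> list of (tag, subfield code).
# The tag/subfield choices below are illustrative assumptions only.
INDEX_DEFINITION: Dict[str, List[Tuple[str, str]]] = {
    "title": [("245", "a"), ("245", "b")],
    "subject": [("650", "a"), ("651", "a")],
}


def index_values(record: dict, index_name: str) -> List[str]:
    """Collect the subfield values of a record that belong to the named index.

    Assumed record shape (simplified, not a real MARC serialisation):
    {"id": str, "fields": [{"tag": str, "subfields": [(code, value), ...]}]}
    """
    wanted = set(INDEX_DEFINITION[index_name])
    values = []
    for field in record["fields"]:
        for code, value in field["subfields"]:
            if (field["tag"], code) in wanted:
                values.append(value)
    return values


# Made-up example record.
record = {
    "id": "rec-1",
    "fields": [
        {"tag": "245", "subfields": [("a", "Folk tales :"), ("b", "a collection")]},
        {"tag": "650", "subfields": [("a", "Folklore"), ("x", "History")]},
    ],
}

title_values = index_values(record, "title")
subject_values = index_values(record, "subject")
```

A search restricted to the title index would then only look at `title_values`, whatever subfields the definition lists for it.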
Some components are already available:
- one can download the list of record IDs matching the query
- the SHACL4Bib feature provides pattern matching and also field extraction
What is missing (from the command line part) is to fetch a list of words from a file instead of listing them in the config file.
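As a rough illustration of that missing piece, the Python sketch below reads terms from a CSV file, compiles them into one case-insensitive pattern, and reports which record IDs contain which term in which field. It does not use any qa-catalogue or SHACL4Bib code: `load_terms`, `compile_terms`, `scan` and the in-memory record shape are hypothetical names and assumptions chosen for this example.

```python
import csv
import io
import re
from typing import Dict, Iterable, Iterator, List, Tuple


def load_terms(csv_file) -> List[str]:
    """Read one term per row (first column), skipping blank rows."""
    return [row[0].strip() for row in csv.reader(csv_file) if row and row[0].strip()]


def compile_terms(terms: List[str]) -> "re.Pattern[str]":
    """Build one whole-word, case-insensitive pattern from the term list."""
    if not terms:
        raise ValueError("the term list is empty")
    # Longest terms first, so multi-word terms win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def scan(
    records: Iterable[Tuple[str, Dict[str, str]]], pattern: "re.Pattern[str]"
) -> Iterator[Tuple[str, str, str]]:
    """Yield (record id, field name, matched term) for every hit."""
    for record_id, fields in records:
        for field_name, text in fields.items():
            for match in pattern.finditer(text):
                yield (record_id, field_name, match.group(0).lower())


# Stand-in for the uploaded word file; a real run would open a file instead.
term_file = io.StringIO("Eskimo\nzwarte piet\n")

# Made-up records: (id, {field name: text}). Field names are assumptions.
records = [
    ("rec-1", {"title": "Eskimo folk tales", "subject": "Folklore"}),
    ("rec-2", {"title": "Sinterklaas en Zwarte Piet", "subject": "Holidays"}),
    ("rec-3", {"title": "Gardening", "subject": "Plants"}),
]

pattern = compile_terms(load_terms(term_file))
hits = list(scan(records, pattern))

# Write the report as CSV: record id, field, detected term.
report = io.StringIO()
writer = csv.writer(report)
writer.writerow(["id", "field", "term"])
writer.writerows(hits)
```

Plain word matching like this cannot tell whether a term is harmful in its context, so the resulting list is a starting point for manual review rather than a list of confirmed problems.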
So we have two options.
Right now fielded term search is not possible in the web interface (only fielded phrase search is). Supporting it requires not just a user interface change, but also a change in how we create the index.