
Filter out some fields

Open pkiraly opened this issue 4 years ago • 5 comments

Radek Světlík (Education and Research Library in Pilsen, Czech Republic) wrote: "I would like to ask whether you can recommend how to omit deliberately some fields from validation".

The solution would be a new parameter, called --ignore-elements, which would accept a list of tags and subfields separated by semicolons, such as

--ignore-elements "100$a;650;651;700$a;700$2"
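
A minimal sketch of how such a value could be parsed, assuming items are separated by semicolons as in the example and that `$` separates a tag from a subfield code. The names `IgnoredElement` and `parse_ignore_elements` are hypothetical and not part of qa-catalogue.

```python
from typing import List, NamedTuple, Optional


class IgnoredElement(NamedTuple):
    tag: str
    subfield: Optional[str]  # None means the whole field is ignored


def parse_ignore_elements(spec: str) -> List[IgnoredElement]:
    """Split a spec such as '100$a;650' into (tag, subfield) pairs."""
    elements = []
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty items such as a trailing ';'
        tag, sep, code = item.partition("$")
        elements.append(IgnoredElement(tag, code if sep else None))
    return elements


elements = parse_ignore_elements("100$a;650;651;700$a;700$2")
for element in elements:
    print(element)
```

Keeping the tag and the optional subfield apart would let a validator skip a whole field (650) or only one subfield (700$a) with the same lookup.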

pkiraly avatar Jun 27 '20 10:06 pkiraly

We now have the --ignorableFields parameter, does this solve this issue?

nichtich avatar Jun 28 '23 10:06 nichtich

--ignorableFields ignores the whole field, but what we need here is to ignore only subfields. There is a relevant request from Gent: "Can you disregard field 852 (ind1=4) from the subfield check, as you’ve done for the undefined field check?" This could fit the following pattern: ignore [data element] if [condition] holds.

So the best would be to rename/improve ignorableFields with two new features:

  • specify any types of data elements (field, subfield, indicator, control field position) to ignore
  • specify a condition
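
To make the two features concrete, here is a rough sketch of an ignore rule that pairs a data element with an optional condition, using the Gent request (field 852 with ind1=4) as the example. The `IgnoreRule` class and the dict-based field representation are assumptions for illustration only, not qa-catalogue structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IgnoreRule:
    tag: str
    # Hypothetical condition: a predicate on a field, None means "always"
    condition: Optional[Callable[[dict], bool]] = None

    def applies_to(self, fld: dict) -> bool:
        """True if this field should be skipped by the check."""
        if fld["tag"] != self.tag:
            return False
        return self.condition is None or self.condition(fld)


# "Disregard field 852 (ind1=4)" expressed as a rule
rule = IgnoreRule("852", condition=lambda f: f.get("ind1") == "4")

print(rule.applies_to({"tag": "852", "ind1": "4"}))
print(rule.applies_to({"tag": "852", "ind1": " "}))
```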

pkiraly avatar Jun 28 '23 11:06 pkiraly

We have the same requirement but it's a can of worms: conditions especially can get quite complex. On the other hand, you already have a language to specify data elements and rules; this could be reused, e.g. --ignore-elements cleanup.yaml with cleanup.yaml being like this:

format: MARC
fields:
- name: custom-fields # optional name
  path: 900 # element to remove
- path: 040$a # element to remove
  rules: # optional rules
  - id: 040$a.pattern # optional id
    pattern: ^BE-KBR00 # only remove if value matches this pattern

By the way, I'd also like to get this as a standalone application to filter a file.

nichtich avatar Jun 28 '23 11:06 nichtich

yes, good idea.

pkiraly avatar Jun 28 '23 12:06 pkiraly

The syntax implemented in PicaFilter.java has not been documented yet, and it could be extended to a common syntax that is also used to formulate queries. The same syntax could also be used in Catmandu and pica-rs. Here is an excerpt of the documentation of pica-rs (which goes beyond this) I contributed earlier:

The basic building blocks of filter expressions are field expressions, which consist of a field tag (e.g. 003@), an optional occurrence (e.g. /03), and a subfield filter.

A simple field tag consists of a level number (0, 1, or 2) followed by two digits and a character (A to Z or @). The dot (.) can be used as a wildcard for any character, and square brackets can be used for alternative characters (e.g. 04[45]. matches all fields starting with 044 or 045 but no occurrence).
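
Since the wildcard and bracket notation described above coincides with regular expression syntax, a tag pattern can be tested with a plain regex match. This is only a sketch: `tag_matches` is a hypothetical helper, and it would also accept other regex constructs that the filter syntax may not allow.

```python
import re

# A concrete tag: level digit 0-2, two digits, then A-Z or @
TAG = re.compile(r"[012][0-9]{2}[A-Z@]")


def tag_matches(pattern: str, tag: str) -> bool:
    """Check a concrete tag against a tag pattern such as '04[45].'."""
    if not TAG.fullmatch(tag):
        raise ValueError(f"not a valid tag: {tag}")
    # '.' and '[..]' mean the same in the pattern and in a regex
    return re.fullmatch(pattern, tag) is not None


print(tag_matches("04[45].", "044A"))
print(tag_matches("04[45].", "046A"))
```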

Occurrence /00 and no occurrence are equivalent, /* matches all occurrences (including zero) and /01-10 matches any occurrence between /01 and /10. Exception: if the field tag starts with 2, no occurrence is read as /* instead of /00.
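
The occurrence rules can be sketched as a small function. It assumes the range /01-10 is inclusive on both ends and that occurrences can be compared as integers; `occurrence_matches` is a hypothetical name, with `spec` being the part after the slash (or None if the expression has no occurrence).

```python
from typing import Optional


def occurrence_matches(tag: str, spec: Optional[str],
                       occurrence: Optional[str]) -> bool:
    """Test a field's occurrence against the occurrence part of an expression."""
    actual = int(occurrence) if occurrence else 0  # no occurrence equals /00
    if spec is None:
        if tag.startswith("2"):
            return True  # exception: level 2 reads no occurrence as /*
        spec = "00"
    if spec == "*":
        return True
    if "-" in spec:
        low, high = spec.split("-")
        return int(low) <= actual <= int(high)
    return int(spec) == actual


print(occurrence_matches("045Q", "01-10", "05"))
print(occurrence_matches("209A", None, "03"))
```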

A simple subfield filter consists of the subfield code (a single alphanumeric character, e.g. 0), a comparison operator (equal ==, not equal !=, starts with prefix =^, ends with suffix =$, regex =~/!~, in and not in) and a value enclosed in single quotes. These simple subfield expressions can be grouped in parentheses and combined with boolean connectives (e.g. (0 == 'abc' || 0 == 'def')).
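
One way to picture the comparison operators is a lookup table from operator to predicate. The sketch below assumes a filter succeeds if at least one subfield with the given code passes; whether negated operators should instead require all subfields to pass is left open here. `OPERATORS` and `subfield_matches` are hypothetical names, and parsing of the expression text is not covered.

```python
import re

OPERATORS = {
    "==": lambda value, arg: value == arg,
    "!=": lambda value, arg: value != arg,
    "=^": lambda value, arg: value.startswith(arg),
    "=$": lambda value, arg: value.endswith(arg),
    "=~": lambda value, arg: re.search(arg, value) is not None,
    "!~": lambda value, arg: re.search(arg, value) is None,
    "in": lambda value, arg: value in arg,
    "not in": lambda value, arg: value not in arg,
}


def subfield_matches(subfields, code, op, arg):
    """True if any subfield with the given code satisfies the comparison."""
    test = OPERATORS[op]
    return any(test(value, arg) for c, value in subfields if c == code)


subfields = [("0", "def"), ("a", "x")]

# Equivalent of (0 == 'abc' || 0 == 'def')
grouped = (subfield_matches(subfields, "0", "==", "abc")
           or subfield_matches(subfields, "0", "==", "def"))
print(grouped)
```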

A special existence operator can be used to check if a given field (012A/00?) or a subfield (002@$0? or 002@.0?) exists. To test for the number of times a field or subfield exists in a record or field respectively, use the cardinality operator # with a comparison operator (e.g. #010@ > 1).
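
Existence and cardinality checks reduce to counting over the record. The sketch below ignores occurrences and assumes a record is a list of fields with `(code, value)` subfield pairs; the function names are hypothetical.

```python
def count_fields(record, tag):
    """Cardinality: how many fields with this tag the record contains."""
    return sum(1 for fld in record if fld["tag"] == tag)


def field_exists(record, tag):
    """Existence of a field, as in 012A?"""
    return count_fields(record, tag) > 0


def subfield_exists(record, tag, code):
    """Existence of a subfield, as in 002@$0?"""
    return any(c == code
               for fld in record if fld["tag"] == tag
               for c, _ in fld["subfields"])


record = [
    {"tag": "002@", "subfields": [("0", "Aau")]},
    {"tag": "010@", "subfields": [("a", "ger")]},
    {"tag": "010@", "subfields": [("a", "eng")]},
]

# Equivalent of #010@ > 1
print(count_fields(record, "010@") > 1)
```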

Field expressions can be combined into complex expressions with the boolean connectives AND (&&) and OR (||). Boolean expressions can be grouped with parentheses. AND has higher precedence than OR, so A || B && C is equivalent to A || (B && C). Expressions are evaluated lazily from left to right, so given A || B, if A is true then B will not be evaluated.
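
Python's `and` and `or` happen to follow the same precedence and short-circuit rules, so they can illustrate the behaviour described above. The `term` helper is made up for this demonstration and records which operands were actually evaluated.

```python
calls = []


def term(name, value):
    """Stand-in for a field expression that logs when it is evaluated."""
    calls.append(name)
    return value


# A || B && C with A true: read as A || (B && C), so B and C are skipped.
# Read as (A || B) && C it would evaluate C and yield False instead.
result = term("A", True) or term("B", False) and term("C", False)

print(result, calls)
```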

nichtich avatar Sep 22 '23 09:09 nichtich