openlibrary icon indicating copy to clipboard operation
openlibrary copied to clipboard

Parsing `]` in MARC datafield 260 (Imprint)

Open scottbarnes opened this issue 2 years ago • 2 comments

I am not sure that this is a bug, but I noticed when parsing MARC records that ] in places names is not removed. For example, consider this MARC record:

<datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">[Paris] :</subfield>
    <subfield code="b">Gallimard,</subfield>
    <subfield code="c">DL 2017</subfield>
    <subfield code="e">(impr. en Italie)</subfield>
</datafield>

After parsing, [Paris] ultimately becomes "publish_places": ["Paris]"] in the edition data (note the ] in "Paris]".

This is likely because publish_place() doesn't strip ]:

def publish_place(s: str) -> str:
    place = s.strip(' /.,;:[')
    if place.lower().startswith('s.l'):  # Sine loco
        place = '[s.l.]'
    return place

However, my hope for a one character solution was almost immediately dashed, because stripping ] was not the solution as this pytest output shows:

AssertionError: Processed binary MARC values do not match expectations in /openlibrary/openlibrary/catalog/marc/tests/test_data/bin_expect/wwu_51323556.json. Key: publish_places
assert 'Oxford [England]' in ['Oxford [England', 'New York']

One naive solution is to strip ] only when there is a single place name, which makes the tests pass, but who knows how terrible it is as a solution:

def publish_place(s: str) -> str:
    # Strip ] in "[Paris]" but leave it in "Oxford [London]".
    place = s.strip(' /.,;:[]') if len(s.split()) == 1 else s.strip(' /.,;:[')
    if place.lower().startswith('s.l'):  # Sine loco
        place = '[s.l.]'
    return place

Thoughts? I can submit a small PR if things are so simple as I have made them appear, but as with many things related to MARC parsing, I suspect they are not.

Stakeholders

@hornc

scottbarnes avatar Aug 07 '23 00:08 scottbarnes

I think a better solution may be, if the string starts with [ and ends with ] then take the string[1:-1]. Then we should be able to apply the existing rule as-is, but without [ as our previous rule should have handled that, e.g. place = s.strip(' /.,;:')

def publish_place(s: str) -> str:
    if (place[0], place[-1]) == ('[', ']'):
        place = place[1:-1] 
    place = s.strip(' /.,;:')
    if place.lower().startswith('s.l'):  # Sine loco
        place = '[s.l.]'
    return place

mekarpeles avatar Sep 27 '23 19:09 mekarpeles

I think a better solution may be, if the string starts with [ and ends with ] then take the string[1:-1].

The problem with this is that the square brackets can be split across subfields and/or be reordered, so it's important that this rule be applied before any other transform.

The code needs to deal with cases like:

[Smith, John] => John] [Smith $a[s.d., $bParis] => date: "[s.d" place: "Paris]"

tfmorris avatar Aug 25 '24 21:08 tfmorris