Parsing `]` in MARC datafield 260 (Imprint)
I am not sure that this is a bug, but I noticed when parsing MARC records that ] in places names is not removed. For example, consider this MARC record:
<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">[Paris] :</subfield>
<subfield code="b">Gallimard,</subfield>
<subfield code="c">DL 2017</subfield>
<subfield code="e">(impr. en Italie)</subfield>
</datafield>
After parsing, [Paris] ultimately becomes "publish_places": ["Paris]"] in the edition data (note the ] in "Paris]".
This is likely because publish_place() doesn't strip ]:
def publish_place(s: str) -> str:
place = s.strip(' /.,;:[')
if place.lower().startswith('s.l'): # Sine loco
place = '[s.l.]'
return place
However, my hope for a one character solution was almost immediately dashed, because stripping ] was not the solution as this pytest output shows:
AssertionError: Processed binary MARC values do not match expectations in /openlibrary/openlibrary/catalog/marc/tests/test_data/bin_expect/wwu_51323556.json. Key: publish_places
assert 'Oxford [England]' in ['Oxford [England', 'New York']
One naive solution is to strip ] only when there is a single place name, which makes the tests pass, but who knows how terrible it is as a solution:
def publish_place(s: str) -> str:
# Strip ] in "[Paris]" but leave it in "Oxford [London]".
place = s.strip(' /.,;:[]') if len(s.split()) == 1 else s.strip(' /.,;:[')
if place.lower().startswith('s.l'): # Sine loco
place = '[s.l.]'
return place
Thoughts? I can submit a small PR if things are so simple as I have made them appear, but as with many things related to MARC parsing, I suspect they are not.
Stakeholders
@hornc
I think a better solution may be, if the string starts with [ and ends with ] then take the string[1:-1]. Then we should be able to apply the existing rule as-is, but without [ as our previous rule should have handled that, e.g. place = s.strip(' /.,;:')
def publish_place(s: str) -> str:
if (place[0], place[-1]) == ('[', ']'):
place = place[1:-1]
place = s.strip(' /.,;:')
if place.lower().startswith('s.l'): # Sine loco
place = '[s.l.]'
return place
I think a better solution may be, if the string starts with
[and ends with]then take the string[1:-1].
The problem with this is that the square brackets can be split across subfields and/or be reordered, so it's important that this rule be applied before any other transform.
The code needs to deal with cases like:
[Smith, John] => John] [Smith $a[s.d., $bParis] => date: "[s.d" place: "Paris]"