Some org author names are being incorrectly rearranged around commas on import

Open hornc opened this issue 1 year ago • 6 comments

Problem

Some imported author names from MARC imports are being treated as comma separated personal names when they are org names with parenthetical place clarifications. There may be other examples of similar problems.

The problem is that commas are assumed to indicate 'Last, First' personal names.

The fix would be to add some logic to distinguish between valid 'Last, First' personal names and other names containing commas, which should not be rearranged.

To do this we probably need some more examples.

Although, since MARC 710 is 710 - Added Entry-Corporate Name, it makes sense that this field should never be rearranged as a personal name... needs some thought.

example: China). Zhongguo shi ge yan jiu zhong xin Shou du shi fan da xue (Beijing

https://openlibrary.org/authors/OL14087565A/China).Zhongguo_shi_ge_yan_jiu_zhong_xin_Shou_du_shi_fan_da_xue(Beijing

Original MARC:

https://openlibrary.org/show-records/harvard_bibliographic_metadata/ab.bib.10.20150123.full.mrc:134034844:1323

The name is taken from 710$

710 2  $6880-04$aShou du shi fan da xue (Beijing, China).$bZhongguo shi ge yan jiu zhong xin.

And '(Beijing, China)' is being treated and rearranged as if it were 'Last, First'
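
A minimal sketch of how this kind of failure can arise, assuming a naive flip that splits on the first comma. `naive_flip` is a hypothetical helper written for illustration only; it is not the actual Open Library import code, and joining $a and $b with a space is also an assumption:

```python
def naive_flip(name: str) -> str:
    """Hypothetical: treat any 'A, B' string as 'Last, First' and return 'First Last'."""
    name = name.rstrip(".")
    if ", " not in name:
        return name
    last, first = name.split(", ", 1)
    return f"{first} {last}"


# $a and $b of the 710 above, joined with a space (an assumption for this sketch).
org = "Shou du shi fan da xue (Beijing, China). Zhongguo shi ge yan jiu zhong xin."
print(naive_flip(org))
# China). Zhongguo shi ge yan jiu zhong xin Shou du shi fan da xue (Beijing
```

The comma inside the parenthetical place qualifier is the only comma in the string, so everything before it ends up at the back of the name.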

Reproducing the bug

  • Expected behavior:

The resultant name should be more like:

Shou du shi fan da xue (Beijing, China) – Zhongguo shi ge yan jiu zhong xin

or 首都师范大学 (Beijing, China) – 中国诗歌硏究中心
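
One way the expected form could be produced, as a rough sketch: join the $a and $b values in their original order with ' – ' and drop the trailing full stops. `format_corporate_name` and the `(code, value)` tuple layout are hypothetical names chosen for this example, not existing Open Library functions:

```python
def format_corporate_name(subfields: list[tuple[str, str]]) -> str:
    """Join $a and $b values in record order with ' – '; never reorder around commas."""
    parts = [
        value.strip().rstrip(".").strip()
        for code, value in subfields
        if code in ("a", "b")  # ignore $6 and any other subfields in this sketch
    ]
    return " – ".join(part for part in parts if part)


field_710 = [
    ("6", "880-04"),
    ("a", "Shou du shi fan da xue (Beijing, China)."),
    ("b", "Zhongguo shi ge yan jiu zhong xin."),
]
print(format_corporate_name(field_710))
# Shou du shi fan da xue (Beijing, China) – Zhongguo shi ge yan jiu zhong xin
```

Stripping every trailing full stop is a simplification; names that legitimately end in an abbreviation would need extra care.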

  • Actual behavior:

Context

  • Environment (prod, dev, local): prod


hornc avatar Aug 22 '24 20:08 hornc

The original characters in this example are also not being imported, that would be a new feature, but there may be existing code for other fields to fetch 880 original script values.

880 2  $6710-04$a首都师范大学 (Beijing, China).$b中国诗歌硏究中心.

Noting this in passing.

hornc avatar Aug 22 '24 20:08 hornc

The original characters in this example are also not being imported, that would be a new feature, but there may be existing code for other fields to fetch 880 original script values.

880 2  $6710-04$a首都师范大学 (Beijing, China).$b中国诗歌硏究中心.

That looks like a case that #7652 was intended to handle. It has the correct linkage in the $6 (710-04), so the Chinese script should have been associated with the author record as an alternate name (although I'd actually prefer to have it as the default name).

One strong reason to try to get this right is that I've yet to find any translation software that can deal with transliterated text (although it might be possible with a multi-step process), but if I paste the Chinese into Google Translate, I get back the perfectly reasonable translation of "Capital Normal University (Beijing, China).$bChinese Poetry Research Center."

tfmorris avatar Aug 22 '24 21:08 tfmorris

Although, since MARC 710 is 710 - Added Entry-Corporate Name, it makes sense that this field should never be rearranged as a personal name... needs some thought.

Actually, there are specific guidelines for this field at https://www.loc.gov/marc/bibliographic/bdx10.html and they say that a First Indicator value of 2, which this 710 has, means "Name in direct order" so it definitely shouldn't be rearranged.

The full set of values is:

Type of corporate name entry element

  • 0 - Inverted name
  • 1 - Jurisdiction name
  • 2 - Name in direct order

Meeting names have a similar set of indicators: https://www.loc.gov/marc/bibliographic/bdx11.html
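
As a sketch of how the first indicator could gate any reordering: `is_inverted_entry` below is a hypothetical guard, not an existing function in the importer, and it assumes the X11 meeting-name fields use 0 with the same "Inverted name" meaning as X10:

```python
def is_inverted_entry(tag: str, ind1: str) -> bool:
    """Return True only when an X10/X11 field declares its name as inverted (ind1 == '0').

    Values '1' (jurisdiction name) and '2' (name in direct order) mean the
    name should be left exactly as catalogued.
    """
    if len(tag) != 3 or tag[1:] not in ("10", "11"):
        raise ValueError(f"not a corporate or meeting name field: {tag!r}")
    return ind1 == "0"


# The 710 in this issue has first indicator 2, so it must not be rearranged.
print(is_inverted_entry("710", "2"))  # False
```

With a check like this in front of any comma handling, the "(Beijing, China)" qualifier would never reach a 'Last, First' flip.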

tfmorris avatar Aug 22 '24 21:08 tfmorris

I haven't investigated in depth, but something suspicious that catches my eye is that the $6 subfield isn't listed for any of these entries:

https://github.com/internetarchive/openlibrary/blob/bf4bc9d8e9d4f3bda987215a310c959b23ca6d52/openlibrary/catalog/marc/parse.py#L580-L585

but in any case, it should be pretty quick to find with an appropriate test case or two...

tfmorris avatar Aug 22 '24 21:08 tfmorris

This is another example for an 880 original script test case: https://openlibrary.org/show-records/harvard_bibliographic_metadata/ab.bib.13.20150123.full.mrc:554571729:1305

hornc avatar Aug 25 '24 11:08 hornc

Thanks for your analysis on this issue @tfmorris !

I've been trying to add a failing test for the name ordering, and it keeps getting handled correctly in the current code. I think I fixed this recently in #9601 by refactoring, before I had noticed this specific version of the problem. This import seems to have occurred after the merge but, crucially, before the fixes were deployed to production.

I think the name ordering is fixed by #9601, but I will continue with the 880 original script improvement. I agree, I thought we'd fixed this previously (it's working for titles and direct authors), but it looks like the 7XX fields missed out on the original script treatment. I'll rectify this.

hornc avatar Aug 25 '24 22:08 hornc

Another recent example of this issue, for testing the fix: https://openlibrary.org/books/OL30706353M/Ubuntu_good_faith_and_equity and its org/conference author: https://openlibrary.org/authors/OL8657904A/South_Africa)Humboldt_Kolleg_Interdisciplinary_Conference(2010_Potchefstroom

hornc avatar Sep 03 '24 20:09 hornc