MARC 100 vs 700 author / contributor inconsistency
Compare:
- https://github.com/hornc/openlibrary-1/blob/880_alternate_scripts/openlibrary/catalog/marc/tests/test_data/bin_expect/880_arabic_french_many_linkages.json
- https://github.com/hornc/openlibrary-1/blob/880_alternate_scripts/openlibrary/catalog/marc/tests/test_data/bin_expect/880_Nihon_no_chasho.json
Why are the other 700 field contributors not being imported as authors in the same way as the 700s from the Nihon no chasho example?
It looks like in the 100 + 700s case, only the 100 individual is made an author, the 700s are contributors.
In the only 700s case, each 700 is added as an equal author.
Is this desired behavior?
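To make the question concrete, here is a minimal sketch of the behavior as I understand it from the two test fixtures. It is not the importer's actual code: `split_names` and the `(tag, name)` record shape are hypothetical, made up only to illustrate the two cases.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified record: a list of (tag, name) pairs.
Field = Tuple[str, str]


def split_names(fields: List[Field]) -> Dict[str, List[str]]:
    """Mimic the observed 100/700 behavior (illustration only)."""
    main = [name for tag, name in fields if tag == "100"]
    added = [name for tag, name in fields if tag == "700"]
    if main:
        # 100 + 700s: only the 100 becomes an author, 700s are contributions.
        return {"authors": main, "contributions": added}
    # 700s only: every 700 is treated as an equal author.
    return {"authors": added, "contributions": []}
```

The same 700 name lands in a different list depending only on whether a 100 is present, which is the inconsistency being asked about.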
It looks like treating 700s as contributors unless there are no main 1xx entries is deliberate behavior. I can't find anything that confirms whether this assumption is correct or incorrect. 700s seem flexible, and although there is provision for (multiple?) subfields to state the exact relationship of the name to the record (https://www.loc.gov/marc/bibliographic/bd700.html), these don't have to exist, and in practice often don't.
It seems like it works out, but there is a risk some contributors (illustrators / translators) might get added as authors, and conversely some equally responsible authors may get added as mere contributors.
It looks like there isn't a clear way to indicate all the possibilities in MARC, or at least actual cataloging practice varies considerably.
I won't change the 1xx / 7xx behavior in this PR. If a field is picked as an author rather than in the contributions list, and an 880 alternate script version exists, it will now be added to the author dict as an alternate_name, regardless of 1xx or 7xx.
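A rough sketch of the 880 pairing, assuming the fields passed in have already been picked as authors. `MarcField`, `link_key`, and `authors_with_alternates` are hypothetical names for illustration, and the `alternate_names` list key is an assumption about the author dict shape. The pairing relies on the MARC `$6` linkage subfield, where a name field carries something like `880-01` and the matching 880 carries `700-01` followed by script information.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class MarcField:
    """Hypothetical minimal field: a tag plus one value per subfield code."""
    tag: str
    subfields: Dict[str, str]


# $6 starts with "<linked tag>-<occurrence>", e.g. "880-01" or "700-01/(3/r".
LINK_RE = re.compile(r"^(\d{3})-(\d{2})")


def link_key(f: MarcField) -> Optional[Tuple[str, str]]:
    """Return (name_tag, occurrence) so a name field and its 880 share a key."""
    m = LINK_RE.match(f.subfields.get("6", ""))
    if not m:
        return None
    linked_tag, occurrence = m.groups()
    if occurrence == "00":  # "00" means there is no linked field
        return None
    return (linked_tag if f.tag == "880" else f.tag, occurrence)


def authors_with_alternates(fields: List[MarcField]) -> List[dict]:
    """Build author dicts, attaching the linked 880 name when one exists."""
    alternates = {}
    for f in fields:
        key = link_key(f)
        if f.tag == "880" and key and "a" in f.subfields:
            alternates[key] = f.subfields["a"]
    authors = []
    for f in fields:
        if f.tag in ("100", "700"):  # same handling for 1xx and 7xx
            author = {"name": f.subfields["a"]}
            key = link_key(f)
            if key in alternates:
                author["alternate_names"] = [alternates[key]]
            authors.append(author)
    return authors
```

Fields without a `$6`, or with no matching 880, simply produce an author dict with no alternate name.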
Contributions on editions are just plain-text lists of single names and don't have room for extra annotations. Work authors of type https://openlibrary.org/type/author_role look like they would handle this better, but the role field is not currently used by any of the imports (AFAIK).
Originally posted by @hornc in https://github.com/internetarchive/openlibrary/pull/7652#discussion_r1139885220
I've been keeping an eye out for these and I'm pretty sure it's currently being done wrong / sub-optimally. I'm not sure if different catalogers use different rules, but there definitely seem to be a number of instances where only the first author goes in the 100 and all the rest go in 700s. Of course, 7xx fields with a relator of "illustrator", etc., should stay in the contributions and not get promoted to authors.
Here are some examples that I've come across:
- Compare the 245$c `by_statement` with the 100 and 700 fields and the resulting OpenLibrary record: https://openlibrary.org/works/OL6044872W/Key_issues_in_the_new_knowledge_management?_compare=Comparer&b=5&a=4&m=diff https://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:940898024:809
- Similar case (arguably the 710s should be added along with the 700). There are four different MARC records with similar data. https://openlibrary.org/works/OL11150773W/Ground-water_data_for_West_Virginia_1974-84?_compare=Comparer&b=3&a=2&m=diff https://openlibrary.org/show-records/marc_oregon_summit_records/catalog_files/osu_bibs.mrc:769134252:1338
> It looks like treating 700s as contributors unless there is no main 1xx entries is deliberate behavior. I can't find anything that confirms either way that this is a correct or incorrect assumption.
The 1xx is the "Main entry" and what it is, and whether or not it exists, is determined by the cataloging rules which were used (e.g. AACR) which vary by time and geography. There's also the possibility that the cataloger didn't follow the rules that they were supposed to. Because there can only be a single 1xx, equal co-authors are always going to end up in 7xx fields.
Since OpenLibrary wants to list all authors, not just whoever is identified in the main entry, I think it makes sense to include all 7xx's EXCEPT those which can be clearly identified as non-author/editor contributors like illustrators, translators, etc.
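As a rough sketch of that rule (not a proposed implementation), a 7xx name could be kept as an author unless every role stated in its relator subfields is a known non-author role. `is_probable_author` and the two sets below are hypothetical and deliberately incomplete. `ill` and `trl` are the MARC relator codes for illustrator and translator, found in `$4`, and `$e` carries the relator term as text.

```python
from typing import Dict, List

# Assumed, deliberately incomplete sets of roles that mark a non-author.
NON_AUTHOR_CODES = {"ill", "trl"}                 # relator codes in $4
NON_AUTHOR_TERMS = {"illustrator", "translator"}  # relator terms in $e


def is_probable_author(subfields: Dict[str, List[str]]) -> bool:
    """Guess whether a 7xx name should be promoted to an author."""
    codes = {c.strip().lower() for c in subfields.get("4", [])}
    # $e terms often carry trailing punctuation, e.g. "illustrator."
    terms = {t.strip(" .,").lower() for t in subfields.get("e", [])}
    roles = codes | terms
    if not roles:
        # No relator given at all: assume an equal co-author.
        return True
    # Demote only when every stated role is a known non-author role.
    return not roles <= (NON_AUTHOR_CODES | NON_AUTHOR_TERMS)
```

Someone listed as both author and illustrator stays an author under this rule, and unknown relators default to author. Whether that default is right probably depends on the data source.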
I don't think it'll ever be possible to do it perfectly by reverse engineering human provided data with unknown cataloging rules, but I think it's possible to improve on the current situation.
Just taking a look at this again and I don't think the examples match the description. The Arabic/French example doesn't match the 100+700 description since it only has (three) 700s and a 710.
The binary MARC record from the test suite for the first example above is: https://github.com/hornc/openlibrary-1/blob/880_alternate_scripts/openlibrary/catalog/marc/tests/test_data/bin_input/880_arabic_french_many_linkages.mrc
There are two online examples which are easier to visualize: LC https://openlibrary.org/show-records/marc_loc_2016/BooksAll.2016.part37.utf8:212405343:2979 Columbia https://openlibrary.org/show-records/marc_columbia/Columbia-extract-20221130-017.mrc:84193714:3643
The binary for the second example is: https://github.com/hornc/openlibrary-1/blob/880_alternate_scripts/openlibrary/catalog/marc/tests/test_data/bin_input/880_Nihon_no_chasho.mrc and it's online at: https://openlibrary.org/show-records/marc_columbia/Columbia-extract-20221130-008.mrc:340428848:1828
In addition to the $0's for authors, which we already discuss in #7724, these show the possibility of adding dates (one of the authors has a new death date) and alternate script names to existing author records. I'm not sure if attempting to improve/upgrade existing author records is something that should be done, but it's worth considering.
@tfmorris + @hornc is there a proposal you might like us to consider? Grateful for the investigation, and we would be open to any steps we might be able to take to make this issue actionable (it's unclear to me what steps might result in resolution).
@mekarpeles, it's been a while, but I think the next steps for this are to merge #9797 which makes many of the specific improvements that have been raised, but leaves some room for improving 700s depending on the data sources.
#9797 stalled on test framework issues, but those are no longer blocking it.
The bulk of my suggestions are in https://github.com/internetarchive/openlibrary/issues/7723#issuecomment-2305862018, many of which are addressed by @hornc's PR #9797, which I just re-reviewed at the request of @scottbarnes.