Amazon author + translator imported as single conflated author
Problem
This author: https://openlibrary.org/authors/OL9912016A.json was imported from Amazon in Nov 2021 with the obviously conflated name of "Rachel Kushner Suat Ertuzun."
While there are thousands upon thousands of conflated author records imported from booksellers BWB (especially) and Amazon, for a wide variety of reasons, this record actually has author and translator listed separately: https://www.amazon.com/gp/product/975072545X where it says "by Rachel Kushner (Author), Suat Ertüzün (Translator)," yet they were imported munged together.
Since this is a nicely specific example, hopefully it will be easy to fix.
Reproducing the bug
- Go to the link above
- Expected behavior: Each author record represents a single author without translators, illustrators, etc mixed in
- Actual behavior: Thousands and thousands of conflated records constantly being created
Context
- Browser (Chrome, Safari, Firefox, etc): not relevant
- OS (Windows, Mac, etc):
- Logged in (Y/N):
- Environment (prod, dev, local): prod
Breakdown
Note: this may not be the easiest issue to work on because it isn't trivial to test, as actually running this code relies on running BookWorm/the affiliate server, which can take some work.
Before tackling this issue, you'd at least want to understand how serialize() gets data from the Amazon Products API, how that data flows through to clean_amazon_metadata_for_load(), and how to hardcode your own test data into this process to test the flow between the functions, even if you don't ultimately run BookWorm itself in the affiliate-server container.
Currently, the serialize() function from BookWorm/the affiliate server is returning the following information for https://www.amazon.com/gp/product/975072545X:
{'authors': [{'name': 'Rachel Kushner'}, {'name': 'Suat Ertüzün'}],
'cover': 'https://m.media-amazon.com/images/I/51gky1d3IWL._SL500_.jpg',
'edition_num': None,
'isbn_10': ['975072545X'],
'isbn_13': ['9789750725456'],
'number_of_pages': 440,
'physical_format': 'paperback',
'price': '$35.00',
'price_amt': 3500,
'product_group': 'Book',
'publish_date': 'Apr 01, 2015',
'publishers': ['Can Yayınları'],
'source_records': ['amazon:975072545X'],
'title': 'Kübadan Teleks',
'url': 'https://www.amazon.com/dp/975072545X/?tag='}
As noted above, both author and translator have been lumped together as authors.
However, the metadata that comes back from the Amazon Products API lists both people with their own distinct roles, as shown by this excerpt of the by_line_info:
{'by_line_info': {'brand': None,
'contributors': [{'locale': 'en_US',
'name': 'Rachel Kushner',
'role': 'Author'},
{'locale': 'en_US',
'name': 'Suat Ertüzün',
'role': 'Translator'}],
We will need to update the serialize() function (around line 271 of openlibrary/core/vendors.py) so that it stops importing translators as authors and instead extracts them under a new, separate key for translators.
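To make the role split concrete, here is a minimal sketch, not the actual serialize() code. The helper name split_contributors and the 'translators' key are hypothetical, and the input is the dict form of the contributors list from the by_line_info excerpt (the real function works on the API response object, so field access would likely differ). Treating a contributor with no role as an author is also an assumption.

```python
def split_contributors(contributors):
    """Separate Amazon contributors into authors and translators by role.

    Hypothetical helper for illustration; not part of openlibrary/core/vendors.py.
    """
    authors = []
    translators = []
    for contributor in contributors or []:
        name = contributor.get("name")
        if not name:
            continue
        role = (contributor.get("role") or "").strip().lower()
        if role == "translator":
            translators.append({"name": name})
        elif role in ("author", ""):
            # Assumption: a contributor with no role is treated as an author.
            authors.append({"name": name})
        # Other roles (editor, illustrator, ...) are out of scope for this sketch.
    # Each person stays a separate entry; names are never joined together.
    return {"authors": authors, "translators": translators}


# Dict form of the contributors from the by_line_info excerpt.
contributors = [
    {"locale": "en_US", "name": "Rachel Kushner", "role": "Author"},
    {"locale": "en_US", "name": "Suat Ertüzün", "role": "Translator"},
]
result = split_contributors(contributors)
```

With the excerpt as input, the author and the translator end up in separate lists instead of both landing under 'authors'.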
This will also require further changes to ensure this ends up in the correct format for import. This would likely be done in clean_amazon_metadata_for_load(), though it may involve other changes.
Specifically, the translator(s) should end up as contributors with a translator role. See import_contributor in https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json and https://openlibrary.org/books/OL24337004M.json (from https://openlibrary.org/books/OL24337004M/The_Odyssey_of_Homer)
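A rough sketch of that reshaping step, under stated assumptions: serialize() has been changed to emit a hypothetical 'translators' key holding a list of {'name': ...} dicts, and the import format takes contributors as a list of dicts with 'name' and 'role' (check this against the schema and example record linked above). The function name is made up for illustration; the real change would presumably live in or near clean_amazon_metadata_for_load().

```python
def translators_to_contributors(metadata):
    """Return a copy of metadata with translators moved into contributors.

    Hypothetical helper; the 'translators' key is an assumed output of an
    updated serialize(), not something the current code produces.
    """
    # Copy everything except the intermediate 'translators' key.
    cleaned = {key: value for key, value in metadata.items() if key != "translators"}
    translators = metadata.get("translators") or []
    if translators:
        # Preserve any contributors already present, then append translators.
        contributors = list(metadata.get("contributors") or [])
        contributors.extend(
            {"name": translator["name"], "role": "Translator"}
            for translator in translators
        )
        cleaned["contributors"] = contributors
    return cleaned


# Assumed shape of the updated serialize() output for the example book.
serialized = {
    "authors": [{"name": "Rachel Kushner"}],
    "translators": [{"name": "Suat Ertüzün"}],
    "title": "Kübadan Teleks",
}
cleaned = translators_to_contributors(serialized)
```

Returning a new dict rather than mutating the input keeps the serialize() output reusable as a fixture in a separate test.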
As a test case, here's the full metadata that comes back from the Amazon API, which could be used for mocking a response from the API:
{'asin': '975072545X', 'browse_node_info': None, 'detail_page_url': 'https://www.amazon.com/dp/975072545X?tag=internetarchi-20&linkCode=ogi&th=1&psc=1', 'images': {'primary': {'large': {'height': 500, 'url': 'https://m.media-amazon.com/images/I/51gky1d3IWL._SL500_.jpg', 'width': 321}, 'medium': None, 'small': None}, 'variants': None}, 'item_info': {'by_line_info': {'brand': None, 'contributors': [{'locale': 'en_US', 'name': 'Rachel Kushner', 'role': 'Author'}, {'locale': 'en_US', 'name': 'Suat Ertüzün', 'role': 'Translator'}], 'manufacturer': {'display_value': 'Can Yayınları', 'label': 'Manufacturer', 'locale': 'en_US'}}, 'classifications': {'binding': {'display_value': 'Paperback', 'label': 'Binding', 'locale': 'en_US'}, 'product_group': {'display_value': 'Book', 'label': 'ProductGroup', 'locale': 'en_US'}}, 'content_info': {'edition': None, 'languages': {'display_values': [{'display_value': 'Turkish', 'type': 'Published'}, {'display_value': 'Turkish', 'type': 'Original Language'}, {'display_value': 'Turkish', 'type': 'Unknown'}], 'label': 'Language', 'locale': 'en_US'}, 'pages_count': {'display_value': 440, 'label': 'NumberOfPages', 'locale': 'en_US'}, 'publication_date': {'display_value': '2015-04-01T00:00:00Z', 'label': 'PublicationDate', 'locale': 'en_US'}}, 'content_rating': None, 'external_ids': None, 'features': None, 'manufacture_info': {'item_part_number': {'display_value': '1', 'label': 'PartNumber', 'locale': 'en_US'}, 'model': None, 'warranty': None}, 'product_info': {'color': None, 'is_adult_product': None, 'item_dimensions': {'height': {'display_value': 7.6771653465, 'label': 'Height', 'locale': 'en_US', 'unit': 'inches'}, 'length': {'display_value': 0.393700787, 'label': 'Length', 'locale': 'en_US', 'unit': 'inches'}, 'weight': {'display_value': 0.7495716908, 'label': 'Weight', 'locale': 'en_US', 'unit': 'pounds'}, 'width': {'display_value': 4.9212598375, 'label': 'Width', 'locale': 'en_US', 'unit': 'inches'}}, 'release_date': None, 'size': None, 'unit_count': None}, 'technical_info': None, 'title': {'display_value': 'Kübadan Teleks', 'label': 'Title', 'locale': 'en_US'}, 'trade_in_info': None}, 'offers': {'listings': [{'availability': None, 'condition': None, 'delivery_info': None, 'id': 'oDg8%2Fu%2BR%2FL0WLzvFujN6xVbdeurVK9TOjcknlzQiZlCuRNpUE%2BqIJpTOjvUB2ZhJOrwvkyrZ%2FBQkPOUs9mYR6u4kbYl%2FK%2B4NUving6%2FRYRFFh5eqNUCw%2Fqk8%2F5ms3Jl%2FfP90CHh0Eaxlt1R9eXt%2FApgY%2BhG%2BSHueV%2F32lAWJ%2B2yH1ObOScrdwa3UJMG1AMHb', 'is_buy_box_winner': None, 'loyalty_points': None, 'merchant_info': None, 'price': {'amount': 35.0, 'currency': 'USD', 'display_amount': '$35.00', 'price_per_unit': None, 'savings': None}, 'program_eligibility': None, 'promotions': None, 'saving_basis': None, 'violates_map': False}], 'summaries': None}, 'parent_asin': None, 'rental_offers': None, 'score': None, 'variation_attributes': None}
Requirements Checklist
- [ ] modify serialize() in openlibrary/core/vendors.py to no longer treat translators as authors.
- [ ] modify clean_amazon_metadata_for_load() to handle translators as contributors.
- [ ] Write some tests. Rather than mock the entire process, it might make sense to chain together each part. E.g. feed the sample data above into serialize() and make sure you get the data you want (e.g. with the translator and author properly segmented), then when that works, use the output data from serialize() as input in a separate test for clean_amazon_metadata_for_load(). Then you can take whatever that returns and just make sure it imports properly (see import docs); we can probably call it a day at that point.
Related files
- https://github.com/internetarchive/openlibrary/blob/master/openlibrary/core/vendors.py
- https://github.com/internetarchive/openlibrary/blob/master/openlibrary/tests/core/test_vendors.py
Stakeholders
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.
Relates to #3084. That issue references editors in the title, but the discussion has examples where Amazon also calls out translator and editor roles in the same way.
I didn't really highlight the biggest problem here, which is that the two individuals Amazon lists separately, each with an individual role, are being imported as a single author. I've updated the title to better reflect what's going on.
I added some more details on how this issue might be approached, and included sample response data from the Amazon Products API with which to work. This is probably not a trivial issue to tackle and would make a fairly bad first issue.
Hi @scottbarnes I would like to try and tackle this one as well. I think it might be somewhat related to one of the last issues I worked on. Thank you kindly!
Thanks for offering to work on this, @DebbieSan! If you have any questions, please ask.
@scottbarnes will do :) thank you!
Hey @scottbarnes just as a follow up, I am currently working on this and should have a PR soon.
Thank you kindly and have a great new year ahead!
Note that, although the example has two different contribution roles, multiple contributors shouldn't get munged together even if they have the same role (or no role).