Amazon author + translator imported as single conflated author

Open tfmorris opened this issue 1 year ago • 6 comments

Problem

This author: https://openlibrary.org/authors/OL9912016A.json was imported from Amazon in Nov 2021 with the obviously conflated name of "Rachel Kushner Suat Ertuzun."

Thousands upon thousands of conflated author records have been imported from the booksellers BWB (especially) and Amazon, for a wide variety of reasons. In this case, however, the source lists the author and translator separately: https://www.amazon.com/gp/product/975072545X says "by Rachel Kushner (Author), Suat Ertüzün (Translator)," yet the two were munged together into a single author on import.

Since this is a nicely specific example, hopefully it will be easy to fix.

Reproducing the bug

  1. Go to the link above
  2. Do ...
  • Expected behavior: Each author record represents a single author, without translators, illustrators, etc. mixed in
  • Actual behavior: Thousands and thousands of conflated records constantly being created

Context

  • Browser (Chrome, Safari, Firefox, etc): not relevant
  • OS (Windows, Mac, etc):
  • Logged in (Y/N):
  • Environment (prod, dev, local): prod

Breakdown

Note: this may not be the easiest issue to work on because it isn't trivial to test, as actually running this code relies on running BookWorm/the affiliate server, which can take some work.

Before tackling this issue you'd at least want to understand how serialize() gets data from the Amazon Products API, and how it flows through to clean_amazon_metadata_for_load(), and how you can hardcode your own test data into this process to test the flow between the functions, even if you don't ultimately run BookWorm itself in the affiliate-server container.

Currently, the serialize() function from BookWorm/the affiliate server is returning the following information for https://www.amazon.com/gp/product/975072545X:

{'authors': [{'name': 'Rachel Kushner'}, {'name': 'Suat Ertüzün'}],
  'cover': 'https://m.media-amazon.com/images/I/51gky1d3IWL._SL500_.jpg',
  'edition_num': None,
  'isbn_10': ['975072545X'],
  'isbn_13': ['9789750725456'],
  'number_of_pages': 440,
  'physical_format': 'paperback',
  'price': '$35.00',
  'price_amt': 3500,
  'product_group': 'Book',
  'publish_date': 'Apr 01, 2015',
  'publishers': ['Can Yayınları'],
  'source_records': ['amazon:975072545X'],
  'title': 'Kübadan Teleks',
  'url': 'https://www.amazon.com/dp/975072545X/?tag='}

As noted above, both author and translator have been lumped together as authors.

However, the metadata that comes back from the Amazon Products API lists each contributor with their own role, as shown by this excerpt of the by_line_info:

{'by_line_info': {'brand': None,
                  'contributors': [{'locale': 'en_US',
                                    'name': 'Rachel Kushner',
                                    'role': 'Author'},
                                   {'locale': 'en_US',
                                    'name': 'Suat Ertüzün',
                                    'role': 'Translator'}],

We will need to update the serialize() function, around line 271 or so in openlibrary/core/vendors.py, to stop importing contributors with a translator role as authors, and to extract translators separately under a new key.
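A minimal sketch of the role-based split described above. It is not the actual serialize() code: split_contributors is a hypothetical helper, the contributors are plain dicts shaped like the by_line_info excerpt (the real API objects may use attribute access instead), and treating every non-translator role as an author is an assumption that mirrors current behavior.

```python
def split_contributors(contributors):
    """Split contributor dicts into (authors, translators) by role.

    Hypothetical helper, not part of openlibrary/core/vendors.py.
    """
    authors, translators = [], []
    for contributor in contributors or []:
        name = (contributor.get("name") or "").strip()
        if not name:
            continue  # skip entries without a usable name
        role = (contributor.get("role") or "").strip().lower()
        # One output entry per contributor: names are never joined together.
        target = translators if role == "translator" else authors
        target.append({"name": name})
    return authors, translators


contributors = [
    {"locale": "en_US", "name": "Rachel Kushner", "role": "Author"},
    {"locale": "en_US", "name": "Suat Ertüzün", "role": "Translator"},
]
authors, translators = split_contributors(contributors)
print(authors)      # [{'name': 'Rachel Kushner'}]
print(translators)  # [{'name': 'Suat Ertüzün'}]
```

Because each contributor produces its own entry, two contributors with the same role (or with no role) also stay separate.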

This will also require further changes to ensure this ends up in the correct format for import. This would likely be done in clean_amazon_metadata_for_load(), though it may involve other changes.

Specifically, the translator(s) should end up as contributors with a translator role. See import_contributor in https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json and https://openlibrary.org/books/OL24337004M.json (from https://openlibrary.org/books/OL24337004M/The_Odyssey_of_Homer)
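As a rough illustration of that target shape, the sketch below turns a hypothetical 'translators' key (assumed to be emitted by an updated serialize()) into contributor entries with a translator role. The function name and the exact {'name', 'role'} entry shape are assumptions based on the linked edition record; check them against import_contributor in the schema before relying on them.

```python
def translators_to_contributors(metadata):
    """Return a copy of metadata with translators moved into 'contributors'.

    Hypothetical helper; the 'translators' input key is an assumption.
    """
    record = {key: value for key, value in metadata.items() if key != "translators"}
    contributors = list(record.get("contributors", []))
    for translator in metadata.get("translators", []):
        contributors.append({"name": translator["name"], "role": "Translator"})
    if contributors:
        record["contributors"] = contributors
    return record


serialized = {
    "title": "Kübadan Teleks",
    "authors": [{"name": "Rachel Kushner"}],
    "translators": [{"name": "Suat Ertüzün"}],  # assumed new key
}
record = translators_to_contributors(serialized)
print(record["authors"])       # [{'name': 'Rachel Kushner'}]
print(record["contributors"])  # [{'name': 'Suat Ertüzün', 'role': 'Translator'}]
```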

As a test case, here's the full metadata that comes back from the Amazon API, which could be used for mocking a response from the API:

{'asin': '975072545X',
 'browse_node_info': None,
 'detail_page_url': 'https://www.amazon.com/dp/975072545X?tag=internetarchi-20&linkCode=ogi&th=1&psc=1',
 'images': {'primary': {'large': {'height': 500,
                                  'url': 'https://m.media-amazon.com/images/I/51gky1d3IWL._SL500_.jpg',
                                  'width': 321},
                        'medium': None,
                        'small': None},
            'variants': None},
 'item_info': {'by_line_info': {'brand': None,
                                'contributors': [{'locale': 'en_US', 'name': 'Rachel Kushner', 'role': 'Author'},
                                                 {'locale': 'en_US', 'name': 'Suat Ertüzün', 'role': 'Translator'}],
                                'manufacturer': {'display_value': 'Can Yayınları', 'label': 'Manufacturer', 'locale': 'en_US'}},
               'classifications': {'binding': {'display_value': 'Paperback', 'label': 'Binding', 'locale': 'en_US'},
                                   'product_group': {'display_value': 'Book', 'label': 'ProductGroup', 'locale': 'en_US'}},
               'content_info': {'edition': None,
                                'languages': {'display_values': [{'display_value': 'Turkish', 'type': 'Published'},
                                                                 {'display_value': 'Turkish', 'type': 'Original Language'},
                                                                 {'display_value': 'Turkish', 'type': 'Unknown'}],
                                              'label': 'Language',
                                              'locale': 'en_US'},
                                'pages_count': {'display_value': 440, 'label': 'NumberOfPages', 'locale': 'en_US'},
                                'publication_date': {'display_value': '2015-04-01T00:00:00Z', 'label': 'PublicationDate', 'locale': 'en_US'}},
               'content_rating': None,
               'external_ids': None,
               'features': None,
               'manufacture_info': {'item_part_number': {'display_value': '1', 'label': 'PartNumber', 'locale': 'en_US'},
                                    'model': None,
                                    'warranty': None},
               'product_info': {'color': None,
                                'is_adult_product': None,
                                'item_dimensions': {'height': {'display_value': 7.6771653465, 'label': 'Height', 'locale': 'en_US', 'unit': 'inches'},
                                                    'length': {'display_value': 0.393700787, 'label': 'Length', 'locale': 'en_US', 'unit': 'inches'},
                                                    'weight': {'display_value': 0.7495716908, 'label': 'Weight', 'locale': 'en_US', 'unit': 'pounds'},
                                                    'width': {'display_value': 4.9212598375, 'label': 'Width', 'locale': 'en_US', 'unit': 'inches'}},
                                'release_date': None,
                                'size': None,
                                'unit_count': None},
               'technical_info': None,
               'title': {'display_value': 'Kübadan Teleks', 'label': 'Title', 'locale': 'en_US'},
               'trade_in_info': None},
 'offers': {'listings': [{'availability': None,
                          'condition': None,
                          'delivery_info': None,
                          'id': 'oDg8%2Fu%2BR%2FL0WLzvFujN6xVbdeurVK9TOjcknlzQiZlCuRNpUE%2BqIJpTOjvUB2ZhJOrwvkyrZ%2FBQkPOUs9mYR6u4kbYl%2FK%2B4NUving6%2FRYRFFh5eqNUCw%2Fqk8%2F5ms3Jl%2FfP90CHh0Eaxlt1R9eXt%2FApgY%2BhG%2BSHueV%2F32lAWJ%2B2yH1ObOScrdwa3UJMG1AMHb',
                          'is_buy_box_winner': None,
                          'loyalty_points': None,
                          'merchant_info': None,
                          'price': {'amount': 35.0, 'currency': 'USD', 'display_amount': '$35.00', 'price_per_unit': None, 'savings': None},
                          'program_eligibility': None,
                          'promotions': None,
                          'saving_basis': None,
                          'violates_map': False}],
            'summaries': None},
 'parent_asin': None,
 'rental_offers': None,
 'score': None,
 'variation_attributes': None}
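One possible way to use a dict like this as a mock, assuming the code under test reads the API response through attribute access (for example product.item_info.by_line_info) rather than dict lookups. The to_namespace helper is hypothetical, and only an excerpt of the response is used here to keep the sketch short.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert dicts to SimpleNamespace so keys become attributes."""
    if isinstance(value, dict):
        return SimpleNamespace(**{key: to_namespace(item) for key, item in value.items()})
    if isinstance(value, list):
        return [to_namespace(item) for item in value]
    return value


# Excerpt of the sample response above; paste the full dict for a complete mock.
sample = {
    "asin": "975072545X",
    "item_info": {
        "by_line_info": {
            "brand": None,
            "contributors": [
                {"locale": "en_US", "name": "Rachel Kushner", "role": "Author"},
                {"locale": "en_US", "name": "Suat Ertüzün", "role": "Translator"},
            ],
        },
    },
}
product = to_namespace(sample)
for contributor in product.item_info.by_line_info.contributors:
    print(contributor.name, contributor.role)
# Rachel Kushner Author
# Suat Ertüzün Translator
```

If the real response objects behave differently (for instance, if they expose methods), a dedicated mock class may be a better fit than SimpleNamespace.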

Requirements Checklist

  • [ ] modify serialize() in openlibrary/core/vendors.py to no longer treat translators as authors.
  • [ ] modify clean_amazon_metadata_for_load() to handle translators as contributors.
  • [ ] Write some tests. Rather than mock the entire process, it might make sense to test each part in sequence. E.g. feed the sample data above into serialize() and make sure you get the data you want (with the translator and author properly separated). Once that works, use the output from serialize() as the input to a separate test for clean_amazon_metadata_for_load(). Then take whatever that returns and make sure it imports properly (see the import docs), and we can probably call it a day at that point.

Related files

  • https://github.com/internetarchive/openlibrary/blob/master/openlibrary/core/vendors.py
  • https://github.com/internetarchive/openlibrary/blob/master/openlibrary/tests/core/test_vendors.py

Stakeholders


Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to Github, because the pre-commit bot may add commits to your PRs upstream.

tfmorris, Sep 17 '24 16:09

Relates to #3084. That issue references editors in its title, but the discussion has examples where Amazon also calls out translator and editor roles in the same way.

hornc, Sep 18 '24 02:09

I didn't really highlight the biggest problem here, which is that two individuals whom Amazon lists separately, each with their own role, are being imported as a single author. I've updated the title to better reflect what's going on.

tfmorris, Sep 18 '24 14:09

I added some more details on how this issue might be approached, and included sample response data from the Amazon Products API to work with. This is probably not a trivial issue to tackle and would make a fairly bad first issue.

scottbarnes, Sep 30 '24 15:09

Hi @scottbarnes I would like to try and tackle this one as well. I think it might be somewhat related to one of the last issues I worked on. Thank you kindly!

DebbieSan, Oct 01 '24 02:10

Thanks for offering to work on this, @DebbieSan! If you have any questions, please ask.

scottbarnes, Oct 01 '24 14:10

@scottbarnes will do :) thank you!

DebbieSan, Oct 01 '24 14:10

Hey @scottbarnes, just as a follow-up: I am currently working on this and should have a PR soon.

Thank you kindly and have a great new year ahead!

DebbieSan, Dec 29 '24 00:12

Note that, although the example has two different contribution roles, multiple contributors shouldn't get munged together even if they have the same role (or no role).

tfmorris, Dec 29 '24 01:12