openlibrary icon indicating copy to clipboard operation
openlibrary copied to clipboard

Update Import API to strip `-` from ISBNs

Open RayBB opened this issue 9 months ago • 5 comments

Problem

I ran this code:

await fetch("/api/import", {
  method: "POST",
  body: JSON.stringify({
    identifiers: { isbn_13: ["978-3-943075-58-8"] },
    title: "UX Design Process",
    languages: ["eng"],
    publishers: ["Smashing Media AG"],
    publish_date: "January 2013",
    cover:
      "https://archive.smashing.media/assets/344dbf88-fdf9-42bb-adb4-46f01eedd629/1ccfff60-c629-4973-bf5c-2997266b2e5c/ux-design-process-cover-large-opt.png",
    links: [
      {
        url: "https://www.smashingmagazine.com/ebooks/ux-design-process/",
        title: "homepage",
      },
    ],
    physical_format: "ePUB, Kindle, PDF",
    authors: [{ name: "Smashing Media AG" }],
    number_of_pages: "88",
    source_records: ["raybb_web_scraping"],
  }),
});

However, when I looked at the book it had the ISBN set to "978-3-943075-58-8" and when I visited https://openlibrary.org/isbn/978-3-943075-58-8 it didn't show that the book was available. So then I went and edited the book to add the ISBN again and it auto stripped the - and it started showing up at that endpoint.

So, I think we should probably strip the ISBNs before saving from the import endpoint.

Edit: I now see that even when I import without the - it still doesn't show up until after I edit.

Evidence / Screenshot

Relevant URL(s)

Reproducing the bug

  1. Go to ...
  2. Do ...
  • Expected behavior:
  • Actual behavior:

Context

No response

Notes from this Issue's Lead

Proposal & constraints

Related files

Stakeholders

RayBB avatar Apr 28 '24 18:04 RayBB

@RayBB, this one is puzzling me a bit. Are you able to reproduce this on the current master branch in the local development environment?

I just tried:

❯ curl -X POST \
     -H "Content-Type: application/json" \
     -b ~/cookies.txt \
     -d '{
        "title": "UX Design Process",
        "source_records": ["smashingmagazine:isbn_here_but_omitted_to_ensure_test_only_looks_at_isbn_field"],
        "authors": [{"name": "Smashing Media AG"}],
        "publishers": ["Smashing Media AG"],
        "isbn_13": ["978-3-943075-58-8"],
        "publish_date": "January 2013",
        "number_of_pages": 88
     }' \
     http://localhost:8080/api/import
{"authors": [{"key": "/authors/OL15A", "name": "Smashing Media AG", "status": "created"}], "success": true, "edition": {"key": "/books/OL28M", "status": "created"}, "work": {"key": "/works/OL20W", "status": "created"}}

I could then visit http://localhost:8080/isbn/978-3-943075-58-8 and I was forwarded to http://localhost:8080/books/OL28M/UX_Design_Process, which had "isbn_13": ["9783943075588"] in the JSON.

I looked at the code for /api/import, and it looks as if it has been running canonical() from isbnlib for quite some time.

On the day this issue was opened there was a commit (https://github.com/internetarchive/openlibrary/commit/5f7d8d190e2f0d837545e582fd5db99aae51a979) that changed some of the code for /isbn, but canonical() was already running on the ISBN input there too.

I have noticed sometimes it can take a little bit for the backend to 'catch up' when a new edition is added, in terms of the search. If you can't replicate this on the local environment, do you think this 'lag' could have been part of it, and maybe the editing and then having the book forward was a coincidence?

scottbarnes avatar May 11 '24 02:05 scottbarnes

@scottbarnes I didn't try with curl (I don't have the cookies file) But I did run the fetch in my main post on local in the browser and:

  1. The isbn url still 404s (http://localhost:8080/isbn/978-3-943075-58-8)
  2. The work page has the isbn with the - in it still

However, this really doesn't seem like a big issue because on prod is it probably due to some lag in processing and/or cache.

image

RayBB avatar May 13 '24 11:05 RayBB

@RayBB, I keep coming back to this, confused, but I think I finally understand what's happening. isbn_13 is its own field and doesn't go under identifiers when importing, even though when editing a book in the UI ISBN 10 and ISBN 13 are found under ID Numbers, which in many cases updates identifiers.

In the edition schema, one can see this a bit more clearly.

I think this can probably be closed now, but I want to double check with you.

scottbarnes avatar May 21 '24 16:05 scottbarnes

@scottbarnes I think the question now is do we expect that field to be stripped of -? In the edition edit UI it auto strips them and I suppose that was for a reason. However, my original problem is resolved so if the above isn't wroth looking into then yes lets close it!

RayBB avatar May 21 '24 16:05 RayBB

We may be talking past each other slightly, so I will step back a bit, but in doing so I may just be repeating things you already know, and I realize I may simply be misunderstanding the question, so my apologies if that's the case.

I think the hyphens weren't being stripped from the ISBN in this particular import because the ISBN was included under identifiers, with identifiers: { isbn_13: ["978-3-943075-58-8"] }; but the schema expects isbn_10 and isbn_13 to be a separate field from identifiers, such as isbn_13: ["978-3-943075-58-8"], even though the ISBNs display under the "ID Numbers" section of the UI.

When you edited the ISBN for UX Design, did you do anything to cause the ISBN to be moved to the correct field, or did that happen automatically? https://openlibrary.org/books/OL51673812M/UX_Design_Process?_compare=Compare&b=2&a=1&m=diff.

Though importing ISBNs under the identifiers field will display correctly in the UI, the logic of the site expects the ISBNs to be under isbn_10 and isbn_13 and not under identifiers.

scottbarnes avatar May 21 '24 16:05 scottbarnes

When you edited the ISBN for UX Design, did you do anything to cause the ISBN to be moved to the correct field, or did that happen automatically?

I don't recall for certain. If anything I just removed the identifier and added it back right away (because the client side strips the - when you re-add it).

Interesting that it expects it to be a separate field and not under identifiers. I guess the root of this is that my import statement wasn't following best practices.

RayBB avatar May 22 '24 10:05 RayBB

I will go ahead and close this, then. I do agree it's curious that ISBNs, which are definitely identifiers, are not included in identifiers. With the benefit of hindsight, I suspect this might have been done differently.

scottbarnes avatar May 22 '24 15:05 scottbarnes