openlibrary icon indicating copy to clipboard operation
openlibrary copied to clipboard

Basic ISBN hyphenation on edition view

Open hornc opened this issue 3 years ago • 6 comments

Test branch to demonstrate ISBN hyphenation already present in isbnlib.

Uses the already present isbnlib to hyphenate ISBNs on the edition view template.

Technical

Before merging, this will need to handle un-hyphenatable ISBNs (just return the raw string, but possibly notify that there is likely a problem with the ISBN).

Perhaps move it into a util / helper function.

Relates to #6634

A helper function which hyphenates would also be useful for #6516

Testing

An ISBN like number (with validating checkdigit) which cannot be hyphenated using the current (@ June 2022) official ISBN rangefiles is: 9786400042628

isbnlib.mask('9786400042628') returns None, not an exception.

Invalid check digit ISBNs :

>>> mask('9786400042621')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/isbnlib/_ext.py", line 20, in mask
    return msk(isbn, separator)
  File "/usr/local/lib/python3.6/dist-packages/isbnlib/_msk.py", line 28, in msk
    raise NotValidISBNError(isbn)
isbnlib._exceptions.NotValidISBNError: (9786400042621) is not a valid ISBN

This case will also need to be handled.

Screenshot

What the Book details page of an edition page looks like with this change:

image

(local URL: http://localhost:8080/books/OL40M )

Updated testing Showing invalid ISBNs left alone:

image

Stakeholders

@seabelis

hornc avatar Jun 15 '22 21:06 hornc

Do we have a sample list of invalid ISBNs along with some sense of how this code should handle them?

cclauss avatar Sep 10 '22 15:09 cclauss

Here's one published but 'bad' ISBN: 7204051613 (https://worldcat.org/isbn/7204051613)

In a separate thing I am working on, I used isbnlib to find 'bad' ISBNs in the Open Library editions dump, there are around 140k editions with bad ISBNs, but there are a number of identical bad ISBNs. I've attached the output, but here's a sampling of 'bad' ISBNs per isbnlib:

2022-09-06 19:27:07.266863: Invalid ISBNs for OL20889210M: ['050'] None
2022-09-06 19:27:07.294719: Invalid ISBNs for OL14334755M: ['915502175'] None
2022-09-06 19:27:07.295268: Invalid ISBNs for OL16280533M: ['9754148194'] None
2022-09-06 19:27:07.295616: Invalid ISBNs for OL22026037M: ['085480045'] None
2022-09-06 19:27:07.296608: Invalid ISBNs for OL22041640M: ['906193086'] None
2022-09-06 19:27:07.297331: Invalid ISBNs for OL16300049M: ['multi046'] None
2022-09-06 19:27:07.297885: Invalid ISBNs for OL21299995M: ['081210420'] None
2022-09-06 19:27:07.298976: Invalid ISBNs for OL16321861M: ['9990881344'] None
2022-09-06 19:27:07.299906: Invalid ISBNs for OL22102597M: ['0860941569'] None

Here's the lot of them: report_bad_isbns.txt

Rather than calling isbnlib.mask() directly, what about calling a function that attempts to hyphenate an ISBN, and if that doesn't work, simply returns the ISBN. E.g.:

import isbnlib
from isbnlib._exceptions import NotValidISBNError


def hyphenate_isbn(isbn: str) -> str:
    """Return a hypenated ISBN, or the input ISBN if it can't be hypenated"""
    try:
        return isbnlib.mask(isbn)
    except NotValidISBNError:
        return isbn

Output:

In [2]: hyphenate_isbn("multi046")  # invalid
Out[2]: 'multi046'
In [3]: hyphenate_isbn("915502175")  # invalid
Out[3]: '915502175'
In [4]: hyphenate_isbn("344203275X")  # valid
Out[4]: '3-442-03275-X'

scottbarnes avatar Sep 12 '22 02:09 scottbarnes

@xlcnd Do you have any advise for us on testing these ISBN data quality and hyphenation issues?

cclauss avatar Sep 12 '22 06:09 cclauss

There are a lot of 'bad' ISBNs in databases and catalogues, usually because of an error in the printing of a book, errors in applying the hyphenation rules, and even because a publisher goes rogue! Since the ISBN is used as an identifier for logistics and sales, the publishers normally don't correct the mistake.

So you need to be careful in correcting these mistakes.

xlcnd avatar Sep 12 '22 08:09 xlcnd

Thanks for the reviews all, I have used the invalid examples to add tests for a new util function.

I've imported ISBNLibException because that is the public interface, and handled the case Where isbnlib.mask() returns an empty string if the checkdigit and ISBN are otherwise valid, but no rules exist to hyphenate it , e.g.:

isbnlib.mask('9789999334525')
''

That's a made up example, I think 9786400042628 is a "real" example found in the wild. I don't really trust these ISBNs as they have been chosen from unallocated therefore unofficial blocks, and may be reallocated in future.

isbnlib won't hyphenate invalid checkdigit ISBNs, even if rules exist to otherwise hyphenate them. I've added a xfail test to make that behaviour explicit, if isbnlib changes, or Open Library wants to handle it differently.

hornc avatar Sep 13 '22 04:09 hornc

This was discussed today. The fixes the the catalog are absolutely appreciated though @seabelis, @cdrini, myself, and others feel that adding formatting to the ISBN presentation makes it more difficult for many librarians to work with. Goodreads, worldcat, and several other prominent platforms don't stylize ISBNs with hyphens.

Given these details, I'd appreciate if we can hold off on making the UI part of this change-set for now.

mekarpeles avatar Sep 20 '22 23:09 mekarpeles

Ok @mekarpeles this was meant to be an experiment to show what it would look like and prompt discussion, and demonstrate that the existing isbnlib tool used can do the job, so the effort to add the feature is not high.

It has shown isbnlib can do the task, and has prompted discussion, so that' s good :)

hornc avatar Sep 28 '22 19:09 hornc

Ok ty!

mekarpeles avatar Oct 04 '22 19:10 mekarpeles