Fix future import dates

Open scottbarnes opened this issue 1 year ago • 7 comments

Problem

Based on parsing the 2024-04-30 all types dump, there are at least 16k editions with publish_date fields that have future years.

The publish_date field should be removed from these editions.

Note: imports with such dates should no longer be possible.
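A minimal sketch of how such editions might be found. It assumes the dump is tab-separated with the record JSON in the fifth column, and that a "future year" means any four-digit number in `publish_date` greater than the current year. The function name `find_future_editions` and the year regex are illustrative only; this is not the script used to produce the count above.

```python
import json
import re
from datetime import date

# Assumption: any standalone run of exactly four digits is a candidate year.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def find_future_editions(lines, current_year=None):
    """Yield (key, publish_date) for editions whose publish_date has a future year.

    Assumes each line is: type, key, revision, last_modified, JSON (tab-separated).
    """
    if current_year is None:
        current_year = date.today().year
    for line in lines:
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5 or parts[0] != "/type/edition":
            continue  # skip malformed lines and non-edition records
        record = json.loads(parts[4])
        publish_date = record.get("publish_date")
        if not isinstance(publish_date, str):
            continue  # no usable publish_date on this edition
        years = [int(y) for y in YEAR_RE.findall(publish_date)]
        if years and max(years) > current_year:
            yield parts[1], publish_date
```

A check like this also flags legitimate non-Gregorian years, so its output is a list of candidates to review rather than a list of dates to delete.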

Stakeholders

@mekarpeles @judec

Instructions for Contributors

  • Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to github, because pre-commit bot may add commits to your PRs upstream.

scottbarnes avatar Jun 13 '24 21:06 scottbarnes

I'm not 100% sure, but I think the record which prompted this 'delete future dates' issue is https://openlibrary.org/books/OL35994446M/Yoga_Made_Easy

Which has 2537 editions that associate a real ISBN with an incorrect title and publication date, so something else is wrong there. Changing the date to a number other than 2050 but less than 2024 doesn't necessarily seem like the best fix.

The edition has now been fixed, but the work, which still has 2000+ editions, is https://openlibrary.org/works/OL25455824W/Yoga_Made_Easy

hornc avatar Jun 13 '24 23:06 hornc

It looks like other ISBNs are caught up in whatever is going on here: https://openlibrary.org/books/OL34227389M/Yoga_Made_Easy This is a book "All about horses"

hornc avatar Jun 13 '24 23:06 hornc

@judec had just asked how many future publication dates there are. I saw that there are over 16k, and I offered to remove the future dates. I take the point that simply removing future publication dates won't solve the underlying problem, but on the other hand neither did prohibiting future dates from importing, and this would at least make things consistent. I will hold off on this though so more people have a chance to opine.

scottbarnes avatar Jun 14 '24 00:06 scottbarnes

Seems reasonable

mekarpeles avatar Jun 15 '24 17:06 mekarpeles

I've been reviewing some of these future dates. There are thousands of bad 2039 imports under Enid Blyton.

Others seem to have used 2039 as some kind of shorthand for 'printed this year'. Since being imported, they have made it onto other catalogs, where the real publication date matches their import date.

Example: https://openlibrary.org/works/OL20923702W imported in 2020 with 2039 date.

Now in WorldCat with 2020 as the publication date.

I have been resolving some future dates using the Google Books API (automated), and in other cases manually by checking WorldCat.

I'm still playing around to get a feel for the problem, but I'm resolving many at a time.

Here's another consequence of just deleting suspect dates: https://openlibrary.org/books/OL38235686M/bbbbbbbbbbbbbb?v=2

This was just a junk record to begin with.

hornc avatar Jun 18 '24 06:06 hornc

@mekarpeles @scottbarnes Definitely don't just delete future dates: we have many Hebrew calendar dates (current year 5784).

edit: ok, maybe not as many as I thought: https://openlibrary.org/search?q=publish_date%3A%5B5000+TO+5784%5D&mode=everything&sort=new
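For anyone wanting to repeat this check for other year ranges, here is a small sketch that rebuilds the same search URL. The helper name `search_url` is hypothetical; the query parameters are copied from the link above.

```python
from urllib.parse import urlencode


def search_url(first_year, last_year):
    """Build an Open Library search URL for a publish_date year range."""
    params = {
        "q": f"publish_date:[{first_year} TO {last_year}]",
        "mode": "everything",
        "sort": "new",
    }
    # urlencode percent-encodes ':', '[' and ']' and turns spaces into '+'.
    return "https://openlibrary.org/search?" + urlencode(params)
```

Calling `search_url(5000, 5784)` reproduces the link above.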

But we should still be careful :)

Is it possible some of these have been cleared already?

hornc avatar Jun 18 '24 06:06 hornc

It's certainly possible someone has removed some future dates already, but I am not sure who may have done it or when.

For anyone else looking at this in the future, dates are a bit of a minefield. Although many countries that may use non-Gregorian calendars in some contexts (Japan, Korea, PRC, ROC, etc.) tend to have Gregorian dates in the metadata we receive, not all will.

For example, we have a small number that are likely using the Thai solar calendar (e.g. https://openlibrary.org/books/OL148810M).

scottbarnes avatar Jun 18 '24 20:06 scottbarnes

"Fix" in the context of this title is not clear. I understand we may still want to make changes to future dates, both ones we may encounter in the future and ones that are already in Open Library. Let's please open very specific, small/scoped issues re: what changes we want, as currently this feels like an epic and we don't put epics on our milestones.

TL;DR if we need progress on sub-aspects of future dates, please create targeted issues we can consider for our milestone.

mekarpeles avatar Jul 01 '24 19:07 mekarpeles

I have processed edition records from all years 2025 – 9999 with scripted checking of each year, and removed the future dates from those that weren't from one of the following calendars:

  • Vikram Samvat, 57 years ahead of Gregorian, on a number of books written in Nepalese and Tibetan (/languages/nep and /languages/tib).
  • Thai calendar, 543 years ahead of Gregorian, on some Thai language books (/languages/tha). I think the same (or a similar) Buddhist calendar is used for some Khmer language books (/languages/khm). Edit: this is the family of Buddhist calendars, and the Wikipedia article lists a few more countries that use it: https://en.wikipedia.org/wiki/Buddhist_calendar
  • Hebrew calendar (current year 5784). These books are mainly in Hebrew (/languages/heb), but there were a couple printed in the US in English on religious topics that used this calendar.

Using language codes was a relatively rough way to cross check and identify non-Gregorian calendars, but it's not 100% reliable to automate. It helped manually disambiguate clusters that fell in common publishing ranges from the different calendars.
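The cross-check described above could look roughly like the sketch below. It is not the script that was actually run. The function name `guess_calendar`, the table layout, and the Hebrew offset of 3760 (taken from 5784 corresponding to 2024) are assumptions for illustration, and all offsets are approximate because each calendar's new year falls at a different point in the Gregorian year.

```python
# Approximate offsets (calendar year minus Gregorian year) paired with the
# language codes mentioned above. The Hebrew offset is an assumption.
CALENDARS = {
    "Vikram Samvat": (57, {"/languages/nep", "/languages/tib"}),
    "Buddhist (Thai)": (543, {"/languages/tha", "/languages/khm"}),
    "Hebrew": (3760, {"/languages/heb"}),
}


def guess_calendar(year, languages, current_year):
    """Return (calendar_name, gregorian_year), or None if no calendar fits.

    A calendar fits when the edition has one of its languages and the
    converted year is not in the future.
    """
    edition_langs = set(languages)
    for name, (offset, langs) in CALENDARS.items():
        gregorian = year - offset
        if langs & edition_langs and gregorian <= current_year:
            return name, gregorian
    return None
```

For example, `guess_calendar(2567, ["/languages/tha"], 2024)` suggests a Buddhist calendar date equivalent to 2024, while a 2039 date on an English-language edition returns `None`. Since language codes are only a rough signal (some English-language books use Hebrew dates), a `None` result marks a record for manual review and does not prove the date is wrong.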

hornc avatar Aug 20 '24 02:08 hornc