Fix future import dates
Problem
Based on parsing the 2024-04-30 all types dump, there are at least 16k editions with publish_date fields that have future years.
The publish_date field should be removed from these editions.
Note: imports with such dates should no longer be possible.
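For anyone wanting to reproduce the count, here is a minimal sketch of scanning dump rows for future years. It assumes each dump row is five tab-separated columns (type, key, revision, last_modified, JSON) and that `publish_date` is free text; the helper names `future_year` and `scan_dump_line` are made up for illustration and are not part of the Open Library codebase.

```python
import json
import re
from datetime import date

# A standalone 4-digit number, not part of a longer run of digits.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def future_year(publish_date, this_year=None):
    """Return the largest 4-digit year in publish_date if it is after this_year, else None."""
    if not isinstance(publish_date, str):
        return None
    this_year = this_year or date.today().year
    years = [int(y) for y in YEAR_RE.findall(publish_date)]
    if years and max(years) > this_year:
        return max(years)
    return None


def scan_dump_line(line, this_year=None):
    """Return (key, year) for an edition row with a future year, else None.

    Assumes the row layout: type, key, revision, last_modified, JSON.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 5 or parts[0] != "/type/edition":
        return None
    record = json.loads(parts[4])
    year = future_year(record.get("publish_date"), this_year)
    if year is None:
        return None
    return (record.get("key", parts[1]), year)
```

This only flags candidates; as the discussion below shows, a flagged year is not necessarily wrong, so the output is a review list rather than a delete list.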
Stakeholders
@mekarpeles @judec
Instructions for Contributors
- Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue and each time after pushing code to github, because pre-commit bot may add commits to your PRs upstream.
I'm not 100% sure, but I think the record which prompted this 'delete future dates' issue is https://openlibrary.org/books/OL35994446M/Yoga_Made_Easy
Which has 2537 editions that associate a real ISBN with an incorrect title and publication date, so something else is wrong there. Changing the date to a number other than 2050 but less than 2024 doesn't necessarily seem like the best fix.
The edition has now been fixed, but the work, which still has 2000+ editions, is https://openlibrary.org/works/OL25455824W/Yoga_Made_Easy
It looks like other ISBNs are caught up in whatever is going on here: https://openlibrary.org/books/OL34227389M/Yoga_Made_Easy. This one is actually a book called "All about horses".
@judec had just asked how many future publication dates there are, I saw that there are over 16k, and I offered to remove the future dates. I take the point that simply removing future publication dates won't solve the underlying problem, but on the other hand neither did prohibiting future dates from importing, and this would at least make things consistent. I will hold off on this though so more people have a chance to opine.
Seems reasonable
I've been reviewing some of these future dates. There are thousands of bad 2039 imports under Enid Blyton.
Others seem to have used 2039 as some kind of shorthand for 'printed this year'. Since they were imported, they have made it onto other catalogs, and the real publication date matches their import date.
Example: https://openlibrary.org/works/OL20923702W imported in 2020 with 2039 date.
Now in WorldCat with 2020 as the publication date.
I have been finding some future dates using Google Books API (automated), and manually by checking WorldCat in other cases.
I'm still playing around to get a feel for the problem, but I'm resolving many at a time.
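A rough sketch of what the automated part could look like, assuming a lookup by ISBN against the public Google Books volumes endpoint. The comment above does not say how the lookup was actually done, so the ISBN query and the helper names `volumes_url` and `published_years` are assumptions; the code only builds the request URL and reads `volumeInfo.publishedDate` from an already-fetched response.

```python
from urllib.parse import urlencode

GOOGLE_BOOKS_VOLUMES = "https://www.googleapis.com/books/v1/volumes"


def volumes_url(isbn):
    """Build a Google Books volumes query for one ISBN (assumed lookup key)."""
    return f"{GOOGLE_BOOKS_VOLUMES}?{urlencode({'q': f'isbn:{isbn}'})}"


def published_years(response):
    """Collect the years found in volumeInfo.publishedDate across all returned items."""
    years = set()
    for item in response.get("items", []):
        published = item.get("volumeInfo", {}).get("publishedDate", "")
        # publishedDate is a string such as "2020" or "2020-05-01".
        if len(published) >= 4 and published[:4].isdigit():
            years.add(int(published[:4]))
    return years
```

If `published_years` returns exactly one year and it is not in the future, that year is a candidate replacement; anything else probably still needs a manual check against WorldCat.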
Here's another consequence of just deleting suspect dates: https://openlibrary.org/books/OL38235686M/bbbbbbbbbbbbbb?v=2
This was just a junk record to begin with.
@mekarpeles @scottbarnes Definitely don't just delete future dates: we have many Hebrew calendar dates (current year 5784).
edit: ok, maybe not as many as I thought: https://openlibrary.org/search?q=publish_date%3A%5B5000+TO+5784%5D&mode=everything&sort=new
But we should still be careful :)
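For checking other year ranges, the search link above can be generated rather than hand-encoded. This is a small sketch; `publish_date_range_url` is a hypothetical helper, and the `mode` and `sort` parameters are simply copied from the link above.

```python
from urllib.parse import urlencode


def publish_date_range_url(start_year, end_year):
    """Build an Open Library search URL for a publish_date year range."""
    params = {
        "q": f"publish_date:[{start_year} TO {end_year}]",
        "mode": "everything",
        "sort": "new",
    }
    return "https://openlibrary.org/search?" + urlencode(params)
```

Calling `publish_date_range_url(5000, 5784)` reproduces the link in the comment above.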
Is it possible some of these have been cleared already?
It's certainly possible someone has removed some future dates already, but I am not sure who may have done it or when.
For anyone else looking at this in the future, dates are a bit of a minefield. Although many countries that may use non-Gregorian calendars in some contexts (Japan, Korea, PRC, ROC, etc.) tend to have Gregorian dates in the metadata we receive, not all will.
For example, we have a small number that are likely using the Thai solar calendar (e.g. https://openlibrary.org/books/OL148810M).
"Fix" in the context of this title is not clear. I understand we may still want to make changes with future dates, both ones we may encounter in the future and ones that are already in Open Library. Let's please open very specific, small/scoped issues re: what changes we want, as currently this feels like an epic and we don't put epics on our milestones.
TL;DR if we need progress on sub-aspects of future dates, please create targeted issues we can consider for our milestone.
I have processed edition records from all years 2025 – 9999 with scripted checking of each year, and removed the future dates from those that weren't from one of the following calendars:
- Vikram Samvat, 57 years ahead of Gregorian, on a number of books written in Nepalese and Tibetan (`/languages/nep` and `/languages/tib`)
- Thai Calendar, 543 years ahead of Gregorian, on some Thai language books (`/languages/tha`), and I think the same (similar?) Buddhist calendar is used for some Khmer language books (`/languages/khm`). edit: This is the family of Buddhist calendars, which lists a few more countries that use it: https://en.wikipedia.org/wiki/Buddhist_calendar
- Hebrew Calendar, current year 5784. These books are mainly in Hebrew (`/languages/heb`), but there were a couple printed in the US in English on religious topics that used this calendar.
Using language codes was a relatively rough way to cross-check and identify non-Gregorian calendars, and it isn't reliable enough to automate fully. It did help to manually disambiguate clusters that fell in common publishing ranges from the different calendars.
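The cross-check described above can be sketched roughly as follows. This is not the script that was actually run: the `CALENDARS` table and `plausible_calendars` are illustrative, the Vikram Samvat and Thai offsets are the ones quoted in the list, and the Hebrew offset of 3760 is an approximation (the exact value is 3760 or 3761 depending on the time of year).

```python
# Year offsets relative to Gregorian and the language keys used as a rough hint.
CALENDARS = {
    "vikram_samvat": {"offset": 57, "languages": {"/languages/nep", "/languages/tib"}},
    "buddhist": {"offset": 543, "languages": {"/languages/tha", "/languages/khm"}},
    "hebrew": {"offset": 3760, "languages": {"/languages/heb"}},  # approximate
}


def plausible_calendars(year, language_keys, this_year):
    """Return (calendar, gregorian_year) pairs that could explain a future-looking year.

    A calendar qualifies when one of the edition's languages matches and the
    converted year is positive and not later than this_year.
    """
    matches = []
    for name, info in CALENDARS.items():
        if not info["languages"] & set(language_keys):
            continue
        converted = year - info["offset"]
        if 1 <= converted <= this_year:
            matches.append((name, converted))
    return matches
```

For example, `plausible_calendars(2563, ["/languages/tha"], 2024)` returns `[("buddhist", 2020)]`, while an English-language edition dated 2039 returns an empty list. An empty result means no known calendar explains the year; it does not catch cases like the English-language books using the Hebrew calendar, which is why a manual pass was still needed.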