Invalid Tables of Contents with json encoding

Open mekarpeles opened this issue 8 years ago • 13 comments

e.g. https://openlibrary.org/books/OL15978767M/This_strange_new_feeling

{'level': 0, 'label': '', 'pagenum': '', 'title': 'This strange, new feeling'} {'level': 0, 'label': '', 'pagenum': '', 'title': 'Where the sun lives'} {'level': 0, 'label': '', 'pagenum': '', 'title': 'A Christmas love story.'}

mekarpeles avatar Dec 20 '17 00:12 mekarpeles

Introduced by Workbot November 5, 2011: https://dev.openlibrary.org/books/OL15978767M/This_strange_new_feeling?b=9&a=8&_compare=Compare&m=diff

LeadSongDog avatar Dec 20 '17 01:12 LeadSongDog

@mekarpeles Here's another case: https://openlibrary.org/books/OL5582571M/Understanding_physics.?b=4&a=3&_compare=Compare&m=diff

LeadSongDog avatar Jan 23 '18 19:01 LeadSongDog

@mekarpeles One more, dated Oct 17, 2011: https://openlibrary.org/books/OL19395611M/Krokotiili?b=4&a=3&_compare=Compare&m=diff

LeadSongDog avatar Jun 29 '18 19:06 LeadSongDog

Here's the JSON for the bad rev (which is slightly different from the first example):

[
  {
    title: "{'level': 0, 'label': '', 'pagenum': '', 'title': 'Kroktiili'}",
    label: "",
    pagenum: "",
    level: 0
  },
  {
    title: "{'level': 0, 'label': '', 'pagenum': '', 'title': u'Ilke\xe4 tapaus'}",
    label: "",
    pagenum: "",
    level: 0
  },
  {
    title: "{'level': 0, 'label': '', 'pagenum': '', 'title': u'Heikko syd\xe4n.'}",
    label: "",
    pagenum: "",
    level: 0
  }
],

They should be pretty easy to find and fix. I can at least do a quick survey to see how prevalent the problem is. I doubt it's very common.

tfmorris avatar Jun 29 '18 23:06 tfmorris

I found a bunch of these in Dostoyevsky editions. One record was done twice: https://openlibrary.org/books/OL16388190M/The_eternal_husband_and_other_stories?b=6&a=5&_compare=Compare&m=diff and then: https://openlibrary.org/books/OL16388190M/The_eternal_husband_and_other_stories?b=8&a=7&_compare=Compare&m=diff

LeadSongDog avatar Jul 04 '18 13:07 LeadSongDog

@mekarpeles @tfmorris I've now seen a lot of these. They have a few things in common:

  1. They were all created by ImportBot (I think in 2011)
  2. They all contain non-ASCII characters in the table of contents.
  3. They were all correctly spelled in the MARC records and the pre-revision edition TOC.
  4. The post-revision TOC escaped the non-ASCII characters, so Ö became \xd6, e.g. https://openlibrary.org/books/OL14000990M/The_collected_works_of_Henrik_Ibsen?b=8&a=7&_compare=Compare&m=diff
  5. They can be rescued by copy-pasting the TOC from the MARC record (field 505).
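
To illustrate point 4: the `u'...'` prefix and `\xe4`-style escapes in the bad revisions look like the output of Python 2's `repr()` on a unicode string. That reading is an assumption based on the data shown in this thread, not something confirmed here. Under that assumption, a minimal Python 3 sketch of how the escaping arises and how it can be reversed (the helper names are hypothetical):

```python
import ast


def escape_like_repr(text):
    # Python 3's ascii() escapes non-ASCII characters the way Python 2's
    # repr() did (minus the leading u), e.g. "Ö" becomes the literal '\xd6'.
    return ascii(text)


def unescape_repr(literal):
    # ast.literal_eval safely parses a Python string literal, including
    # the legacy u'...' prefix, and decodes the \x escapes it contains.
    return ast.literal_eval(literal)


escaped = escape_like_repr("Ö")
restored = unescape_repr("u'Ilke\\xe4 tapaus'")
```

Here `escaped` holds the six characters `'\xd6'` and `restored` is the readable title with ä restored.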

LeadSongDog avatar Jul 24 '18 21:07 LeadSongDog

I think part of the context/reason for this is that Bookreader is able to pull tables of contents from the corresponding OpenLibrary book json.

Either way, we should be displaying the tables of contents to users in a human-readable way, e.g. like the Google Books UI (screenshot: table-of-contents)

(and not break bookreader's seek-to-chapter feature)

mekarpeles avatar Oct 05 '18 00:10 mekarpeles

@hornc, if you happen to know:

  1. Is there a way to look through the data dumps and see how many books are affected?
  2. And if so, is there a way to easily iterate over affected editions and fix them?
  3. Is there still a bug that could be causing these errors to happen?

mekarpeles avatar Dec 13 '19 01:12 mekarpeles

The problem with these records is that the toc_item's title should be a string, but instead is a JSON object (which almost duplicates the toc_item).

# command to list these from a data dump:
zgrep '"table_of_contents": \[[^\]]*"title": "{' ol_dump_editions_2024-03-31.txt.gz

There are 1364 editions with this problem in the ol_dump_editions_2024-03-31.txt.gz dump.

hornc avatar Sep 25 '24 20:09 hornc

Some of these errors have multiple levels of title nesting; this example has three levels: https://openlibrary.org/books/OL13734513M/Samlede_skuespil

hornc avatar Sep 25 '24 21:09 hornc

I have fixed these by pulling the title and page number out of the inner JSON and writing them to the correct place.

There were also some very broken TOCs with single characters split into multiple entries. I couldn't see an obvious way to repair this class, so I simply removed them.

Examples of fixes:

  • https://openlibrary.org/books/OL15139312M/The_nineteen_tragedies_and_fragments_of_Euripides?_compare=Compare&b=9&a=8&m=diff
  • https://openlibrary.org/books/OL249766M/Heinrich_Heine's_gesammelte_Werke?_compare=Compare&b=11&a=10&m=diff
  • https://openlibrary.org/books/OL14032575M/Orations_delivered_at_Harvard_College_1848-1853_i.e._1842-1853?_compare=Compare&b=6&a=5&m=diff
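
The kind of repair described above (pull the title and page number out of the inner object, repeating for multiple nesting levels) could be sketched roughly as follows. This is not the script that was actually run. It assumes the nested title is a Python dict repr starting with `{'level'`, as in the examples earlier in this thread, and `unnest_toc_item` is a hypothetical name.

```python
import ast


def unnest_toc_item(item, max_depth=10):
    """Return a copy of a TOC item with a nested dict-repr title unwrapped."""
    fixed = dict(item)
    for _ in range(max_depth):
        title = fixed.get("title", "")
        # Stop as soon as the title no longer looks like a stringified item.
        if not (isinstance(title, str) and title.startswith("{'level'")):
            break
        try:
            inner = ast.literal_eval(title)  # safe parse, also decodes \x escapes
        except (ValueError, SyntaxError):
            break  # leave unparseable titles for manual review
        if not isinstance(inner, dict) or "title" not in inner:
            break
        fixed["title"] = inner["title"]
        # Keep outer values when present; otherwise take them from the inner item.
        for field in ("pagenum", "label"):
            if not fixed.get(field) and inner.get(field):
                fixed[field] = inner[field]
    return fixed


bad = {
    "title": "{'level': 0, 'label': '', 'pagenum': '', 'title': u'Heikko syd\\xe4n.'}",
    "label": "",
    "pagenum": "",
    "level": 0,
}
good = unnest_toc_item(bad)
```

Titles that fail to parse are returned unchanged, which keeps the badly broken single-character TOCs out of the automatic path.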

hornc avatar Sep 25 '24 23:09 hornc

Thanks for tackling this cleanup! I didn't do a comprehensive survey, but it looks like one of the works mentioned in the bug report was missed: https://openlibrary.org/books/OL14000990M/The_collected_works_of_Henrik_Ibsen?b=8&a=7&_compare=Compare&m=diff

Rather than attempting to repair the TOCs, I think it might be easier to just re-import them. For example, your (@hornc) first example was corrupted multiple times during work merges, but the original MARC 505 is still available at https://openlibrary.org/show-records/marc_miami_univ_ohio/allbibs0116.out:2619524:1546

tfmorris avatar Sep 26 '24 08:09 tfmorris

@tfmorris thanks for spotting the missing records -- there was an error in my regex for finding them.

I've just found about 8000 more records that need TOC clean up!

I had an unnecessary backslash trying to escape a ] which didn't need it. The correct command to locate these records from a data dump is:

 zgrep '"table_of_contents": \[[^]]*"title": "{' ol_dump_editions_2024-03-31.txt.gz
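
For anyone wondering why the stray backslash mattered: in a POSIX bracket expression a backslash is a literal character, so `[^\]]*` parses as `[^\]` (one character that is not a backslash) followed by `]*`. The sketch below re-expresses both grep patterns in Python's `re` syntax to show the difference. The sample lines are invented, and the translation to Python syntax is my own.

```python
import re

# Python-syntax equivalent of the earlier grep pattern: after "[", exactly one
# non-backslash character, then zero or more "]", then the title key.
BUGGY = re.compile(r'"table_of_contents": \[[^\\]\]*"title": "\{')

# Python-syntax equivalent of the corrected grep pattern: after "[", any run
# of characters other than "]", then the title key.
FIXED = re.compile(r'"table_of_contents": \[[^\]]*"title": "\{')


def matches(pattern, line):
    return pattern.search(line) is not None


# Invented samples: "title" as the first key of the first item vs. a later key.
title_first = '"table_of_contents": [{"title": "{\'level\': 0}", "level": 0}]'
level_first = '"table_of_contents": [{"level": 0, "title": "{\'level\': 0}"}]'

results = {
    "buggy/title_first": matches(BUGGY, title_first),
    "buggy/level_first": matches(BUGGY, level_first),
    "fixed/title_first": matches(FIXED, title_first),
    "fixed/level_first": matches(FIXED, level_first),
}
```

Under this reading, the earlier pattern only hit records where `"title"` was the very first key of the first TOC item, which would be consistent with the large number of records found afterwards.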

I'll test how re-importing works, and whether they all have contents in the source data. I think I'd have to delete table_of_contents and re-import, and even then I'm not 100% sure TOCs get written when we match. I thought there was some level of selection on which fields can get added.

hornc avatar Sep 26 '24 21:09 hornc

The above regex picks up a lot of false positives (tables of contents with { somewhere in the plain text). A stricter pattern that anchors on {'level:

zgrep "\"table_of_contents\": \[[^]]*\"title\": \"{'level" ol_dump_editions_2025-01-08.txt.gz

This catches real examples that were missed the first time, e.g. https://openlibrary.org/books/OL2714871M/Five_Complete_Novels_of_Murder_and_Detection
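
An alternative to regex matching is to parse each record and inspect the TOC entries directly, which sidesteps false positives from braces elsewhere in the text. A minimal sketch, assuming the dump is tab-separated with the record JSON in the fifth column and the edition key in the second (worth verifying against the actual dump file); `find_nested_toc_editions` is a hypothetical helper, not an existing Open Library function.

```python
import json


def find_nested_toc_editions(lines):
    """Yield keys of editions with a TOC title that looks like a dict repr."""
    for line in lines:
        # Assumed layout: type, key, revision, last_modified, JSON.
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 5:
            continue
        try:
            record = json.loads(parts[4])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        for entry in record.get("table_of_contents") or []:
            title = entry.get("title") if isinstance(entry, dict) else None
            # Same signature the single-quote regex looks for.
            if isinstance(title, str) and title.startswith("{'level"):
                yield parts[1]
                break  # report each edition once
```

To run it over a real dump, pass it a text-mode file handle such as `gzip.open(path, "rt", encoding="utf-8")`.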

hornc avatar Aug 18 '25 04:08 hornc

Fixed up the remaining TOC issues located with the single-quote regex.

hornc avatar Aug 18 '25 05:08 hornc