
Import Wikisource trusted book provider data

Open pidgezero-one opened this issue 1 year ago • 14 comments

Problem

Follow-up to https://github.com/internetarchive/openlibrary/issues/8545. Currently there are 60 books in OL that have Wikisource IDs. IDs are formatted as langcode:title (e.g. en:George_Bernard_Shaw). Import Wikisource works into Open Library.

https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing

Breakdown

  • [x] @pidgezero-one Create a script which implements the proposal below to get WikiSource data and coerce it into Open Library's import format
  • [x] @cdrini Verify a sample of ~10 of the resulting books
  • [x] Manually verify the extracted books
  • [ ] Both: Run bulk import

Proposal & Constraints

Hit English Wikisource's API and paginate through result sets of hits that fall under the "Validated texts" category: https://en.wikisource.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Validated_texts&gcmlimit=500&prop=categories|info|revisions&rvprop=content&rvslots=main&format=json&cllimit=max
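A minimal sketch of that pagination loop (using `requests`; the query parameters are the documented MediaWiki ones, the function name is ours):

```python
import requests

API_URL = "https://en.wikisource.org/w/api.php"

def iter_validated_pages():
    """Yield page dicts from Category:Validated texts, following API continuation."""
    params = {
        "action": "query",
        "generator": "categorymembers",
        "gcmtitle": "Category:Validated_texts",
        "gcmlimit": 500,
        "prop": "categories|info|revisions",
        "rvprop": "content",
        "rvslots": "main",
        "cllimit": "max",
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        yield from data.get("query", {}).get("pages", {}).values()
        if "continue" not in data:
            break
        # The API signals more results with a "continue" object; merging its
        # tokens into the next request's params fetches the following batch.
        params.update(data["continue"])
```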

The response includes documents that aren't books, and books are not flagged with a distinct category. We may also have to browse Wikisource's API to manually draft a list of categories whose members we should ignore, such as Subpages (individual chapters of books), Posters, Songs, etc.
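For illustration, a category-based exclusion filter could look like the sketch below; the ignore list is a guess, not a vetted set:

```python
# Illustrative guesses only; the real list would be drafted by browsing
# Wikisource's category tree.
IGNORED_CATEGORIES = {
    "Category:Subpages",
    "Category:Posters",
    "Category:Songs",
}

def looks_like_book(page: dict) -> bool:
    """Reject any page that belongs to a known non-book category."""
    categories = {c["title"] for c in page.get("categories", [])}
    return not (categories & IGNORED_CATEGORIES)
```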

The response includes most of the work's metadata (author, year, etc) as wiki markup of the page's infobox. Consider using a library like wptools to parse it.
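A hedged sketch with wptools (whether it recognizes Wikisource's header templates as an "infobox" as reliably as Wikipedia's needs verifying; the title here is just an example):

```python
import wptools

page = wptools.page("The_Time_Machine", wiki="en.wikisource.org")
page.get_parse()  # fetches the page and parses its wiki markup
infobox = page.data.get("infobox") or {}
print(infobox.get("author"), infobox.get("year"))
# Fallback: page.data["wikitext"] holds the raw markup if no infobox is found.
```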

In the future, we will want to expand this import to support other languages besides en.wikisource.org, and perhaps expand beyond the Validated texts category, so the solution to this should be extensible.

A potential downside to how we derive Wikisource IDs is that we use the page title, which is modifiable, instead of the page's canonical ID, and that leaves us at the mercy of Wikisource's works being moved or renamed. This will likely be a rare (if ever) occurrence, but if we ever decide we want to use canonical IDs instead, any Wikisource item can be fetched by curid. Example: https://en.wikisource.org/?curid=4496925 and https://en.wikisource.org/wiki/%22Red%22_Fed._Memoirs are the same page. (This is also an example of a page whose title may need to be URL-encoded in outbound links.)
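Both directions are cheap to support; a sketch of resolving a title to its stable pageid and building a safely encoded outbound link (helper names are ours):

```python
import requests
from urllib.parse import quote

API_URL = "https://en.wikisource.org/w/api.php"

def pageid_for_title(title: str):
    """Return the canonical numeric pageid for a title, or None if missing."""
    data = requests.get(API_URL, params={
        "action": "query", "prop": "info", "titles": title, "format": "json",
    }).json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("pageid")  # absent for missing pages

def outbound_url(title: str) -> str:
    # Titles may contain quotes, spaces, etc. that must be percent-encoded.
    return "https://en.wikisource.org/wiki/" + quote(title.replace(" ", "_"))

print(pageid_for_title('"Red" Fed. Memoirs'))  # 4496925, per the example above
print(outbound_url('"Red" Fed. Memoirs'))
```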

Leads

Stakeholders

@cdrini @pidgezero-one


pidgezero-one avatar Aug 01 '24 01:08 pidgezero-one

I have a few questions about this feature:

  • it's not completely clear to me whether the en:George_Bernard_Shaw id is really a portable identifier or just a URL equivalent (wiki + page title, which can change), or how it can be used to compare with other data sources that might list a 'Wikisource identifier'. The numeric ids look more like identifiers, but they are also specific to each language's Wikisource, so there really isn't a single 'Wikisource identifier': 112842 is the 'George Bernard Shaw' book on en-wikisource, but it's something completely different on Ukrainian Wikisource.
  • Determining what is a 'book' on Wikisource does seem complicated, and it isn't clearly defined. Pages on Wikisource appear to represent 'Works', but are generally expected to have a source published Edition -- I don't know if the edition can be changed in principle. I think that means Wikisource is not a publisher, so Wikisource will not be the only source for these books.
  • From the examples I have seen, the Wikisource scans appear to originate on archive.org, which should imply there are already Open Library records. The example https://en.wikisource.org/wiki/George_Bernard_Shaw has an archive.org id of https://archive.org/details/cu31924013547645 , which is in the metadata. A Ukrainian example https://uk.wikisource.org/wiki/%D0%A4%D0%B0%D0%B2%D1%81%D1%82 doesn't appear to mention or link to archive.org at all, but the scan that is on Wikisource seems to be this one from IA: https://archive.org/details/favsttragediia01goet If there is a built-in workflow relationship with Wikisource and archive.org already, there might be a more direct way to close the loop and associate the identifiers?

Knowing whether the main value of this feature is to:

  • get more books into Open Library that OL does not have
  • close the loop on associating Wikisource pages with already existing records in OL
  • support some other Wikisource-related workflow

would possibly help focus effort.

Some Wikisource texts appear to come from Project Gutenberg texts, and that makes me worry about some of the lack-of-provenance issues such PD texts might have. I'm not 100% sure how we handle Project Gutenberg texts on OL: are they their own editions, do they change over time? That's probably a different topic though.

hornc avatar Nov 27 '24 23:11 hornc

  • it's not completely clear to me whether the en:George_Bernard_Shaw id is really a portable identifier or just a URL equivalent (wiki + page title, which can change), or how it can be used to compare with other data sources that might list a 'Wikisource identifier'. The numeric ids look more like identifiers, but they are also specific to each language's Wikisource, so there really isn't a single 'Wikisource identifier': 112842 is the 'George Bernard Shaw' book on en-wikisource, but it's something completely different on Ukrainian Wikisource.

I don't love the lang:title identifier format, personally. In the script in my open PR, I originally tried to use the numeric ID like the one you've identified. I stuck with lang:title here for two reasons: the lesser one is that I couldn't get the numeric identifier to resolve to the outbound links in the download options section for Wikisource books, and the greater one is that it's already the identifier format that the small selection of existing Wikisource books in OL use (same example).
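For what it's worth, the lang:title convention is at least trivial to construct and split (a sketch; helper names are illustrative):

```python
def make_wikisource_id(lang: str, title: str) -> str:
    return f"{lang}:{title.replace(' ', '_')}"

def split_wikisource_id(ws_id: str) -> tuple:
    # Split on the first colon only; titles themselves may contain colons.
    lang, _, title = ws_id.partition(":")
    return lang, title

assert make_wikisource_id("en", "George Bernard Shaw") == "en:George_Bernard_Shaw"
assert split_wikisource_id("en:George_Bernard_Shaw") == ("en", "George_Bernard_Shaw")
```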

Determining what is a 'book' on Wikisource does seem complicated, and it isn't clearly defined. Pages on Wikisource appear to represent 'Works', but are generally expected to have a source published Edition -- I don't know if the edition can be changed in principle. I think that means Wikisource is not a publisher, so Wikisource will not be the only source for these books.

Wikisource/Wikidata not explicitly differentiating what counts as a "book" has been a real thorn in my side. For example, I don't know if this counts as a book for the purposes of OL, but its Wikidata properties don't really distinguish it from what would be considered a book.

get more books into Open Library that OL does not have

This was my understanding of the main purpose here when Drini and I were first discussing the project. I've been writing the import record script with the understanding that we'd like to import items from more Wikisource language bases than just English in the future.

pidgezero-one avatar Nov 28 '24 00:11 pidgezero-one

I share @hornc 's concerns and would like to see this much more tightly specified.

Import Wikisource works into Open Library.

is a pretty terse description of a request which could take a variety of different forms.

Wikisource is mostly made up of transcriptions of specific editions (not works), although, as @hornc points out PG editions are a bit of a wild card because they are editions without any provenance information which are intentionally unassociated with existing editions.

Is the intention to create new digital editions for the transcriptions which are derived from the original edition? Or is the intention just to make the transcription some type of digital proxy for the original edition? Wikisource, as with most things wiki*, seems a bit ambiguous, but seems to lean towards the latter model (ie they include a link to the Wikidata entity for the transcribed edition, but don't model the transcription separately).

Complicating this is the fact that Wikidata is generally poor at modeling book metadata. It's not a huge deal because it doesn't have much in it, but some of the logical conflicts you'll see include:

  • works with ISBNs (which can only be associated with editions)
  • works with OL edition identifiers (often added by bots based on the ISBNs above)
  • works/editions with both OL work and edition identifiers (which is never legal)

Using or linking to one of these conflated entities extends the mess because the new connection usually requires (or implies) either an edition or a work, but not both.

I would suggest that Wikisource transcriptions should actually be modeled independently from the editions that they transcribe, but that would require the buy-in/support of both the Wikisource and Wikidata communities. Certainly, if OL considers exact digital facsimiles from CreateSpace, etc., to be separate editions, then a transcription would definitely be considered a separate edition (but Open Library's data model isn't rich enough to connect the two derived editions together, as far as I know).

Has anyone looked at how many of the transcribed editions are NOT already in OpenLibrary? My assumption is that the vast majority of them are, so perhaps focusing on @hornc 's suggestion of closing the loop on IA/OL editions would be a good place to start.

For example, I don't know if this counts as a book for the purposes of OL, but its Wikidata properties don't really distinguish it apart from what would be considered a book.

I would consider it a transcribed derivative of OL23268596M / ia:addresstomaryade00scot which was authored by Q16944048 (no associated OL ID in Wikidata, but appears to be OL6627737A). Given that OL & IA each have (separate) catalog records with the metadata, and IA has scanned page images as well as OCR'd text, what is expected to be derived from Wikisource? Just a link, or an alternative text version, or some set of metadata, or ... ? It might be tempting to infer equivalence of author IDs, but that seems risky absent evidence other than co-occurrence.

tfmorris avatar Nov 28 '24 23:11 tfmorris

I will leave this open as an epic targeting this work, to be closed when we do run a bulk import step. Currently the code has been merged and is undergoing some final verification/checks before deciding on next steps.

I will read through some of the comments/concerns raised here later today and respond; @pidgezero-one and I have largely discussed many of these concerns already :)

cdrini avatar Dec 05 '24 17:12 cdrini

On the identifier form, en:George_Bernard_Shaw: I understand the concerns, but I think you've reached the same conclusion that I reached, which is that this is the only identifier-like format that works across language-specific wikisource pages. It is also officially supported by their URL schemes: eg https://wikisource.org/wiki/en:The_Annotated_Strange_Case_of_Dr_Jekyll_and_Mr_Hyde . This decision was made in #8545 .

On "What is a book on WikiSource": This is a big concern; +1 @pidgezero-one 's response. Her extensive work in #9674 specifically targeting this problem is we believe sufficient at filtering out non-book items, like letters, press releases, decrees, etc, but the approach can always be improved. Regardless, the next step of this process (added a checkbox above) is to go through a random sample of the extracted books and manually verify the error rate.

On new editions/works/publishers: These were concerns that others had raised as well, and I raised them during our community call a few months ago to get more voices in the discussion. There we landed on creating separate editions for them, with the publisher set to "WikiSource" as well as the original publisher. This is in line with how we treat Project Gutenberg and Standard Ebooks. WikiSource is a subtly greyer area than Project Gutenberg and Standard Ebooks, but I think the work required to create a WikiSource book, coupled with the difference between the original and the WikiSource book, is sufficient to warrant a new edition record. +1 @tfmorris 's suggestion that at some future point we would be able to link these editions as "derived".

On motivations of this feature: The motivation is twofold: (1) have more good-quality books in Open Library that people can read in a wide variety of formats. WikiSource books have highly accurate EPUBs, PDFs, etc, which are great for readers looking to read on their phones or ereaders, and which are better than eg IA's auto-OCR'd books in the same formats. (2) Support a mission-aligned website doing great work in the book space. I love being able to drive traffic to WikiSource since they're a great project :) And it's open, so Open Library contributors can also go contribute on WikiSource if they so wish!

cdrini avatar Dec 05 '24 18:12 cdrini

" we landed on creating separate editions for them"

That wasn't specified in the feature / issue description, so I wasn't particularly reviewing with that idea in mind, and it's not clear to me that the implementation that was merged does that either. It's also not clear that that is the best provider of value for the intended feature. Where and when was the appropriate time to comment on that? I think the development showed that it's a bit more complicated than that, so not only was the decision not documented, it was not really clarified, so it's still unclear what the implementation needs to do in all the likely encountered cases.

It seems like whether a new edition is created will depend on the details of the existing matching process and the input data provided -- I think most relevantly on what publication date is supplied. In the absence of any concrete examples of what was expected on an import, or of what happens on an import with the current script, it is hard to evaluate on a simple technical/mechanical level.

Without a clear usecase that deals with a specific value to an Open Library patron leveraging Open Library and Wikisource, it is hard to evaluate whether the code satisfies the feature.

I don't think this is particularly fair to developers or reviewers.

Another potential comment I might have made on #9674 is that the feature looks like it could be implemented as an external script which makes use of existing import endpoints (JSON, or the new /bulk/ submission endpoint) -- the current implementation adds external modules to requirements.txt which will be installed on production and in every development Docker container. Where and when should I raise those as potential things to think about? (module version maintenance overhead, security footprint concerns (recently a relevant issue!), and container bloat) I don't have a huge problem with this specific case, but I don't think it's sustainable to have every OL-related script stored in /scripts/ expanding the production requirements for scripts that may or may not run in production. Given the lack of a high-level description of the feature, maybe I'm missing some context and there is only one way to do this.

hornc avatar Dec 05 '24 23:12 hornc

@hornc I admit I'm a bit overloaded with all the issues at play currently, though I have seen a few of your comments come up and wanted to offer some of my time during an upcoming books engineering call if I can be helpful listening and sharing what I know on some of these features (both so I can better understand and represent your perspective, and to add what I can re: what perspectives other folks may have). Apologies, sometimes these things can be tougher over text when we don't often have a chance to meet synchronously for a call!

Thank you both @hornc, @cdrini, @pidgezero-one, et al for trying to move things forward productively <3

mekarpeles avatar Jan 13 '25 00:01 mekarpeles

Ok! I think all the code has now been updated accordingly.

@hornc:

  • Creating a new edition is a tricky one... It's still one I flip on repeatedly! I'm not strongly convinced either way, I'm afraid! Apologies that you felt there wasn't a window of time to provide feedback on this point; it's difficult to know when/where to ask for that feedback. We asked for feedback during the community call with our head librarian and fell into agreement. But I do value your opinion on issues like these, so will try to remember to ping on GitHub to make sure folks here can also weigh in.
  • @pidgezero-one responded to your feedback, and moved the wikisource-specific imports into a separate file :+1: For future reference, feel free to leave feedback like this on the PR! I think Wiki-related sites are pretty unique; I imagine very few other TBPs will require custom python packages for processing their data.
  • @pidgezero-one updated the import endpoint to handle these correctly and create a new edition; this also lays the foundations for unblocking #9372 , which is blocked on a similar problem of edition mismatch.
  • As for the use case, I feel I defined that pretty clearly in my previous comment under "motivation" point 1, so I'm not too sure what to add there; let me know if you have any questions about what I've written, or if anything is unclear.

That said, here's the current next plan of action! We audited and added more checks to avoid importing non-book items. I personally went through a random sample of 100 rows and found 90% were valid, which is pretty good, but at that rate ~100 non-book items would be imported. So I went through the full 1000 and manually verified them; I found 91 non-book items (mostly letters, addresses, excerpts from other books, etc), and so the remaining ~909 are valid. Ideally we update the corresponding wikidata metadata for these 91, so that our instance of/subclass of filters will catch them.
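A sketch of what such an instance-of check might look like against Wikidata's entity data (the class QIDs below illustrate the idea; they are not the project's actual filter list):

```python
import requests

BOOK_LIKE_CLASSES = {
    "Q571",      # book
    "Q7725634",  # literary work
    "Q3331189",  # version, edition or translation
}

def instance_of(qid: str) -> set:
    """Return the P31 ('instance of') class QIDs for a Wikidata entity."""
    data = requests.get(
        f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    ).json()
    entity = next(iter(data["entities"].values()))  # robust to QID redirects
    return {
        claim["mainsnak"]["datavalue"]["value"]["id"]
        for claim in entity.get("claims", {}).get("P31", [])
        if claim["mainsnak"].get("snaktype") == "value"
    }

def is_probably_book(qid: str) -> bool:
    return bool(instance_of(qid) & BOOK_LIKE_CLASSES)
```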

The plan is to do a set of 10 today to kick the tires, share results, and next Thursday kick off the full import! Assuming no show stoppers :)

cdrini avatar May 29 '25 15:05 cdrini

One more note to add! In addition to moving the WS parser libraries into a standalone file so they're only used during import record generation, I also moved the name-parsing logic so that it only applies to authors and contributors whose names appear in unstructured Wikisource infoboxes where the corresponding Wikidata item has incomplete information (because, as Charles pointed out, Wikidata's information for author entities is already well structured and formatted and doesn't need to go through additional processing). I didn't see this happen in a ton of records, so it's just there to account for that edge case.

pidgezero-one avatar May 29 '25 15:05 pidgezero-one

And here is the 10 sample: https://openlibrary.org/import/batch/1505

Oh, those MODIFIEDs are because the two PRs from last week haven't been deployed yet :P Will do another ten after the deploy today goes out.

cdrini avatar May 29 '25 16:05 cdrini

Deploy was a bit delayed, kicking off the sample batch now: https://openlibrary.org/import/batch/1507

Notes:

  • ✅ It is correctly creating new editions
  • ✅ It is correctly not creating works (unless there are differences in title/etc)
    • Most of the time; failed here: https://openlibrary.org/search?q=Hints+towards+peace+in+ceremonial+matters&mode=everything . But I reckon it's a bug not related to this code.
  • ✅ It is correctly not creating authors
  • ❌ It is not uploading covers... eg https://openlibrary.org/books/OL59258268M/Castaway_on_the_Auckland_Isles
  • ❌ One bug: not handling the special SPARQL value for "unknown value" https://openlibrary.org/books/OL59258274M/A_Letter_on_Pauperism_and_Crime . I checked the validated dump though, and there are no other occurrences of this, so not a blocker.

Overall: No show-stoppers! And working quite well :) Very pleased/impressed that it didn't create a single new author!! Woohoo!! (Well, except for the bizarre "unknown value" one :P)

On deck for next Thursday :+1:

cdrini avatar Jun 05 '25 16:06 cdrini

It's great that you're doing test imports, checking the data quality, and looking for feedback. I hope that becomes standard practice for imports, with established minimums for data quality.

✅ It is correctly not creating works (unless there are differences in title/etc)

I don't understand this assertion when the very first item in that batch created a 15th (!) duplicate work for something which already had 14 duplicate work records. There are also already several records for the 1904 edition which was imported. https://openlibrary.org/search?q=title%3A+%22Highways+and+Byways+in+Sussex%22&mode=everything

There have been a half dozen duplicate author records associated with this work alone created since I last merged 11 of them back in 2022 (in two separate batches), but, in total, there are DOZENS of duplicate author records across all his works. https://openlibrary.org/search/authors?q=e*+v*+lucas&mode=everything

[duplicate avoidance] failed here: https://openlibrary.org/search?q=Hints+towards+peace+in+ceremonial+matters&mode=everything . But I reckon it's a bug not related to this code.

That edition record is missing a publisher, which could be part of the problem, but I would expect publisher and publication date to be included in the minimum required metadata fields before even attempting an import.

❌ It is not uploading covers... eg https://openlibrary.org/books/OL59258268M/Castaway_on_the_Auckland_Isles

A transcription doesn't have a cover, does it? Any covers should presumably get added to the transcribed edition (and IA already has this cover image, but the title page was chosen instead).

One bug: not handling the special SPARQL value for "unknown value" https://openlibrary.org/books/OL59258274M/A_Letter_on_Pauperism_and_Crime . I checked the validated dump though, and there are no other occurrences of this, so not a blocker.

This also created a duplicate work. https://openlibrary.org/search?q=title%3A+%22Letter+on+Pauperism+and+Crime%22&mode=everything

In addition to "Unknown value," there is also "No value" and perhaps a few others.
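For the record, in WDQS JSON results "unknown value" surfaces as a skolemized blank node under the .well-known/genid/ namespace, while "no value" typically yields no binding at all in truthy queries, so a guard needs to cover both (a sketch; the helper name is ours):

```python
GENID_PREFIX = "http://www.wikidata.org/.well-known/genid/"

def clean_binding(row: dict, var: str):
    """Return a usable value from one SPARQL result row, or None."""
    binding = row.get(var)
    if binding is None:                 # "no value", or simply unbound
        return None
    value = binding.get("value", "")
    if value.startswith(GENID_PREFIX):  # "unknown value" placeholder
        return None
    return value
```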

Another bug not mentioned: none of the source record links work. Since they're rendered in a larger font than the actual link to Wikisource, they attract the eye first.

I've only reviewed the items that @cdrini highlighted, but will try to find time to review the rest of the batch. Having said that, at first blush, it certainly seems like there is some additional work required.

tfmorris avatar Jun 05 '25 17:06 tfmorris

Looking at the imported https://openlibrary.org/books/OL59258268M/Castaway_on_the_Auckland_Isles record, the first thing I noticed was that the source record links don't work, as @tfmorris mentioned.

I think there was some discussion about using wikisource URLs as 'identifiers' when this feature was first discussed. Seeing an example record, I think:

  • the Source record should point to the Wikisource record, where Wikisource links to https://en.wikisource.org/wiki/Main_Page and record links to the specific record, matching the other sources like MARC records and IA records.
  • the Wikidata id should be under the Edition Identifiers, as it is an identifier

No need to duplicate links in both places, and this seems clearer. The value which functions as an identifier is stored as an identifier, and the source URL is stored as a source, which sidesteps all the issues around Wikisource URLs not really being identifiers.

hornc avatar Jun 05 '25 22:06 hornc

  • On duplicate works/authors: I'm mainly checking for issues related to this specific case that might cause rampant duplicates to be created. The second batch, where the import changes were deployed, showed no duplicate works or authors. Of course I expect there will be duplicates, but in line with our current baseline. Unfortunately, I would consider improving that baseline out of scope for this issue. But I'm hoping that the added visibility proposed by #10782 will help on that front.
  • On covers: If a cover is specified in Wikisource itself, then it is meant to be imported. I consider the inclusion of the cover to be an editorial decision Wikisource makes, so if they include it, then I think we should too. If they don't have one, in some ideal universe a screenshot of the Wikisource "title page" would be the ideal alternative. But there appears to be a bug preventing the specified cover URL from being imported.
  • On publishers: Good catch! I'm not sure how prevalent that is in the data, but they will all have "Wikisource", which I think is sufficient.
  • On source record links: Good catch! Will create a separate issue to fix that; I don't think that's a blocker for importing. Also agree the design of these needs to reduce their visual weight :P
  • On source records and identifiers: My thinking has been as follows: identifiers point to equivalent book records in other catalogues/data sets; source records denote where the data imported into this OL edition record came from. So in this case, because data was imported from both Wikisource and Wikidata, I think listing both as source records makes sense. However, because the wikidata identifier actually identifies the original source book, not the Wikisource derivative edition, I do not think adding the wikidata ID to the new edition record is appropriate; I think it should be added to the corresponding source record in OL. The wikisource ID should be added.

That last point has been one of the pieces that's made it difficult for me to decide whether creating new editions is indeed the best path forward here!
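To make the distinction concrete, the intended shape would be something like the sketch below (a hypothetical record, not output of the actual script; field names follow OL's import schema as I understand it):

```python
# Hypothetical example of the proposed policy: both wikis as source records,
# only the Wikisource ID as an edition identifier.
import_record = {
    "title": "Castaway on the Auckland Isles",
    "source_records": [
        "wikisource:en:Castaway_on_the_Auckland_Isles",
        # plus a wikidata source record for the fields taken from Wikidata
    ],
    "identifiers": {
        # The Wikisource ID identifies *this* derivative edition...
        "wikisource": ["en:Castaway_on_the_Auckland_Isles"],
        # ...while the Wikidata QID of the original scanned edition would
        # belong on that original edition's record, not on this one.
    },
}
```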

cdrini avatar Jun 12 '25 16:06 cdrini

Thinking about this a little bit more and looking at some of the other examples in the test run, I've come to the conclusion that new editions should NEVER be created. It appears that the intended policy is to add Wikisource information to the edition that was used for transcription, if it can be found, in the same way that an edition can have multiple IA scans associated with it. This makes sense to me.

I will assert that ALL Wikisource transcriptions will be from an edition which has already been cataloged and that any failure to find the correct edition is due to missing/bad Wikisource metadata, a faulty matching algorithm, or both. Almost all the material in Wikisource is a) well known and b) sourced from Internet Archive in the first place.

tfmorris avatar Jun 24 '25 23:06 tfmorris