Bulk upload delegation request
I have a repository of metadata for 12K+ Arabic books: https://github.com/avidseeker/arabooks
If there is a bot that could mass-upload them, that would be a great addition to OpenLibrary, which currently lacks a lot of Arabic book coverage. (There isn't even an Arabic translation of OpenLibrary: https://github.com/internetarchive/openlibrary/pull/9673)
Thanks in advance.
Edit: To complete this issue one would need to parse the TSV files found at https://github.com/avidseeker/arabooks and create JSONL files that look similar to this:
{"identifiers": {"open_textbook_library": ["1581"]}, "source_records": ["open_textbook_library:1581"], "title": "Legal Fundamentals of Healthcare Law", "languages": ["eng"], "subjects": ["Medicine", "Law"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2024", "authors": [{"name": "Tiffany Jackman"}], "lc_classifications": ["RA440", "KF385.A4"]}
{"identifiers": {"open_textbook_library": ["1580"]}, "source_records": ["open_textbook_library:1580"], "title": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us", "languages": ["eng"], "subjects": ["Humanities", "Literature, Rhetoric, and Poetry"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2023", "authors": [{"name": "Judy Young"}], "lc_classifications": ["PE1408"]}
The minimum required fields are: title, authors, publish_date, publishers, and source_records. The source_records value could be built from the name of the source plus an identifier; for example, for loal-en.tsv from the Library of Arabic Literature, the first item in the list might get "source_records": ["loal:9781479834129"].
Here are the publishers of the TSV files:
- awu-dam.tsv: Arab Writers Union
- lisanarb.tsv: contains a `pub` entry
- loal-en.tsv and loal-ar.tsv: Library of Arabic Literature
- shamela.tsv: contains a `publisher` entry. Dates need to be merged from shamela-dates.tsv by matching the same title entry (see the sketch below).
- waqfeya.tsv: set as `"publishers": ["????"]`, since publishers need to be determined on a one-by-one basis.
Specifically, the values taken from the TSV and converted into JSONL would need to follow this schema. A script to do this would probably use Python's csv module to read the TSV file, coerce each row into the format specified in the import schema, call json.dumps on each row, and write the results to a JSONL file.
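For illustration, a rough and untested sketch for one of the files (loal-en.tsv); the column names are assumptions and would need to match the actual TSV headers:

```python
import csv
import json

# Sketch: convert loal-en.tsv rows into import-schema records and write JSONL.
# Column names ("title", "author", "date", "isbn") are guesses.
with open("loal-en.tsv", encoding="utf-8", newline="") as tsv, \
        open("loal-en.jsonl", "w", encoding="utf-8") as out:
    for row in csv.DictReader(tsv, delimiter="\t"):
        record = {
            "title": row["title"],
            "authors": [{"name": row["author"]}],
            "publish_date": row["date"],
            "publishers": ["Library of Arabic Literature"],
            "source_records": [f"loal:{row['isbn']}"],
        }
        # ensure_ascii=False keeps Arabic text readable in the output file.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
```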
The output JSONL file could be tested using the endpoint from #8122, though you'd probably want to test with only a few records at a time rather than the whole file.
@avidseeker, it would be great to increase the Arabic book coverage. There is an import API for sufficiently privileged patrons, which is documented here: https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing.
More specifically, there's a batch import endpoint (for which I just added the initial documentation), which would allow one to create a batch of records as JSONL for importing by a staff member: https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing#batch-importing-jsonl.
Currently this requires the fields title, authors, publish_date, publishers, and source_records. I skimmed some of the records you have and noticed the publisher tends to be missing, but perhaps that requirement can be relaxed for this import, though that's not a call I can make unilaterally. Hopefully we can get an answer sometime on Monday, Pacific time.
Also, what's the source of the records? That could help answer what the source_records field should be. It looks as if some may also have a page count? The full record schema can be found here: https://github.com/internetarchive/openlibrary-client/tree/master/olclient/schemata.
Thank you, that JSONL schema would definitely be helpful.
As for data sources, I updated the README of the repo to include their status. I updated the shamela source, which is the biggest collection, to have the requested fields. I also updated the LisanArab library with URLs to book cover images. The data sources listed under completely-imported are ready to be used.
@avidseeker do any of the original sources provide their bibliographic data in library MARC format? I had a brief look and could not find any.
No. These libraries are very fragmented individual efforts, and many of them gradually disappear. Waqfeya.net, for example, has significantly fewer entries than it did just two years ago.
(And to add to @scottbarnes' answer, you basically need to coerce each book record into this format: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json, and then save the results in a JSONL file :+1:)
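To sanity-check the output before submitting anything, one option is the third-party jsonschema package run against a local copy of import.schema.json (a sketch; the file paths are assumptions, and any $ref to sibling schema files would need resolving):

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Load a local copy of the import schema.
with open("import.schema.json", encoding="utf-8") as f:
    schema = json.load(f)

# Report every JSONL line that does not conform to the schema.
with open("loal-en.jsonl", encoding="utf-8") as f:
    for n, line in enumerate(f, 1):
        try:
            validate(instance=json.loads(line), schema=schema)
        except ValidationError as e:
            print(f"line {n}: {e.message}")
```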
I understand. I'll take a look at it, but I might not have the time for it.
I opened this issue in hopes of finding someone experienced with scripting and parsing who has done a bulk import before. They might already have conversion scripts to make data files comply with the JSONL schema.
@avidseeker, unless you suddenly have more free time and are itching to work on this, with your permission I will edit your initial comment in this issue to add something along the lines of the following, in the hope it makes it more attractive to a contributor who might wish to work on it:
To complete this issue one would need to parse the TSV files found at https://github.com/avidseeker/arabooks and create JSONL files that look similar to this:
{"identifiers": {"open_textbook_library": ["1581"]}, "source_records": ["open_textbook_library:1581"], "title": "Legal Fundamentals of Healthcare Law", "languages": ["eng"], "subjects": ["Medicine", "Law"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2024", "authors": [{"name": "Tiffany Jackman"}], "lc_classifications": ["RA440", "KF385.A4"]}
{"identifiers": {"open_textbook_library": ["1580"]}, "source_records": ["open_textbook_library:1580"], "title": "Introduction to Literature: Fairy Tales, Folk Tales, and How They Shape Us", "languages": ["eng"], "subjects": ["Humanities", "Literature, Rhetoric, and Poetry"], "publishers": ["University of West Florida Pressbooks"], "publish_date": "2023", "authors": [{"name": "Judy Young"}], "lc_classifications": ["PE1408"]}
The minimum required fields are: title, authors, publish_date, publishers, and source_records. The source_records value could be built from the name of the source plus an identifier; for example, for loal-en.tsv from the Library of Arabic Literature, the first item in the list might get "source_records": ["loal:9781479834129"]. Many or perhaps all of the items in the TSVs won't have publishers listed, so to get around the import schema requirement, for now just set the value to "publishers": ["????"] and we can cross that bridge later.
Specifically, the values taken from the TSV and converted into JSONL would need to follow this schema. A script to do this would probably use Python's csv module to read the TSV file, coerce each row into the format specified in the import schema, call json.dumps on each row, and write the results to a JSONL file.
The output JSONL file could be tested using the endpoint from #8122, though you'd probably want to test with only a few records at a time rather than the whole file.
I think tasks like this are fun and I'm happy to help anyone interested in it.
Done. Thank you for breaking it down into steps. I added more clarification about the publishers part.
Great, thanks, @avidseeker!
Hello @scottbarnes & @mekarpeles!
I’d like to take on the task of converting the Arabic book metadata from the arabooks repository into the JSONL format required for OpenLibrary bulk upload. Here's my approach:
1. Understand the schema: Review OpenLibrary's JSONL schema.
2. Parse and convert: Write a Python script to parse the TSV files and convert them into JSONL format, filling missing data as needed.
3. Validate output: Test the output files with the batch import endpoint and ensure they meet OpenLibrary's standards.
4. Error handling: Address any edge cases and document assumptions.
With my experience in Python and data formatting, I’m confident I can handle this task efficiently. May I be assigned to this issue?
Thank you!
I assigned this to you, @SharkyBytes. The steps you've outlined sound correct to me. Please ask if you have any questions.
Hi @scottbarnes, @mekarpeles & @avidseeker,
I am only covering the minimum required fields: title, authors, publish_date, publishers, and source_records. But I have some questions about the lisanarb.tsv file and would appreciate your input.
-> What should we write in the source_records field? The columns in the lisanarb.tsv file are: title, author, editor, pub, pubplace, date, edition, vols, pages, url, ia_url, and cover_urls.
Should the source_records field be formatted as "lisanarb:<unique identifier like title or URL>", or do you have a specific format in mind?
-> Handling Arabic text encoding: In files containing Arabic, I see entries like the following when I convert to JSONL:
"title": "\u0627\u0644\u0642\u0631\u0627\u0631\u0627\u062a \u0627\u0644\u0646\u062d\u0648\u064a\u0629 \u0648\u0627\u0644\u062a\u0635\u0631\u064a\u0641\u064a\u0629 \u0644\u0645\u062c\u0645\u0639 \u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u0628\u0627\u0644\u0642\u0627\u0647\u0631\u0629", "author": "\u0627\u0644\u0623\u0633\u062a\u0627\u0630 \u0627\u0644\u062f\u0643\u062a\u0648\u0631 \u062e\u0627\u0644\u062f \u0628\u0646 \u0633\u0639\u0648\u062f \u0627\u0644\u0639\u0635\u064a\u0645\u064a"
Should I decode this into readable Arabic (e.g., using json.loads and ensuring it stays in UTF-8), or is it fine to leave it as-is for the JSONL file?
Let me know your thoughts on this!
Readable Arabic (Unicode/UTF-8) would be ideal where possible.
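For example, json.dumps escapes non-ASCII characters by default; passing ensure_ascii=False keeps the Arabic readable:

```python
import json

record = {"title": "القرارات النحوية والتصريفية لمجمع اللغة العربية بالقاهرة"}
print(json.dumps(record))                      # \u0627\u0644... escapes
print(json.dumps(record, ensure_ascii=False))  # readable Arabic
```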
And for this?
-> What should we write in the source_records field? The columns in the lisanarb.tsv file are: title, author, editor, pub, pubplace, date, edition, vols, pages, url, ia_url, and cover_urls.
Should the source_records field be formatted as "lisanarb:<unique identifier like title or URL>", or do you have a specific format in mind?
For source_records, let's just go with lisanarb:<identifier> for now. I too have wondered if there is a specific format we should be following when adding these, but I think there may not be.
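Something along these lines, assuming the url column is the most stable identifier available (the value shown is a stand-in):

```python
# Hypothetical: lisanarb.tsv has no dedicated ID column, so use the row's
# url column as the identifier part of source_records.
row = {"url": "https://example.org/book/123"}  # stand-in for a parsed TSV row
source_records = [f"lisanarb:{row['url']}"]
```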
As no PR is linked and no one is active here, can I start on this?
Hey @PredictiveManish, I'm already connected with @scottbarnes on Slack regarding this and am currently working on it. I'll drop a message if I'm unable to solve it, or Scott may reassign it to you later. Thanks!