openlibrary
Added logic for page dump and commented out test line
Closes #8401
This is a refactor that sorts every dump record type that is NOT one of
`/type/edition`,
`/type/author`,
`/type/work`,
`/type/redirect`,
`/type/list`
into a misc category. This category catches all `/type/page` records, in addition to every other type not in the list above.
The misc file should help provide a comprehensive inventory of pages in the dump that is used to generate the sitemap.
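For readers unfamiliar with the dump pipeline, here is a minimal sketch of the routing described above. The function and file names are hypothetical; the real logic lives in `openlibrary/data/dump.py` and `scripts/oldump.sh`.

```python
# Hypothetical sketch of the per-type routing described above; names are
# illustrative, not the actual dump.py implementation.
KNOWN_TYPES = {
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
}

def dump_filename(record_type: str) -> str:
    """Map a record's /type/* value to the dump file it belongs in."""
    if record_type in KNOWN_TYPES:
        # e.g. "/type/edition" -> "ol_dump_editions.txt.gz"
        return f"ol_dump_{record_type.rsplit('/', 1)[-1]}s.txt.gz"
    # Everything else (/type/page, /type/subject, legacy types, ...) lands in
    # the misc/"other" catch-all file.
    return "ol_dump_other.txt.gz"

print(dump_filename("/type/page"))  # -> ol_dump_other.txt.gz
```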
Technical
I only tested these changes with a subset of the full data (by commenting in line 38 of /scripts/oldump.sh).
When line 38 was commented in, I also had to change the `-z` on line 133 of /scripts/oldump.sh to `-n`
to avoid an error in /data/dump.py.
Testing
Screenshot
Ran `docker compose run --rm home make test`
Stakeholders
@jimchamp @RayBB
@merwhite11 Very excited for this and pleasantly surprised how simple the solution is :) Hope Jim can review it soon!
You may want to update the docstring here: https://github.com/internetarchive/openlibrary/blob/dda2d57e8c8bb65f7cc3e9e685175042618bec9d/openlibrary/data/dump.py#L212
@cdrini, blocking, please see https://github.com/internetarchive/openlibrary/issues/8401#issuecomment-2047735441
@cdrini, this is blocking @RayBB. Could you review this at the earliest?
@cdrini is there any way we can do a test run that doesn't upload to any items?
We want to avoid preferences, anything store-related (which already shouldn't be there), PII (personally identifiable information), etc.
Taking a look at this now; checking latest dumps to see what types of records we would get. This is currently running. (Written with some help from ChatGPT!)
python3 <<'EOF' | gzip > ol_dump_other.txt.gz
import gzip
import requests

url = 'https://openlibrary.org/data/ol_dump_latest.txt.gz'
exclude_types = {
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
}
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with gzip.GzipFile(fileobj=r.raw) as f:
        for line in f:
            line = line.decode('utf-8')
            if line.split('\t', 1)[0] not in exclude_types:
                print(line, end='')
EOF
As I mentioned two months ago https://github.com/internetarchive/openlibrary/issues/8401#issuecomment-2007353954 all this does is split the complete dump, which is already filtered. If there's a question about the contents, the complete dump creation is where it should be reviewed/fixed.
Yes, this is mostly about getting transparency into what's there. Here's the breakdown by type:
$ zcat ol_dump_other.txt.gz | cut -f1 | sort | uniq -c
1 /type/about
7 /type/backreference
3 /type/collection
2583653 /type/delete
5 /type/doc
11 /type/home
966 /type/i18n
34 /type/i18n_page
467 /type/language
324 /type/library
14 /type/local_id
126 /type/macro
5 /type/object
439 /type/page
12 /type/permission
1 /type/place
47 /type/rawtext
1 /type/scan_location
2 /type/scan_record
3 /type/series
91400 /type/subject
14 /type/tag
300 /type/template
48 /type/type
1 /type/uri
2 /type/user
19 /type/usergroup
107 /type/volume
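For anyone who prefers to stay in Python, here is a minimal equivalent of the shell one-liner above; it assumes the ol_dump_other.txt.gz produced by the earlier script is on disk.

```python
# Count records per /type/* in the filtered dump, mirroring
# `zcat ol_dump_other.txt.gz | cut -f1 | sort | uniq -c`.
import gzip
from collections import Counter

counts = Counter()
with gzip.open("ol_dump_other.txt.gz", "rt", encoding="utf-8") as f:
    for line in f:
        counts[line.split("\t", 1)[0]] += 1

for record_type in sorted(counts):
    print(f"{counts[record_type]:>9} {record_type}")
```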
This seems fine. Lots of ancient legacy stuff here :P This `other` file is a bit problematic long-term, since we might decide to split some of these out into separate dumps going forward, which would be a breaking change to this catch-all other dump file. I'm wondering if we might already want to separate out `/type/delete`, since that's a substantial enough slice at this point. Thoughts? Let me know if you want the file; gzipped it's 78M, but if you're just curious what these things are, you can hit up the query.json endpoint, e.g. http://openlibrary.org/query.json?type=/type/user&*=
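If you just want to eyeball a few of these records without downloading anything, the query.json endpoint mentioned above can be queried from a script; a small sketch, assuming the `limit` parameter behaves as in the standard Infogami query API:

```python
# Fetch a handful of records of a given type via the query.json endpoint.
# The `limit` parameter is assumed to work as in the Infogami query API.
import requests

resp = requests.get(
    "https://openlibrary.org/query.json",
    params={"type": "/type/user", "*": "", "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for record in resp.json():
    print(record["key"])
```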
78 MB for "other" doesn't seem excessive compared to the other file sizes. It's certainly a lot better than the 13.6 GB currently required to get any of the data that isn't broken out separately. One might even argue that editions, works, authors, and "other" would be an adequate breakdown. Reading log, redirects, lists, ratings, and everything else would still be less than 200 MB and less than half the size of the authors file.
File | Date | Size |
---|---|---|
ol_dump_2024-04-30.txt.gz | 02-May-2024 02:35 | 13.6G |
ol_dump_editions_2024-04-30.txt.gz | 02-May-2024 02:38 | 9.5G |
ol_dump_works_2024-04-30.txt.gz | 02-May-2024 02:39 | 3.0G |
ol_dump_authors_2024-04-30.txt.gz | 02-May-2024 02:37 | 579.4M |
ol_dump_reading-log_2024-04-30.txt.gz | 02-May-2024 02:36 | 73.5M |
ol_dump_redirects_2024-04-30.txt.gz | 02-May-2024 02:36 | 49.5M |
ol_dump_lists_2024-04-30.txt.gz | 02-May-2024 02:37 | 26.5M |
ol_dump_ratings_2024-04-30.txt.gz | 02-May-2024 02:36 | 4.6M |
I have no strong opinion on splitting. I'll just be happy once it's easier to get access to these smaller sections of the dump.
It seems OK to put stuff in this other dump even if it could be moved out to another dump years down the line. Maybe it's slightly better for consumers to have just one file to grab for all this stuff rather than many small ones?
Oh agreed; I mean we could decide down the line to split some of these out; e.g. `tags` is currently tiny, but if we at some point make a tag for every subject in our db, we'll likely split it out into its own dump, and then this file will have a breaking change. But I guess that's OK, we'll just have to make a public notification. @RayBB would you mind updating the docs to also link to the redirects dump? That'll make the "other" nature of this one clearer.
Errr, actually I think we should split out the deletes now; since we already have redirects in a separate dump, the "other" dump would be something like 95% deletes otherwise, which I think makes it less useful.
@cdrini I don't know how to make an https://openlibrary.org/data/ol_dump_redirects_latest.txt.gz link like we have for https://openlibrary.org/data/ol_dump_ratings_latest.txt.gz
However, I updated the docs with placeholders :)
Also, I've gotten confused about this before, but how can I find old ratings dumps? I don't see them at https://archive.org/details/ol_exports?tab=collection&query=ratings and I tried searching IA for the file name with no luck.
Ah, they're inside the full dump item, e.g. https://archive.org/download/ol_dump_2024-03-31 . Is that what you're looking for?
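If you want to do that lookup programmatically, the archive.org metadata API lists the files inside an item; a quick sketch, with the item identifier taken from the link above:

```python
# List the ratings dump files inside an archive.org item via the public
# metadata API (https://archive.org/metadata/<identifier>).
import requests

item = "ol_dump_2024-03-31"
meta = requests.get(f"https://archive.org/metadata/{item}", timeout=30).json()
for f in meta.get("files", []):
    if "ratings" in f["name"]:
        print(f["name"], f.get("size"))
```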
@RayBB I updated the endpoint to include the short links for `redirects`, `deletes`, and `other`.
:+1:
Ok, this looks good to me! I've deployed it to testing to check the new endpoints. @merwhite11 would you mind giving it another test to make sure my last changes didn't break anything? :P Then it should be good to merge!
Confirmed new endpoints work on testing :+1:
Docs are updated with the forthcoming dump links: https://openlibrary.org/developers/dumps
@merwhite11 tested and it correctly generated all the files :+1: LGTM!