openlibrary
Added logic for page dump and commented out test line
Closes #8401
This is a refactor that sorts every dump record type that is NOT one of
`/type/edition`,
`/type/author`,
`/type/work`,
`/type/redirect`,
`/type/list`
into a misc category. This category catches all `/type/page` records, in addition to every other type not in the list above.
The misc file should help provide a comprehensive inventory of pages in the dump that is used to generate the sitemap.
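For readers unfamiliar with the dump pipeline, here is a minimal sketch of the routing described above. The function and file names are hypothetical; the real logic lives in `openlibrary/data/dump.py` and `scripts/oldump.sh`.

```python
# Hypothetical sketch of the per-type routing described above; names are
# illustrative, not the actual dump.py implementation.
KNOWN_TYPES = {
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
}

def dump_filename(record_type: str) -> str:
    """Map a record's /type/* value to the dump file it belongs in."""
    if record_type in KNOWN_TYPES:
        # e.g. "/type/edition" -> "ol_dump_editions.txt.gz"
        return f"ol_dump_{record_type.rsplit('/', 1)[-1]}s.txt.gz"
    # Everything else (/type/page, /type/subject, legacy types, ...) lands in
    # the misc/"other" catch-all file.
    return "ol_dump_other.txt.gz"

print(dump_filename("/type/page"))  # -> ol_dump_other.txt.gz
```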
Technical
I only tested these changes with a subset of the full data (by commenting in line 38 of /scripts/oldump.sh).
When line 38 was commented in, I also had to change the `-z` on line 133 of /scripts/oldump.sh to `-n`
to avoid an error in /data/dump.py.
Testing
Screenshot
Ran `docker compose run --rm home make test`
Stakeholders
@jimchamp @RayBB
@merwhite11 Very excited for this and pleasantly surprised how simple the solution is :) Hope Jim can review it soon!
You may want to update the docstring here: https://github.com/internetarchive/openlibrary/blob/dda2d57e8c8bb65f7cc3e9e685175042618bec9d/openlibrary/data/dump.py#L212
@cdrini, blocking, please see https://github.com/internetarchive/openlibrary/issues/8401#issuecomment-2047735441
@cdrini, this is blocking @RayBB. Could you review this at the earliest?
@cdrini is there any way we can do a test run that doesn't upload to any items?
We want to avoid preferences, anything store-related (which already shouldn't be there), PII (personally identifiable information), etc.
Taking a look at this now; checking latest dumps to see what types of records we would get. This is currently running. (Written with some help from ChatGPT!)
python3 <<'EOF' | gzip > ol_dump_other.txt.gz
import gzip
import requests

url = 'https://openlibrary.org/data/ol_dump_latest.txt.gz'
exclude_types = {
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
}
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with gzip.GzipFile(fileobj=r.raw) as f:
        for line in f:
            line = line.decode('utf-8')
            if line.split('\t', 1)[0] not in exclude_types:
                print(line, end='')
EOF
As I mentioned two months ago https://github.com/internetarchive/openlibrary/issues/8401#issuecomment-2007353954 all this does is split the complete dump, which is already filtered. If there's a question about the contents, the complete dump creation is where it should be reviewed/fixed.
Yes, this is mostly about getting transparency into what's there. Here's the breakdown by type:
$ zcat ol_dump_other.txt.gz | cut -f1 | sort | uniq -c
1 /type/about
7 /type/backreference
3 /type/collection
2583653 /type/delete
5 /type/doc
11 /type/home
966 /type/i18n
34 /type/i18n_page
467 /type/language
324 /type/library
14 /type/local_id
126 /type/macro
5 /type/object
439 /type/page
12 /type/permission
1 /type/place
47 /type/rawtext
1 /type/scan_location
2 /type/scan_record
3 /type/series
91400 /type/subject
14 /type/tag
300 /type/template
48 /type/type
1 /type/uri
2 /type/user
19 /type/usergroup
107 /type/volume
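For anyone who prefers to stay in Python, here is a minimal equivalent of the shell one-liner above; it assumes the ol_dump_other.txt.gz produced by the earlier script is on disk.

```python
# Count records per /type/* in the filtered dump, mirroring
# `zcat ol_dump_other.txt.gz | cut -f1 | sort | uniq -c`.
import gzip
from collections import Counter

counts = Counter()
with gzip.open("ol_dump_other.txt.gz", "rt", encoding="utf-8") as f:
    for line in f:
        counts[line.split("\t", 1)[0]] += 1

for record_type in sorted(counts):
    print(f"{counts[record_type]:>9} {record_type}")
```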
This seems fine. Lots of ancient legacy stuff here :P This `other` file is a bit problematic long-term, since we might decide to split some of these out into separate dumps going forward, which would be a breaking change to this catch-all other dump file. I'm wondering if we might already want to separate out `/type/delete`, since that's a substantial enough slice at this point. Thoughts? Let me know if you want the file; gzipped it's 78M, but if you're just curious what these things are, you can hit up the query.json endpoint, e.g. http://openlibrary.org/query.json?type=/type/user&*=
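If you just want to eyeball a few of these records without downloading anything, the query.json endpoint mentioned above can be queried from a script; a small sketch, assuming the `limit` parameter behaves as in the standard Infogami query API:

```python
# Fetch a handful of records of a given type via the query.json endpoint.
# The `limit` parameter is assumed to work as in the Infogami query API.
import requests

resp = requests.get(
    "https://openlibrary.org/query.json",
    params={"type": "/type/user", "*": "", "limit": 5},
    timeout=30,
)
resp.raise_for_status()
for record in resp.json():
    print(record["key"])
```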
78 MB for "other" doesn't seem excessive compared to the other file sizes. It's certainly a lot better than the 13.6 GB currently required to get any of the data that isn't broken out separately. One might even argue that editions, works, authors, and "other" would be an adequate breakdown. Reading log, redirects, lists, ratings, and everything else would still be less than 200 MB and less than half the size of the authors file.
File | Date | Size |
---|---|---|
ol_dump_2024-04-30.txt.gz | 02-May-2024 02:35 | 13.6G |
ol_dump_editions_2024-04-30.txt.gz | 02-May-2024 02:38 | 9.5G |
ol_dump_works_2024-04-30.txt.gz | 02-May-2024 02:39 | 3.0G |
ol_dump_authors_2024-04-30.txt.gz | 02-May-2024 02:37 | 579.4M |
ol_dump_reading-log_2024-04-30.txt.gz | 02-May-2024 02:36 | 73.5M |
ol_dump_redirects_2024-04-30.txt.gz | 02-May-2024 02:36 | 49.5M |
ol_dump_lists_2024-04-30.txt.gz | 02-May-2024 02:37 | 26.5M |
ol_dump_ratings_2024-04-30.txt.gz | 02-May-2024 02:36 | 4.6M |
I have no strong opinion on splitting. I'll just be happy once it's easier to get access to these smaller sections of the dump.
It seems OK to put stuff in this other dump even if it could be moved out to another dump years down the line. Maybe it's slightly better for consumers to have just one file to grab for all this stuff rather than many small ones?
Oh agreed; I mean we could decide down the line to split some of these out; e.g. `tags` is currently tiny, but if we at some point make a tag for every subject in our db, we'll likely split it out into its own dump, and then this file will have a breaking change. But I guess that's OK, we'll just have to make a public notification. @RayBB would you mind updating the docs to also link to the redirects dump? That'll make the "other" nature of this one clearer.
Errr, actually I think we should split out the deletes now; since we already have redirects in a separate dump, the "other" dump would be something like 95% deletes otherwise, which I think makes it less useful.
@cdrini I don't know how to make an https://openlibrary.org/data/ol_dump_redirects_latest.txt.gz link like we have for https://openlibrary.org/data/ol_dump_ratings_latest.txt.gz
However, I updated the docs with placeholders :)
Also, I've gotten confused about this before, but how can I find old ratings dumps? I don't see them at https://archive.org/details/ol_exports?tab=collection&query=ratings and I tried searching IA for the file name with no luck.
Ah, they're inside the full dump item, e.g. https://archive.org/download/ol_dump_2024-03-31 . Is that what you're looking for?
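If you want to do that lookup programmatically, the archive.org metadata API lists the files inside an item; a quick sketch, with the item identifier taken from the link above:

```python
# List the ratings dump files inside an archive.org item via the public
# metadata API (https://archive.org/metadata/<identifier>).
import requests

item = "ol_dump_2024-03-31"
meta = requests.get(f"https://archive.org/metadata/{item}", timeout=30).json()
for f in meta.get("files", []):
    if "ratings" in f["name"]:
        print(f["name"], f.get("size"))
```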
@RayBB I updated the endpoint to include the short links for `redirects`, `deletes`, and `other`.
:+1:
Ok, this looks good to me! I've deployed it to testing to check the new endpoints. @merwhite11 would you mind giving it another test to make sure my last changes didn't break anything? :P Then it should be good to merge!
Confirmed new endpoints work on testing :+1:
Docs are updated with the forthcoming dump links: https://openlibrary.org/developers/dumps
@merwhite11 tested and it correctly generated all the files :+1: LGTM!