openlibrary

Make data dumps for /type/page

Open RayBB opened this issue 8 months ago • 20 comments

Describe the problem that you'd like solved

I would like to get a data dump of just the /type/page entities. I can see that there are some in the all_types_dump but it's too big for me to download.

I was just running: curl -s -L https://openlibrary.org/data/ol_dump_latest.txt.gz | gunzip -c | grep '/type/page'
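For anyone who would rather do the same filtering in Python, here is a minimal sketch. It assumes the dump has already been downloaded to a local `.txt.gz` file and that the record type is the first tab-separated column; the function names are made up for illustration. Unlike the `grep` above, it matches only on the type column, so records that merely mention `/type/page` in their JSON are not returned.

```python
import gzip
from typing import Iterable, Iterator


def filter_by_type(lines: Iterable[str], wanted_type: str) -> Iterator[str]:
    """Yield dump lines whose first tab-separated column equals wanted_type."""
    for line in lines:
        record_type, _, _rest = line.partition("\t")
        if record_type == wanted_type:
            yield line


def filter_gzip_dump(path: str, wanted_type: str) -> Iterator[str]:
    """Stream a local gzip dump line by line and yield only matching records."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        yield from filter_by_type(dump, wanted_type)
```

Because the file is streamed, memory use stays flat, but the full dump still has to be downloaded first, which is the problem this issue is about.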

Why?

  1. I've been working on cleaning up our docs and I'd like to be able to more easily search the docs that are on openlibrary.org using tools like grep.
  2. As I understand it, the sitemap.xml is generated by these dumps and I'm wondering if we should in the future make the sitemaps have our pages on them for easier searching.
  3. I'm wondering if we can put them into solr and have a nice search for our docs but before doing that I'd like to be able to see what docs we actually have.

Proposal & Constraints

I poked around briefly at the data dump code and I think it could be as simple as adding it here: https://github.com/internetarchive/openlibrary/blob/085702675121b98907255ae204abca44cba7c51a/openlibrary/data/dump.py#L215-L221

We'd also want to make one of these special links like: https://openlibrary.org/data/ol_dump_authors_latest.txt.gz to redirect to the pages dump.

Additional context

https://github.com/internetarchive/openlibrary/wiki/Sitemap-Generation https://github.com/internetarchive/openlibrary/wiki/Generating-Data-Dumps https://openlibrary.org/developers/dumps

Stakeholders

RayBB avatar Oct 08 '23 22:10 RayBB

A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.

tfmorris avatar Oct 09 '23 18:10 tfmorris

> A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.

Is there a way to determine the size of the large dumps, or should we just mention it in a set? If this is clarified, I can attempt to add a function that handles this logic or incorporate it into the split function.

Billa05 avatar Mar 18 '24 22:03 Billa05

I don't think anything super fancy or dynamic is needed. I would look into changing the logic of split_dump() to write files for editions, works, authors, and then everything else (perhaps called "misc" or something similar). Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them. Lists and redirects total about 75MB currently, so the new "everything else" dump should be much less than 100MB and will automatically include new types as they're introduced.
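If someone does want numbers before deciding which types count as bulky, a rough tally per type is easy to get. The sketch below is only an illustration: `size_by_type` is a hypothetical helper, it assumes the type is the first tab-separated column of each line, and it reports uncompressed sizes, so the compressed files will be smaller.

```python
from collections import Counter
from typing import Iterable, Tuple


def size_by_type(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (record counts, uncompressed byte totals) keyed by the type column."""
    counts: Counter = Counter()
    sizes: Counter = Counter()
    for line in lines:
        record_type = line.split("\t", 1)[0]
        counts[record_type] += 1
        sizes[record_type] += len(line.encode("utf-8"))
    return counts, sizes
```

Passing it a file object opened with `gzip.open(path, "rt")` would tally an existing dump without unpacking it to disk.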

(As an aside, the lists and redirects dumps are not currently mentioned on the wiki page. You can only find them by going to the dump directory.)

tfmorris avatar Mar 19 '24 14:03 tfmorris

Can this be assigned to me if Meredith doesn't want to take it?

Realmbird avatar Apr 09 '24 16:04 Realmbird

@merwhite11 said she'd like to work on this so I'll assign her!

RayBB avatar Apr 09 '24 16:04 RayBB

I’d love to work on this. Please assign it to me! :)

merwhite11 avatar Apr 09 '24 16:04 merwhite11

@merwhite11, limit the scope of this to /type/page data. An "everything else" data dump will need to be audited before being published, as this will include patron preferences and perhaps other personal information.

jimchamp avatar Apr 09 '24 22:04 jimchamp

@jimchamp as I mentioned above:

> Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them.

All the secondary dumps are subsets of the full dump generated by split_dump. If you believe there's an exposure here, it already exists in the full dump.

https://github.com/internetarchive/openlibrary/blob/085702675121b98907255ae204abca44cba7c51a/openlibrary/data/dump.py#L48-L64

tfmorris avatar Apr 10 '24 14:04 tfmorris

@jimchamp @tfmorris Paraphrasing to make sure I'm understanding correctly:

We can create a separate type in the split_dump function that grabs all misc records that don't fall into the pre-existing types. Due to the filters on lines 49 and 55, we don't need to worry about user data getting into the dump file. Ideally this will result in a misc file < 100MB that @RayBB can use as an inventory of pages to eventually be included in sitemap.xml.

Data dump -> sitemap.xml (with pages) -> sitemap in solr

eg:

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",
        # Add a catch-all type
        "/type/misc",
    )

    # Then add an else block to write to the misc file
    stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
    for i, line in enumerate(stdin):
        if i % 1_000_000 == 0:
            log(f"split_dump {i:,}")
        type, rest = line.split("\t", 1)
        if type in files:
            files[type].write(line)
        # else: files[misc].write(line)

merwhite11 avatar Apr 11 '24 20:04 merwhite11

@RayBB In terms of generating the special links (https://openlibrary.org/data/ol_dump_authors_latest.txt.gz) to redirect to the pages dump --

My understanding is that these URLs are created in the make_index function in dump.py. I would need to add some logic there to account for the misc file.

merwhite11 avatar Apr 11 '24 20:04 merwhite11

Linking to the slack conversation with my progress and questions here: https://internetarchive.slack.com/archives/C0ETZV72L/p1712873368854129

merwhite11 avatar Apr 17 '24 03:04 merwhite11

@merwhite11, you may want to try something like this:

    types = (
        "/type/edition",
        "/type/author",
        "/type/work",
        "/type/redirect",
        "/type/list",  # Remove /type/misc
    )

    # Create file for all other types:
    files['misc'] = xopen(format % 'misc', 'wt')

    # In the else block, write to the misc file
    stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
    for i, line in enumerate(stdin):
        if i % 1_000_000 == 0:
            log(f"split_dump {i:,}")
        type, rest = line.split("\t", 1)
        if type in files:
            files[type].write(line)
        else:
            files['misc'].write(line)

This should write all other types to a single file. Disregard what I said about limiting this to only /type/page; I didn't see @tfmorris's comments about this earlier. It wouldn't be trivial to do something like this anyway, as I'm noticing the type for some pages is type: {"key": "/type/page"} (for example, this collections page).
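To try the catch-all idea outside the Open Library codebase, a self-contained version could look like the sketch below. It is not the real split_dump: `split_with_misc`, the `out_dir` parameter, and the `ol_dump_%s.txt.gz` naming pattern are assumptions for illustration, and it uses `gzip.open` in place of `xopen`.

```python
import gzip
import os
from typing import Dict, Iterable, TextIO

KNOWN_TYPES = (
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
)


def split_with_misc(
    lines: Iterable[str],
    out_dir: str,
    name_format: str = "ol_dump_%s.txt.gz",  # assumed naming pattern
) -> Dict[str, int]:
    """Write known types to their own gzip files and everything else to "misc".

    Returns the number of lines written per output name.
    """
    # Map "/type/work" -> "works", and so on, for the output file names.
    names = {t: t.rsplit("/", 1)[-1] + "s" for t in KNOWN_TYPES}
    files: Dict[str, TextIO] = {}
    written: Dict[str, int] = {}
    try:
        for name in list(names.values()) + ["misc"]:
            path = os.path.join(out_dir, name_format % name)
            files[name] = gzip.open(path, "wt", encoding="utf-8")
            written[name] = 0
        for line in lines:
            record_type = line.split("\t", 1)[0]
            # Any type without its own file falls through to "misc".
            name = names.get(record_type, "misc")
            files[name].write(line)
            written[name] += 1
    finally:
        for handle in files.values():
            handle.close()
    return written
```

Because unknown types fall through to "misc" by default, a newly introduced type would land there without any code change, which is the property tfmorris described.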

You may want to create a page locally to test for this. Here's how to make a local /collections page:

  1. While logged in, navigate to localhost:8080/collections
  2. Click the "Create it?" link
  3. Fill out the "Title" and "Document Body" fields, then submit the form
  4. Check the type at localhost:8080/collections.json
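For step 4, the type can also be checked programmatically once the JSON has been fetched. The helper below is a hypothetical sketch that accepts either shape of the `type` field, a plain string or a `{"key": ...}` object; the sample document is made up.

```python
import json
from typing import Any, Dict


def type_key(doc: Dict[str, Any]) -> str:
    """Return the type key whether "type" is a string or a {"key": ...} object."""
    value = doc["type"]
    if isinstance(value, dict):
        return value["key"]
    return value


# Made-up document in the shape described above.
doc = json.loads('{"key": "/collections", "type": {"key": "/type/page"}}')
print(type_key(doc))  # prints /type/page
```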

jimchamp avatar Apr 18 '24 00:04 jimchamp

@jimchamp Thank you for these suggestions! I'm trying this approach, but a few things are still unclear. I'm also not able to find the test 'type/page' page that I created in the dump file.

When I zcat the files being written to files['misc'], it is mostly /type/language, with a few type/type, type/object, type/usergroup and type/page.

Are we assuming that all pages already have the type/page associated with them? Or is it possible that a page could be labelled as a type/edition or type/work for example?

Basically, I'm confused as to why there are so few type/pages.

merwhite11 avatar Apr 18 '24 22:04 merwhite11

Local instances don't have much data pre-loaded. When I checked the other day, there were only three /type/page pages there.

After implementing the above changes, I'm seeing the /collections pages that I created in the misc dump. Without seeing your code, I'm not sure why you can't find the page that you created. Maybe using grep on the file would help you find it? My misc dump has over 500 entries....
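A Python alternative to grep for locating one record is sketched below. It assumes the record key is the second tab-separated column of each dump line; `find_key`, `find_key_in_gzip`, and any file name passed to them are illustrative, not part of the codebase.

```python
import gzip
from typing import Iterable, Optional


def find_key(lines: Iterable[str], key: str) -> Optional[str]:
    """Return the first dump line whose second tab-separated column equals key."""
    for line in lines:
        columns = line.rstrip("\n").split("\t", 2)
        if len(columns) > 1 and columns[1] == key:
            return line
    return None


def find_key_in_gzip(path: str, key: str) -> Optional[str]:
    """Search a local gzip dump for a key without decompressing it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        return find_key(dump, key)
```

Matching on the key column avoids false hits from records that only link to the page in their JSON body.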

jimchamp avatar Apr 19 '24 01:04 jimchamp

@jimchamp my bad! I do find my test page when I grep. A few more questions...

Can we assume that if we were running this in the prod env, there would be a lot more pages, and that there would be a type/page for every page on the site?

In terms of generating the path to the split dump (https://openlibrary.org/data/ol_dump_authors_latest.txt.gz), is this something I can test for? It doesn't seem to be entering the make_index function in dump.py in test mode.

Would creating the path for /type/page look something like this?

    # Add /type/page here
    if type in ("/type/edition", "/type/work", "/type/page"):
        title = data.get("title", "untitled")
        path = key + "/" + urlsafe(title)
    elif type in ("/type/author", "/type/list"):
        title = data.get("name", "unnamed")
        path = key + "/" + urlsafe(title)
    else:
        title = data.get("title", key)
        path = key
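To make the effect of that change concrete, here is a standalone sketch of the same branching. `urlsafe_stub` is a hypothetical stand-in for the real `urlsafe` helper and may not produce the same slugs, and `title_and_path` is a made-up wrapper; only the if/elif/else logic mirrors the snippet above.

```python
import re
from typing import Any, Dict, Tuple


def urlsafe_stub(title: str) -> str:
    """Hypothetical stand-in for urlsafe(): collapse non-word runs to underscores."""
    return re.sub(r"\W+", "_", title).strip("_")


def title_and_path(record_type: str, key: str, data: Dict[str, Any]) -> Tuple[str, str]:
    """Compute (title, path) with the same branching as the snippet above."""
    if record_type in ("/type/edition", "/type/work", "/type/page"):
        title = data.get("title", "untitled")
        path = key + "/" + urlsafe_stub(title)
    elif record_type in ("/type/author", "/type/list"):
        title = data.get("name", "unnamed")
        path = key + "/" + urlsafe_stub(title)
    else:
        title = data.get("title", key)
        path = key
    return title, path
```

With `/type/page` in the first branch, a page's path gains a title slug after its key; for any other unlisted type the path is just the key.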

Thanks again for the help!

merwhite11 avatar Apr 19 '24 17:04 merwhite11

There will be more /type/pages in production, but there will not be a /type/page for each page. For example, work pages will have /type/work, edition pages will have /type/edition, author pages will have /type/author, etc (see this, this, and this, respectively).

Some of our pages are /type/i18n_page (like the root /collections page), while others are actually /type/page (like this collection). I'd expect your changes to capture these types and all of the other ones that we don't already have dumps for.

I don't really understand the code snippet that you provided, and I don't understand what "creating the path for /type/page" means in the given context. Could you push the code that you have now to your repo? There's no need to create a PR now, I'd just like to test your code to better understand what is happening.

jimchamp avatar Apr 19 '24 18:04 jimchamp

Ok, that makes sense. type/page applies to pages that don't already fall into another type. In this 'misc pages dump', we want to get all the type/pages AND all other misc types.

The code snippet is part of the make_index function in dump.py: here

Here's my last push to my fork. I haven't made many changes. Thanks for taking a look! https://github.com/internetarchive/openlibrary/compare/master...merwhite11:openlibrary:8401/Fix/Make-Change-to-oldump

merwhite11 avatar Apr 19 '24 19:04 merwhite11

Thanks. I can't find evidence of make_index being used today, so you can revert those changes.

Make sure to remove unrelated changes before opening a PR.

jimchamp avatar Apr 19 '24 20:04 jimchamp

@jimchamp bumping the priority of this because it would be very helpful for me and the PR has been open a few weeks.

RayBB avatar May 10 '24 17:05 RayBB

Drini is currently in Albania and it may take some time before we can follow up, @RayBB.

mekarpeles avatar May 13 '24 19:05 mekarpeles