openlibrary
Make data dumps for /type/page
Describe the problem that you'd like solved
I would like to get a data dump of just the /type/page entities.
I can see that there are some in the all_types_dump
but it's too big for me to download.
I was just running:

```sh
curl -s -L https://openlibrary.org/data/ol_dump_latest.txt.gz | gunzip -c | grep '/type/page'
```
Why?
- I've been working on cleaning up our docs and I'd like to be able to more easily search the docs that are on openlibrary.org using tools like grep.
- As I understand it, the sitemap.xml is generated by these dumps and I'm wondering if we should in the future make the sitemaps have our pages on them for easier searching.
- I'm wondering if we can put them into solr and have a nice search for our docs but before doing that I'd like to be able to see what docs we actually have.
Proposal & Constraints
I poked around briefly at the data dump code and I think it could be as simple as adding it here: https://github.com/internetarchive/openlibrary/blob/085702675121b98907255ae204abca44cba7c51a/openlibrary/data/dump.py#L215-L221
We'd also want to make one of these special links like: https://openlibrary.org/data/ol_dump_authors_latest.txt.gz to redirect to the pages dump.
Additional context
- https://github.com/internetarchive/openlibrary/wiki/Sitemap-Generation
- https://github.com/internetarchive/openlibrary/wiki/Generating-Data-Dumps
- https://openlibrary.org/developers/dumps
Stakeholders
A better approach might be to create an "everything else" dump which covers everything except the bulky types. There's no real reason for redirects, lists, etc to all have their own separate files when they are so small. This would also automatically cover any newly added types until they got bulky enough to warrant splitting out.
Is there a way to determine the size of the large dumps, or should we just list them explicitly in a set? If this is clarified, I can attempt to add a function that handles this logic or incorporate it into the split function.
I don't think anything super fancy or dynamic is needed. I would look into changing the logic of split_dump()
to write files for editions, works, authors, and then everything else (perhaps called "misc" or something similar). Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them. Lists and redirects total about 75MB currently, so the new "everything else" dump should be much less than 100MB and will automatically include new types as they're introduced.
(As an aside, the lists and redirects dumps are not currently mentioned on the wiki page. You can only find them by going to the dump directory.)
Can this be assigned to me if Meredith doesn't want it?
@merwhite11 said she'd like to work on this, so I'll assign her!
I'd love to work on this. Please assign it to me! :)
@merwhite11, limit the scope of this to /type/page data. An "everything else" data dump will need to be audited before being published, as this will include patron preferences and perhaps other personal information.
@jimchamp as I mentioned above:

> Things like user pages and admin pages are already filtered when the initial dump is written, so you don't need to worry about them.

All the secondary dumps are subsets of the full dump generated by split_dump. If you believe there's an exposure here, it already exists in the full dump.
https://github.com/internetarchive/openlibrary/blob/085702675121b98907255ae204abca44cba7c51a/openlibrary/data/dump.py#L48-L64
@jimchamp @tfmorris Paraphrasing to make sure I'm understanding correctly:
We can create a separate type in the split_dump function that catches all misc records that don't fall into pre-existing types. Due to the filters on lines 49 and 55, we don't need to worry about user data getting into the dump file. Ideally this will result in a misc file < 100MB that @RayBB can use as an inventory of pages to eventually be included in sitemap.xml.
Data dump -> sitemap.xml (with pages) -> sitemap in solr
e.g.:

```python
types = (
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
    # add a catch-all type
    "/type/misc",
)

# Then add an else block to write to the misc file
stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
for i, line in enumerate(stdin):
    if i % 1_000_000 == 0:
        log(f"split_dump {i:,}")
    type, rest = line.split("\t", 1)
    if type in files:
        files[type].write(line)
    # else: files["misc"].write(line)
```
@RayBB In terms of generating the special links (https://openlibrary.org/data/ol_dump_authors_latest.txt.gz) to redirect to the pages dump: my understanding is that these URLs are created in the make_index function in dump.py. I would need to add some logic there to account for the misc file.
Linking to the slack conversation with my progress and questions here: https://internetarchive.slack.com/archives/C0ETZV72L/p1712873368854129
@merwhite11, you may want to try something like this:
```python
types = (
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",  # Remove /type/misc
)

# Create a file for all other types:
files['misc'] = xopen(format % 'misc', 'wt')

# In the else block, write to the misc file
stdin = xopen(dump_file, "rt") if dump_file else sys.stdin
for i, line in enumerate(stdin):
    if i % 1_000_000 == 0:
        log(f"split_dump {i:,}")
    type, rest = line.split("\t", 1)
    if type in files:
        files[type].write(line)
    else:
        files['misc'].write(line)
```
This should write all other types to a single file. Disregard what I said about limiting this to only /type/page -- I didn't see @tfmorris's comments about this earlier. It wouldn't be trivial to do something like that anyway, as I'm noticing the type for some pages is type: {"key": "/type/page"} (for example, this collections page).
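The catch-all split above can be tried out as a self-contained sketch. This is not the real split_dump() code: file handles are replaced with in-memory buffers, and the names and sample lines are simplified stand-ins. Dump lines are assumed to be tab-separated with the type key in the first column, as in the snippets above.

```python
# Hedged sketch of the catch-all split logic: known types get their own
# bucket, everything else lands in a single "misc" bucket.
import io

KNOWN_TYPES = (
    "/type/edition",
    "/type/author",
    "/type/work",
    "/type/redirect",
    "/type/list",
)

def split_lines(lines):
    """Group dump lines (tab-separated, type first) by type,
    sending unknown types to a shared 'misc' bucket."""
    files = {t: io.StringIO() for t in KNOWN_TYPES}
    files["misc"] = io.StringIO()
    for line in lines:
        type_, _rest = line.split("\t", 1)
        files.get(type_, files["misc"]).write(line)
    return files

sample = [
    "/type/work\t/works/OL1W\t{}\n",
    "/type/page\t/collections\t{}\n",
    "/type/language\t/languages/eng\t{}\n",
]
buckets = split_lines(sample)
```

With this input, the work line stays in its own bucket, while the /type/page and /type/language lines both end up in "misc" -- which is how a newly introduced type would be captured automatically.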
You may want to create a page locally to test for this. Here's how to make a local /collections page:
- While logged in, navigate to localhost:8080/collections
- Click the "Create it?" link
- Fill out the "Title" and "Document Body" fields, then submit the form
- Check the type at localhost:8080/collections.json
@jimchamp Thank you for these suggestions! I'm trying this approach and am still unclear on a few things. I'm also not able to find the test /type/page page that I created in the dump file.
When I zcat the files being written to files['misc'], the majority is /type/language, with a few /type/type, /type/object, /type/usergroup, and /type/page entries.
Are we assuming that all pages already have /type/page associated with them? Or is it possible that a page could be labelled as a /type/edition or /type/work, for example?
Basically, I'm confused as to why there are so few /type/page entries.
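One quick way to tally the type distribution in a dump file, as described above, is to count the first tab-separated column of each line. This is a hedged sketch: the file name in the commented example is hypothetical, and the only assumption is the tab-separated layout with the type key first.

```python
# Count how many records of each type appear in a gzipped dump file.
import gzip
from collections import Counter

def count_types(path):
    counts = Counter()
    with gzip.open(path, "rt") as f:
        for line in f:
            counts[line.split("\t", 1)[0]] += 1
    return counts

# Hypothetical usage against a local misc dump:
# for type_key, n in count_types("ol_dump_misc.txt.gz").most_common():
#     print(type_key, n)
```

Running this against the misc dump would show at a glance whether /type/language really dominates and how many /type/page records exist.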
Local instances don't have much data pre-loaded. When I checked the other day, there were only three /type/page pages there.

After implementing the above changes, I'm seeing the /collections pages that I created in the misc dump. Without seeing your code, I'm not sure why you can't find the page that you created. Maybe using grep on the file would help you find it? My misc dump has over 500 entries.
@jimchamp my bad! I am getting my test file when I grep. A few more questions...
Can we assume that if we were running this in the prod env, there would be a lot more pages, and that there would be a /type/page record for every page on the site?
In terms of generating the path to the split dump (https://openlibrary.org/data/ol_dump_authors_latest.txt.gz), is this something I can test? It doesn't seem to enter the make_index function in dump.py in test mode.
Would creating the path for /type/page look something like this?

```python
# add /type/page here
if type in ("/type/edition", "/type/work", "/type/page"):
    title = data.get("title", "untitled")
    path = key + "/" + urlsafe(title)
elif type in ("/type/author", "/type/list"):
    title = data.get("name", "unnamed")
    path = key + "/" + urlsafe(title)
else:
    title = data.get("title", key)
    path = key
```
thanks again for the help!
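For reference, the branching in the snippet above can be exercised as a standalone sketch. The `urlsafe` implementation here is a simplified stand-in, not Open Library's actual helper, and `index_path` is a hypothetical wrapper introduced only for illustration.

```python
# Hedged sketch of the index-path branching discussed above.
import re

def urlsafe(title):
    # Simplified stand-in: collapse non-word runs to underscores and lowercase.
    # Not the exact Open Library implementation.
    return re.sub(r"\W+", "_", title).strip("_").lower()

def index_path(type_, key, data):
    if type_ in ("/type/edition", "/type/work", "/type/page"):
        return key + "/" + urlsafe(data.get("title", "untitled"))
    elif type_ in ("/type/author", "/type/list"):
        return key + "/" + urlsafe(data.get("name", "unnamed"))
    # everything else keeps its bare key
    return key

print(index_path("/type/page", "/collections", {"title": "Collections Page"}))
# -> /collections/collections_page
```

Under this sketch, a /type/page record gets a slugged title appended to its key, while an unhandled type such as /type/redirect just keeps its key unchanged.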
There will be more /type/page records in production, but there will not be a /type/page record for each page. For example, work pages will have /type/work, edition pages will have /type/edition, author pages will have /type/author, etc. (see this, this, and this, respectively).
Some of our pages are /type/i18n_page (like the root /collections page), while others are actually /type/page (like this collection). I'd expect your changes to capture these types and all of the other ones that we don't already have dumps for.
I don't really understand the code snippet that you provided, and I don't understand what "creating the path for /type/page" means in the given context. Could you push the code that you have now to your repo? There's no need to create a PR now; I'd just like to test your code to better understand what is happening.
Ok, that makes sense. /type/page applies to pages that don't already fall into another type. In this 'misc pages dump', we want to get all the /type/page records AND all other misc types.
The code snippet is part of the make_index function in dump.py: here
Here's my last push to my fork. I haven't made many changes. Thanks for taking a look! https://github.com/internetarchive/openlibrary/compare/master...merwhite11:openlibrary:8401/Fix/Make-Change-to-oldump
Thanks. I can't find evidence of make_index being used today, so you can revert those changes. Make sure to remove unrelated changes before opening a PR.
@jimchamp bumping the priority of this because it would be very helpful for me and the PR has been open a few weeks.
Drini is currently in Albania, and it may take some time before we can follow up, @RayBB.