pywb
Resources returned extremely slowly for large collection.
Describe the bug
Resources are returned extremely slowly (~3 minutes) for a large collection (34 GB, >1M records). While the page is loading, exactly one CPU core on the server is pinned at 100% utilization.
Steps to reproduce the bug
Unfortunately, I'm not permitted to share the archive as it includes sensitive personal information.
Expected behavior
Resources are returned quickly.
Screenshots
Here's a py-spy flamegraph of `wayback` handling a single request initiated by `curl`: https://jswrenn.com/misc/pywb_573-profile.svg
Environment
- OS: Ubuntu 18.04
- HW: DigitalOcean VPS with 6 cores, 16GB of memory, SSD.
- pywb version: 2.4.1
I've updated the issue to include a flamegraph of `wayback` handling a single request initiated by `curl`. A substantial amount of time appears to be spent searching the index.
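(For context: pywb looks up captures in a sorted plain-text index, which allows binary search by byte offset rather than a linear scan. Here's a minimal, simplified sketch of that idea; this is illustrative, not pywb's actual implementation:)

```python
import io

def _line_at(f, pos):
    """Seek to pos and return the next complete line."""
    f.seek(pos)
    if pos > 0:
        f.readline()  # discard the (possibly partial) line containing pos
    return f.readline()

def binsearch_prefix(f, prefix):
    """Return the first line whose key starts with prefix, else None.

    Lines must be sorted by their key (the text before the first space),
    as in a .cdxj index. Bisecting on byte offsets makes a lookup cost
    O(log filesize) seeks instead of a full scan, which is why a sorted
    index stays fast even at millions of records.
    """
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        line = _line_at(f, mid)
        if not line or line.split(' ', 1)[0] >= prefix:
            hi = mid
        else:
            lo = mid + 1
    line = _line_at(f, lo)
    return line if line.startswith(prefix) else None
```

If a lookup instead shows up as a linear scan in the profile (as the flamegraph here suggests), something is defeating the binary search, e.g. an unsorted or partially written index file.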
Thanks for including this! What does the indexes directory look like? Are there multiple cdxj files in there, or a single index?
Multiple index files:
indexes/
├── [596M] autoindex.cdxj
├── [118M] autoindex.cdxj.tmp.20200629011258909273
├── [ 0] autoindex.cdxj.tmp.20200629152048161219
├── [ 0] autoindex.cdxj.tmp.20200630020638761945
└── [611M] index.cdxj
I deleted the existing indices and re-indexed the collection, but there was no improvement.
Hm, it seems like it should still work at that size in pywb, and it'll definitely work with a compressed index. If there's any way you can share the example privately, I can try to debug further. In the meantime, a couple of things you can try:
You can make a compressed index as explained here: https://github.com/ikreymer/webarchive-indexing#building-a-local-cluster.
It's a bit old (I'm trying to build new tools to generate the compressed index), but essentially you can run:
```
python build_local_zipnum.py -s 1 -l 300 ./zip/ ./cdx/path/to
```
and then copy the contents of `./zip/` into the indexes directory (and remove the uncompressed index). This requires Python 2.7 at the moment.
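(For context, a ZipNum-style compressed index stores the sorted index lines in gzip-compressed blocks, plus a small summary index recording the first key, offset, and length of each block; a lookup bisects the summary and decompresses only one block. The sketch below is illustrative only, assuming blocks of a fixed line count; it is not the on-disk format produced by `build_local_zipnum.py`:)

```python
import gzip
from bisect import bisect_right

def build_zipnum(lines, lines_per_block=300):
    """Pack sorted index lines into gzip blocks plus a summary index.

    Returns (blob, summary), where each summary row is
    (first_key, offset, length) for one compressed block.
    """
    blob, summary = bytearray(), []
    for i in range(0, len(lines), lines_per_block):
        block = lines[i:i + lines_per_block]
        data = gzip.compress(''.join(block).encode())
        summary.append((block[0].split(' ', 1)[0], len(blob), len(data)))
        blob += data
    return bytes(blob), summary

def lookup(blob, summary, key):
    """Find lines matching key, decompressing only the one relevant block."""
    keys = [k for k, _, _ in summary]
    i = max(bisect_right(keys, key) - 1, 0)
    _, off, length = summary[i]
    block = gzip.decompress(blob[off:off + length]).decode()
    return [l for l in block.splitlines() if l.split(' ', 1)[0] == key]
```

The payoff is that lookup cost is bounded by the (tiny) summary plus one block's decompression, independent of total index size.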
I'll let you know when there's an updated tool to create this compressed index.
Another option is to use OutbackCDX, which many folks have been using with pywb: https://github.com/nla/outbackcdx
> If there's any way you can share the example privately, I can try to debug further.

Would sharing the index be sufficient? (There's a bunch of FERPA-restricted and NDA-restricted material in the actual archive, so I'm not easily able to share that.)
Sure, I can see if I can find something, or at least compress it, and you can try the compressed version also. You can send me a link via email instead of attaching here..
Yes, that will help with debugging. I can also compress it, and then you can try out the compressed version too.
Thanks!!! I'll send that email momentarily.
Hello,
I am using pywb to serve http.mydomain.com/https:google.com. It is working fine, but the proxied websites take a long time to load.
Can anyone please help me make it faster?
Thank you