Resources returned extremely slowly for large collection.

Open jswrenn opened this issue 4 years ago • 9 comments

Describe the bug

Resources are returned extremely slowly (~3 minutes per request) for a large collection (34 GB, more than 1 million records). While the page is loading, exactly one core of the server's CPU goes to 100% utilization.

Steps to reproduce the bug

Unfortunately, I'm not permitted to share the archive as it includes sensitive personal information.

Expected behavior

Resources are returned quickly.

Screenshots

Here's a pyspy flamegraph of wayback handling a single request initiated by curl: https://jswrenn.com/misc/pywb_573-profile.svg

Environment

  • OS: Ubuntu 18.04
  • HW: DigitalOcean VPS with 6 cores, 16GB of memory, SSD.
  • pywb version: 2.4.1

jswrenn avatar Jul 01 '20 00:07 jswrenn

I've updated the issue to include a flamegraph of wayback handling a single request, initiated by curl. A substantial amount of time appears to be spent searching the index.
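For context on what "searching the index" involves: a CDXJ index is a plain-text file of lines sorted by URL key, so a lookup can be done with a binary search over byte offsets instead of a full scan. The following is a minimal sketch of that technique, not pywb's actual code; the function name `find_prefix` and the key format in the usage note are hypothetical.

```python
import os


def find_prefix(path, prefix):
    """Return all lines of a byte-sorted text file that start with `prefix`.

    Illustrative sketch only: `path` must point to a file whose lines are
    sorted bytewise, and `prefix` must be a bytes object.
    """
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        # Binary search for the smallest offset whose following full line
        # is >= prefix (or is end-of-file).
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid > 0:
                f.readline()  # discard the (possibly partial) line at mid
            line = f.readline()
            if line and line < prefix:
                lo = mid + 1
            else:
                hi = mid
        # Position on the first candidate line, then scan forward while
        # lines still match the prefix.
        f.seek(lo)
        if lo > 0:
            f.readline()
        matches = []
        for line in f:
            if not line.startswith(prefix):
                break
            matches.append(line.decode().rstrip("\n"))
        return matches
```

A call such as `find_prefix("index.cdxj", b"com,example)/")` would touch only a logarithmic number of offsets before the final scan, which is why a lookup in a sorted index of this size would not normally be expected to take minutes.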

jswrenn avatar Jul 01 '20 16:07 jswrenn

Thanks for including this! What does the indexes directory look like? Are there multiple cdxj files in there, or a single index?

ikreymer avatar Jul 01 '20 17:07 ikreymer

Multiple index files:

indexes/
├── [596M]  autoindex.cdxj
├── [118M]  autoindex.cdxj.tmp.20200629011258909273
├── [   0]  autoindex.cdxj.tmp.20200629152048161219
├── [   0]  autoindex.cdxj.tmp.20200630020638761945
└── [611M]  index.cdxj
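
A listing like this can also be produced programmatically, which makes it easier to spot leftover temporary or empty files. This is a small sketch using only the standard library; the function name `summarize_indexes` and the ".tmp." naming check are assumptions based on the file names shown above, and nothing here establishes that those files are the cause of the slowdown.

```python
from pathlib import Path


def summarize_indexes(index_dir):
    """Return one dict per regular file in `index_dir`, sorted by name.

    Illustrative helper: flags files that look like leftover temp files
    (name contains ".tmp.") and files that are zero bytes long.
    """
    rows = []
    for path in sorted(Path(index_dir).iterdir()):
        if not path.is_file():
            continue
        size = path.stat().st_size
        rows.append({
            "name": path.name,
            "bytes": size,
            "is_temp": ".tmp." in path.name,
            "is_empty": size == 0,
        })
    return rows
```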

jswrenn avatar Jul 01 '20 22:07 jswrenn

I deleted the existing indices and re-indexed the collection, but there was no improvement.

jswrenn avatar Jul 02 '20 18:07 jswrenn

Hm, it seems like it should still work at that size in pywb, and it'll definitely work with a compressed index. If there's any way you can share the example privately, I can try to debug further, but here are a couple of things you can try:

You can make a compressed index as explained here: https://github.com/ikreymer/webarchive-indexing#building-a-local-cluster.

It's a bit old (I'm trying to build new tools to generate the compressed index), but essentially you can run: python build_local_zipnum.py -s 1 -l 300 ./zip/ ./cdx/path/to and then copy the contents of ./zip/ into the indexes directory (and remove the uncompressed index). This requires Python 2.7 at the moment. I'll let you know when there's an updated tool to create this compressed index.
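
To illustrate the general idea behind a compressed, block-based index: sorted index lines are grouped into fixed-size blocks, each block is gzip-compressed, and a small summary records the first line, offset, and length of each block so that a lookup only decompresses the blocks it needs. The sketch below shows that concept only; it is not the file format produced by build_local_zipnum.py, the names `build_blocks` and `lookup` are hypothetical, and the default of 300 lines per block merely echoes the command above without claiming that is what its flag means.

```python
import bisect
import gzip


def build_blocks(sorted_lines, lines_per_block=300):
    """Compress sorted lines into concatenated gzip blocks.

    Returns (data, summary), where summary holds one
    (first_line, offset, length) tuple per block.
    """
    data = bytearray()
    summary = []
    for i in range(0, len(sorted_lines), lines_per_block):
        chunk = sorted_lines[i:i + lines_per_block]
        blob = gzip.compress("".join(l + "\n" for l in chunk).encode())
        summary.append((chunk[0], len(data), len(blob)))
        data += blob
    return bytes(data), summary


def lookup(data, summary, prefix):
    """Return all lines starting with `prefix`, decompressing few blocks."""
    first_lines = [entry[0] for entry in summary]
    # Matches may begin in the block just before the first block whose
    # first line is >= prefix, so start one block earlier.
    start = max(bisect.bisect_left(first_lines, prefix) - 1, 0)
    hits = []
    for first, offset, length in summary[start:]:
        if first > prefix and not first.startswith(prefix):
            break  # this block and all later ones sort after the prefix
        block = gzip.decompress(data[offset:offset + length]).decode()
        hits.extend(l for l in block.splitlines() if l.startswith(prefix))
    return hits
```

The summary stays small enough to search in memory, which is the design trade-off: a little extra work at build time in exchange for less data to read and search per request.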

Another option is to use OutbackCDX, which many folks have been using with pywb: https://github.com/nla/outbackcdx

ikreymer avatar Jul 02 '20 20:07 ikreymer

If there's any way you can share the example privately, I can try to debug further..

Would sharing the index be sufficient? (There's a bunch of FERPA-restricted and NDA-restricted material in the actual archive, so I'm not easily able to share that.)

jswrenn avatar Jul 02 '20 20:07 jswrenn

Sure, I can see if I can find something, or at least compress it, and you can try the compressed version also. You can send me a link via email instead of attaching it here.

Yes, that will help with debugging. I can also compress it, and then you can try out the compressed version too.

ikreymer avatar Jul 02 '20 20:07 ikreymer

Thanks!!! I'll send that email momentarily.

jswrenn avatar Jul 02 '20 20:07 jswrenn

Hello,

I am using pywb to handle http.mydomain.com/https:google.com. It is working fine, but it takes a long time to load those websites.

Can anyone please help me to make it faster?

Thank you

wdcs-nikhilvibhani avatar Jul 15 '20 13:07 wdcs-nikhilvibhani