
Seeing empty / blank screens when trying to record a URL

Open jlubeck opened this issue 4 years ago • 3 comments

Describe the bug

Let's say I try to record the current Google page. I'm calling:

https://wayback.url.com/collection/record/https://www.google.com/

And I just see an empty screen. Is that expected? If I remove the record I also see a blank screen.

I do see logs on the server though:

2020-06-09 20:37:13,767: [INFO]: Checking Collection: collection
127.0.0.1 - - [2020-06-09 20:37:20] "GET /favicon.ico HTTP/1.1" 307 170 0.000317
2020-06-09 20:37:20,264: [DEBUG]: Starting new HTTP connection (1): localhost:33353
2020-06-09 20:37:20,266: [DEBUG]: Starting new HTTP connection (1): localhost:43203
2020-06-09 20:37:20,268: [DEBUG]: Resetting dropped connection: www.google.com
2020-06-09 20:37:20,335: [DEBUG]: https://www.google.com:443 "GET /favicon.ico HTTP/1.1" 200 1494
127.0.0.1 - - [2020-06-09 20:37:20] "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 2976 0.068667
2020-06-09 20:37:20,336: [DEBUG]: http://localhost:43203 "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 None
127.0.0.1 - - [2020-06-09 20:37:20] "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 2962 0.072407
2020-06-09 20:37:20,339: [DEBUG]: http://localhost:33353 "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 None
127.0.0.1 - - [2020-06-09 20:37:20] "GET /collection/record/https:/www.google.com/favicon.ico HTTP/1.1" 200 2122 0.079015
2020-06-09 20:37:23,770: [INFO]: Checking Collection: collection
2020-06-09 20:37:23,770: [INFO]: Auto-Indexing... ['/home/ubuntu/collections/collection/archive/rec-20200609202949540414-ip-172-31-46-60.warc.gz']
2020-06-09 20:37:23,774: [INFO]: ...Done
2020-06-09 20:37:33,784: [INFO]: Checking Collection: collection

Steps to reproduce the bug

  1. Start server with wayback --record --live -a --auto-interval 10 --debug
  2. Go to https://wayback.url.com/collection/record/https://www.google.com/
  3. See blank page

Expected behavior

I expected to see the Google home page or some kind of feedback on the crawling.

Screenshots

Environment

  • OS: Ubuntu 18.04
  • Browser: Chrome
  • Version: 83

jlubeck, Jun 09 '20 20:06

Are you by any chance running pywb behind some sort of reverse proxy? That URL should load the Google home page, but it appears that the requests are not reaching www.google.com and are returning an empty page instead.

Also, something is suspiciously removing double slashes, as you have https:/www.google.com in the logs:

GET /collection/record/https:/www.google.com/favicon.ico
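
One way to check for this symptom is to scan the request paths in the log for a scheme followed by a single slash. The sketch below is only an illustration: `has_collapsed_scheme` is a hypothetical helper, not part of pywb, and the regex assumes the embedded URL uses http or https.

```python
import re

# Matches "http:/" or "https:/" that is NOT followed by a second slash,
# i.e. a scheme separator whose "//" has been collapsed to "/".
COLLAPSED_SCHEME = re.compile(r"https?:/(?!/)")


def has_collapsed_scheme(request_path: str) -> bool:
    """Return True if the path embeds a URL whose '//' was collapsed."""
    return COLLAPSED_SCHEME.search(request_path) is not None


if __name__ == "__main__":
    # Path as it appears in the server log (collapsed).
    print(has_collapsed_scheme("/collection/record/https:/www.google.com/favicon.ico"))
    # Path as originally requested in the browser (intact).
    print(has_collapsed_scheme("/collection/record/https://www.google.com/"))
```

If the path is intact in the browser but collapsed by the time it reaches pywb, something in between (such as a proxy) is likely rewriting it.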

ikreymer, Jun 09 '20 21:06

This is my Apache VirtualHost configuration, which I think answers your reverse proxy question:

<IfModule mod_ssl.c>
<VirtualHost *:443>
    ServerAdmin webmaster@localhost
    ServerName wayback.url.com
    ErrorLog ${APACHE_LOG_DIR}/wayback.url.com.error.log
    CustomLog ${APACHE_LOG_DIR}/wayback.url.com.access.log combined

    SSLEngine On
    SSLProxyEngine On
    SSLProxyVerify none
    SSLProxyCheckPeerCN off
    SSLProxyCheckPeerName off
    ProxyPreserveHost On
    ProxyPass / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/

    SSLCertificateFile /etc/letsencrypt/live/wayback.url.com/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/wayback.url.com/privkey.pem
    Include /etc/letsencrypt/options-ssl-apache.conf
</VirtualHost>
</IfModule>

jlubeck, Jun 09 '20 21:06

@jlubeck Add the line below to your VirtualHost config:

RequestHeader set "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}

Here is the documentation
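
To illustrate why the header matters: in the configuration above, Apache terminates TLS and forwards plain HTTP to localhost:8080, so the backend only sees `http` unless the proxy passes the original scheme along. A minimal sketch of how a WSGI application could pick up the forwarded scheme follows. `effective_scheme` is a hypothetical helper written for this illustration, not pywb's actual code.

```python
def effective_scheme(environ: dict) -> str:
    """Return the scheme the client used, as best the backend can tell.

    WSGI exposes the X-Forwarded-Proto request header under the key
    HTTP_X_FORWARDED_PROTO. Without it, only the scheme of the
    proxy-to-backend hop (wsgi.url_scheme) is known.
    """
    forwarded = environ.get("HTTP_X_FORWARDED_PROTO", "").strip().lower()
    if forwarded in ("http", "https"):
        return forwarded
    return environ.get("wsgi.url_scheme", "http")


if __name__ == "__main__":
    # Behind the proxy without the header: the backend assumes http.
    print(effective_scheme({"wsgi.url_scheme": "http"}))
    # With the header set by the proxy: the original https is recovered.
    print(effective_scheme({"wsgi.url_scheme": "http",
                            "HTTP_X_FORWARDED_PROTO": "https"}))
```

A backend that believes it is served over `http` may generate `http://` links or redirects on an `https://` page, and the browser can then block or fail to load them.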

sydoluciani, Nov 28 '20 06:11