pywb
Seeing empty / blank screens when trying to record a URL
Describe the bug
Say I try to record the current Google home page. I request:
https://wayback.url.com/collection/record/https://www.google.com/
I just see an empty screen. Is that expected? If I remove the record segment from the URL, I also see a blank screen.
I do see logs on the server though:
2020-06-09 20:37:13,767: [INFO]: Checking Collection: collection
127.0.0.1 - - [2020-06-09 20:37:20] "GET /favicon.ico HTTP/1.1" 307 170 0.000317
2020-06-09 20:37:20,264: [DEBUG]: Starting new HTTP connection (1): localhost:33353
2020-06-09 20:37:20,266: [DEBUG]: Starting new HTTP connection (1): localhost:43203
2020-06-09 20:37:20,268: [DEBUG]: Resetting dropped connection: www.google.com
2020-06-09 20:37:20,335: [DEBUG]: https://www.google.com:443 "GET /favicon.ico HTTP/1.1" 200 1494
127.0.0.1 - - [2020-06-09 20:37:20] "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 2976 0.068667
2020-06-09 20:37:20,336: [DEBUG]: http://localhost:43203 "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 None
127.0.0.1 - - [2020-06-09 20:37:20] "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 2962 0.072407
2020-06-09 20:37:20,339: [DEBUG]: http://localhost:33353 "POST /live/resource/postreq?param.recorder.coll=collection&url=https%3A%2F%2Fwww.google.com%2Ffavicon.ico&matchType=exact&closest=now HTTP/1.1" 200 None
127.0.0.1 - - [2020-06-09 20:37:20] "GET /collection/record/https:/www.google.com/favicon.ico HTTP/1.1" 200 2122 0.079015
2020-06-09 20:37:23,770: [INFO]: Checking Collection: collection
2020-06-09 20:37:23,770: [INFO]: Auto-Indexing... ['/home/ubuntu/collections/collection/archive/rec-20200609202949540414-ip-172-31-46-60.warc.gz']
2020-06-09 20:37:23,774: [INFO]: ...Done
2020-06-09 20:37:33,784: [INFO]: Checking Collection: collection
Steps to reproduce the bug
Start server with
wayback --record --live -a --auto-interval 10 --debug
Go to https://wayback.url.com/collection/record/https://www.google.com/
See blank page
Expected behavior
I expected to see the Google home page, or some kind of feedback on the crawling.
Screenshots
Environment
- OS: Ubuntu 18.04
- Browser: Chrome
- Version: 83
Are you by any chance running pywb behind some sort of reverse proxy? That URL should load the Google home page, but it appears that the requests are not hitting www.google.com and are returning an empty page instead.
Also, something is suspiciously removing double slashes: the logs show https:/www.google.com, with a single slash, in this request:
GET /collection/record/https:/www.google.com/favicon.ico
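The collapsed scheme noted above is easy to miss by eye in long access logs. As a minimal diagnostic sketch (the function names are hypothetical and not part of pywb), the following Python checks a request path for an embedded `http:/` or `https:/` that has lost its second slash:

```python
import re

# Matches an embedded "/http:/" or "/https:/" that is NOT followed by a
# second slash, i.e. a scheme whose "//" was collapsed somewhere upstream.
COLLAPSED_SCHEME = re.compile(r"/(https?):/(?!/)")


def has_collapsed_scheme(path: str) -> bool:
    """Return True if the path contains a scheme with a single slash."""
    return COLLAPSED_SCHEME.search(path) is not None


def restore_scheme(path: str) -> str:
    """Re-insert the missing slash after each collapsed scheme."""
    return COLLAPSED_SCHEME.sub(r"/\1://", path)


if __name__ == "__main__":
    logged = "/collection/record/https:/www.google.com/favicon.ico"
    print(has_collapsed_scheme(logged))
    print(restore_scheme(logged))
```

Running it against paths from the access log can confirm whether the slashes are already missing by the time the request reaches pywb, which points at a layer in front of it.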
This is my Apache VirtualHost configuration, which I think answers your reverse proxy question:
<IfModule mod_ssl.c>
<VirtualHost *:443>
    ServerAdmin webmaster@localhost
    ServerName wayback.url.com
    ErrorLog ${APACHE_LOG_DIR}/wayback.url.com.error.log
    CustomLog ${APACHE_LOG_DIR}/wayback.url.com.access.log combined
    SSLEngine On
    SSLProxyEngine On
    SSLProxyVerify none
    SSLProxyCheckPeerCN off
    SSLProxyCheckPeerName off
    ProxyPreserveHost On
    ProxyPass / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
    SSLCertificateFile /etc/letsencrypt/live/wayback.url.com/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/wayback.url.com/privkey.pem
    Include /etc/letsencrypt/options-ssl-apache.conf
</VirtualHost>
</IfModule>
@jlubeck Add the line below to your VirtualHost config (the RequestHeader directive requires mod_headers to be enabled):
RequestHeader set "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}
Here is the documentation
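For context on why this header matters: in the configuration above, Apache terminates TLS and forwards plain HTTP to localhost:8080, so the backend sees `http` as the request scheme unless the proxy says otherwise. The sketch below shows, under that assumption, how a generic WSGI application might pick the client-facing scheme from `X-Forwarded-Proto`. It is an illustration only, with hypothetical function names, and not pywb's actual implementation.

```python
def effective_scheme(environ: dict) -> str:
    """Return the scheme the client used, preferring X-Forwarded-Proto."""
    forwarded = environ.get("HTTP_X_FORWARDED_PROTO", "")
    # A chain of proxies may produce a comma-separated list; take the first
    # entry, which corresponds to the outermost (client-facing) proxy.
    first = forwarded.split(",")[0].strip().lower()
    if first in ("http", "https"):
        return first
    # No usable header: fall back to what the WSGI server itself saw.
    return environ.get("wsgi.url_scheme", "http")


def host_prefix(environ: dict) -> str:
    """Build the scheme://host prefix used when generating absolute URLs."""
    return effective_scheme(environ) + "://" + environ.get("HTTP_HOST", "localhost")


if __name__ == "__main__":
    behind_proxy = {"wsgi.url_scheme": "http", "HTTP_HOST": "wayback.url.com"}
    print(host_prefix(behind_proxy))  # without the header
    behind_proxy["HTTP_X_FORWARDED_PROTO"] = "https"
    print(host_prefix(behind_proxy))  # with the header set by Apache
```

Without the header, URLs generated this way would use `http` even though the browser loaded the page over HTTPS. That mismatch is one plausible explanation for a page that renders blank behind a TLS-terminating proxy.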