Most added files don't get downloaded
I'm trying to crawl a website with wget2 v2.1.0. According to the log, lots of files get "Added", but a "Downloading" thread starts for only a small percentage of them. Then the log ends with "Downloaded: ... files, ... bytes, ... redirects, ... errors", and all those numbers are much smaller than the number of "Added" URLs. Why does wget2 never start downloading most of the "Added" URLs? When I use the --debug flag on Windows, wget2 crashes after a few minutes with something like "wget2.exe has stopped working". Please make wget2 always log, even without --debug, how many URLs were "Added" and why they are not being downloaded, before the final log message.
I see your pain :| There could be many reasons why a URL isn't downloaded. To help you, I'd first like to reproduce the crash (so that you can use --debug). For that, I need the full command line that ends in the crash.
Didn't get further information.
There could be many reasons why an URL isn't downloaded.
Please let wget2 print those reasons.
Wget2 does, but only in debug mode (--debug) or if the progress bar is switched off.
Example:
$ wget2 --progress=none -r google.com
[0] Downloading 'http://google.com/robots.txt' ...
HTTP response 301 Moved Permanently [http://google.com/robots.txt]
Adding URL: https://www.google.com/robots.txt
URL 'https://www.google.com/robots.txt' not followed (no host-spanning requested)
[0] Downloading 'http://google.com' ...
HTTP response 301 Moved Permanently [http://google.com]
Adding URL: http://www.google.com/
URL 'http://www.google.com/' not followed (no host-spanning requested)
Downloaded: 0 files, 449 bytes, 2 redirects, 0 errors
So if you look for "not followed" in the output, you'll find the reason for each URL not followed. Use -o log.txt to send all the output into the file log.txt for later analysis.
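As a sketch of that later analysis (the log lines below are invented stand-ins for a real log.txt produced by -o log.txt), the skip reasons can be pulled out with grep:

```shell
# Hypothetical log excerpt standing in for a real 'wget2 ... -o log.txt' run
cat > log.txt <<'EOF'
Adding URL: https://www.google.com/robots.txt
URL 'https://www.google.com/robots.txt' not followed (no host-spanning requested)
Downloaded: 0 files, 449 bytes, 2 redirects, 0 errors
EOF

# Each matching line names a skipped URL and the reason it was skipped
grep 'not followed' log.txt
```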
Many URLs appear only once in the log, in their "Adding URL" line. There is no "not followed" message for those URLs, nor any other message about them.
wget2 adds them and never mentions them again. Later, the log says "Downloaded: ... files, ... bytes, ... redirects, ... errors", so wget2 apparently plans no further downloads, but it doesn't explain why the added URLs are never mentioned again.
Please let wget2 report what happened to the added URLs.
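The mismatch being described can be checked mechanically against a saved log: count the "Adding URL" lines against the "not followed" lines, and the difference is the number of URLs left unexplained. A minimal sketch (the log excerpt is invented for illustration):

```shell
# Invented log excerpt: two URLs added, only one of them explained
cat > log.txt <<'EOF'
Adding URL: http://www.google.com/
Adding URL: https://www.google.com/robots.txt
URL 'https://www.google.com/robots.txt' not followed (no host-spanning requested)
Downloaded: 0 files, 449 bytes, 2 redirects, 0 errors
EOF

added=$(grep -c '^Adding URL:' log.txt)
skipped=$(grep -c 'not followed' log.txt)
echo "added=$added skipped=$skipped unexplained=$((added - skipped))"
```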
This sounds like a bug. If a URL has been added and there is no further "not followed" notice, it should be downloaded. Ah, sometimes the output is in a weird order because of multi-threading. Just to make sure, can you test with --max-threads=1 and check again?
If the issue persists, can you give me a full command line? (Feel free to send it privately to me - my email is in the git commits or close to the bottom of man wget2.)