
Most added files don't get downloaded

page200 opened this issue 1 year ago • 1 comment

I am trying to crawl a website with wget2 v2.1.0. According to the log, many files get "Added", but a "Downloading" thread starts for only a small percentage of them. The final log line then says "Downloaded: ... files, ... bytes, ... redirects, ... errors", and all those numbers are much smaller than the number of "Added" URLs. Why does wget2 never start downloading most of the "Added" URLs? When I use the --debug flag on Windows, wget2 crashes after a few minutes with something like "wget2.exe has stopped working". Could you please make wget2, even without --debug, log before the final message how many URLs were "Added" and why they are not being downloaded?

page200 avatar Feb 04 '24 14:02 page200

I see your pain :| There could be many reasons why a URL isn't downloaded. To help you, I'd like to reproduce the crash first (so you can use --debug). For that, I need the full command line that ends up in a crash.

rockdaboot avatar Feb 04 '24 17:02 rockdaboot

Didn't get further information.

rockdaboot avatar Apr 01 '24 16:04 rockdaboot

> There could be many reasons why a URL isn't downloaded.

Please let wget2 print those reasons.

page200 avatar Apr 01 '24 17:04 page200

> There could be many reasons why a URL isn't downloaded.
>
> Please let wget2 print those reasons.

Wget2 does, but only in debug mode (--debug) or when the progress bar is switched off. Example:

$ wget2 --progress=none -r google.com
[0] Downloading 'http://google.com/robots.txt' ...
HTTP response 301 Moved Permanently [http://google.com/robots.txt]
Adding URL: https://www.google.com/robots.txt
URL 'https://www.google.com/robots.txt' not followed (no host-spanning requested)
[0] Downloading 'http://google.com' ...
HTTP response 301 Moved Permanently [http://google.com]
Adding URL: http://www.google.com/
URL 'http://www.google.com/' not followed (no host-spanning requested)
Downloaded: 0 files, 449  bytes, 2 redirects, 0 errors

So if you look for "not followed" in the output, you'll find the reason for each URL that wasn't followed. Use -o log.txt to send all the output to the file log.txt for later analysis.
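As an untested illustration of that workflow, the snippet below writes a couple of log lines modeled on the session above into log.txt and then filters them; in a real run, log.txt would instead be produced by wget2 itself via -o log.txt.

```shell
# Fabricate a small sample log with lines modeled on the wget2 output above;
# in practice it would come from: wget2 --progress=none -r -o log.txt <url>
cat > log.txt <<'EOF'
Adding URL: https://www.google.com/robots.txt
URL 'https://www.google.com/robots.txt' not followed (no host-spanning requested)
Adding URL: http://www.google.com/
URL 'http://www.google.com/' not followed (no host-spanning requested)
EOF

# Each match shows one skipped URL together with wget2's stated reason.
grep 'not followed' log.txt
```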

rockdaboot avatar Apr 01 '24 17:04 rockdaboot

Many URLs appear only once in the log, in the "Adding URL" line. There is no "not followed" message about those URLs, nor any other message about them.

wget2 adds them and never mentions them again. Later, the log says "Downloaded: ... files, ... bytes, ... redirects, ... errors", so wget2 apparently plans no further downloads, but it doesn't explain why the added URLs are never mentioned again.

Please let wget2 report what happened to the added URLs.

page200 avatar Apr 01 '24 17:04 page200

This sounds like a bug. If a URL has been added and there is no further "not followed" notice, it should be downloaded. Ah, sometimes the output appears in a weird order because of multi-threading. Just to make sure, can you test with --max-threads=1 and check again?
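For reference, a single-threaded re-run might look like this. The --max-threads, --progress, -r, and -o options are taken from this thread, but https://example.com/ is a placeholder for the actual site, so the exact invocation is only a sketch:

```shell
# One worker thread keeps the "Adding URL" / "not followed" lines in order;
# -o log.txt captures everything for later inspection.
wget2 --progress=none --max-threads=1 -r -o log.txt https://example.com/
```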

If the issue persists, can you give me a full command line? (Feel free to send it to me privately; my email is in the git commits and near the bottom of man wget2.)

rockdaboot avatar Apr 01 '24 18:04 rockdaboot