wget2
Skipping files with "known ETag"
When attempting to download all HTML files and images from a website, some of the files are being skipped.
wget2 --server-response --max-threads=20 --content-on-error -r -l0 -A.html,jpg,webp,png https://example.com 2>&1 >> log.txt
The log shows that some of the files are not downloaded because of a "known ETag":
"Not scanning 'https://example.com/page/' (known ETag)"
Is this an issue/limitation with wget2 or is there a way for me to download those files as well?
The ETag is a unique ID for the response content. If two different URLs return the same ETag, the responses are identical, so parsing or downloading the second URL would not add any value.
But I see your point that sometimes, you just want to switch this behavior off. This is currently not possible.
You can of course use the good old wget - it ignores ETags completely.
I am curious if that makes any difference to the number of files you are interested in.
It makes sense to skip files that are the same. However, I'm confused about why it is happening in this case: the pages are not duplicates, yet nothing is downloaded for the skipped page.
Could I ask, why would two pages share an ETag?
Is there some way I can view these ETags to see which pages have duplicate ETags?
Also, I updated my script to use wget2 over wget as wget2 is much faster.
Sure. But maybe you can run your script just once with wget to see whether the ETag handling makes a difference or not.
Let's assume that the server isn't buggy :)
A relatively common case where the same file is behind two different URLs is:
- www.example.com/directory/
- www.example.com/directory/index.html
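To see which URLs actually share an ETag, one option is to check the headers outside of wget2. Below is a minimal sketch in Python, assuming the server answers a HEAD request with the same ETag it would send for a GET. The helper names `fetch_etag` and `shared_etags` are made up for illustration and are not part of wget2.

```python
from collections import defaultdict
from urllib.request import Request, urlopen


def fetch_etag(url, timeout=10):
    """Send a HEAD request and return the ETag header, or None if absent.

    Assumption: the server reports the same ETag for HEAD as for GET.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("ETag")


def shared_etags(url_etag_pairs):
    """Group URLs by ETag and keep only ETags seen on more than one URL."""
    groups = defaultdict(list)
    for url, etag in url_etag_pairs:
        if etag is not None:  # URLs without an ETag cannot collide
            groups[etag].append(url)
    return {etag: urls for etag, urls in groups.items() if len(urls) > 1}
```

Calling `shared_etags((u, fetch_etag(u)) for u in urls)` on the skipped URLs and their neighbours would list each ETag that more than one URL returns. If two pages with visibly different content end up in the same group, that would point at the server rather than at wget2.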