
Skipping files with "known ETag"

Open user98765446 opened this issue 1 year ago • 6 comments

When attempting to download all HTML files and images from a website some of the files are being skipped.

wget2 --server-response --max-threads=20 --content-on-error -r -l0 -A.html,jpg,webp,png  https://example.com 2>&1 >> log.txt 

The log shows some of the files are not downloaded due to "known ETag"

"Not scanning 'https://example.com/page/' (known ETag)"

Is this an issue/limitation with wget2 or is there a way for me to download those files as well?

user98765446 avatar Apr 15 '24 19:04 user98765446

The ETag is a unique ID for the response content. If 2 different URLs have the same ETag, it means the response is identical. In that case, parsing or downloading the second URL would not add any value.
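
To illustrate, the skip amounts to something like this (a rough Python sketch of the idea, not wget2's actual code; the URLs and ETag values are made up):

```python
def crawl(urls, fetch):
    """Fetch each URL, skipping any whose ETag was already seen.

    `fetch` is a hypothetical callable returning (etag, body) for a URL.
    """
    seen_etags = set()
    downloaded = []
    for url in urls:
        etag, body = fetch(url)
        if etag is not None and etag in seen_etags:
            # Same ETag => the server says the content is identical,
            # so parsing/saving it again would add nothing.
            print(f"Not scanning '{url}' (known ETag)")
            continue
        if etag is not None:
            seen_etags.add(etag)
        downloaded.append((url, body))
    return downloaded


# Two URLs that our pretend server answers with the same ETag:
responses = {
    "https://example.com/page/": ('"abc123"', "<html>page</html>"),
    "https://example.com/page/index.html": ('"abc123"', "<html>page</html>"),
}
result = crawl(list(responses), responses.__getitem__)
# Only the first URL is kept; the second is skipped as a known ETag.
```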

But I see your point that sometimes, you just want to switch this behavior off. This is currently not possible.

rockdaboot avatar Apr 16 '24 17:04 rockdaboot

You can of course use the good old wget - it ignores ETags completely. I am curious if that makes any difference to the number of files you are interested in.

rockdaboot avatar Apr 16 '24 17:04 rockdaboot

> The ETag is a unique ID for the response content. If 2 different URLs have the same ETag, it means the response is identical. In that case, parsing or downloading the second URL would not add any value.
>
> But I see your point that sometimes, you just want to switch this behavior off. This is currently not possible.

That makes sense for skipping files that really are identical. What confuses me in this case is why it is happening at all: the pages are not duplicates, yet nothing is downloaded for those pages.

Could I ask, why would two pages share an ETag?

Is there some way I can view these ETags to see which pages have duplicate ETags?

user98765446 avatar Apr 16 '24 17:04 user98765446

Also, I updated my script to use wget2 over wget as wget2 is much faster.

user98765446 avatar Apr 16 '24 17:04 user98765446

> Also, I updated my script to use wget2 over wget as wget2 is much faster.

Sure. But maybe you can run your script once with wget, just to see whether the ETag handling makes a difference or not.

rockdaboot avatar Apr 20 '24 16:04 rockdaboot

> That makes sense for skipping files that really are identical. What confuses me in this case is why it is happening at all: the pages are not duplicates, yet nothing is downloaded for those pages.
>
> Could I ask, why would two pages share an ETag?

Let's assume that the server isn't buggy :)

A relatively common case where the same file is behind two different URLs is

  • www.example.com/directory/
  • www.example.com/directory/index.html

rockdaboot avatar Apr 20 '24 16:04 rockdaboot
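
You can see this situation in miniature without guessing about the real site: stand up a toy HTTP server that answers both URL forms with the same ETag, then read the header back. This is a self-contained Python sketch (the paths, ETag value, and port are invented for the demo; real servers compute ETags themselves):

```python
import http.server
import threading
import urllib.request


class Handler(http.server.BaseHTTPRequestHandler):
    """Toy server: '/directory/' and '/directory/index.html' serve the
    same document, so both responses carry the same ETag."""

    BODY = b"<html>same content</html>"

    def do_GET(self):
        if self.path in ("/directory/", "/directory/index.html"):
            self.send_response(200)
            self.send_header("ETag", '"abc123"')
            self.send_header("Content-Length", str(len(self.BODY)))
            self.end_headers()
            self.wfile.write(self.BODY)
        else:
            self.send_error(404)

    def log_message(self, *args):
        # Silence per-request logging for the demo.
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Fetch both URL forms and record the ETag each response carries.
etags = {}
for path in ("/directory/", "/directory/index.html"):
    with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}") as resp:
        etags[path] = resp.headers["ETag"]

print(etags)
server.shutdown()
```

Both paths report the same ETag, which is exactly the situation where a crawler like wget2 decides the second URL is a "known ETag" and skips it. With `--server-response` you already see these headers in the log, so you can grep for the `ETag:` lines to find which of your pages collide.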