
Skipping files with "known ETag"

Open user98765446 opened this issue 1 year ago • 6 comments

When attempting to download all HTML files and images from a website some of the files are being skipped.

wget2 --server-response --max-threads=20 --content-on-error -r -l0 -A.html,jpg,webp,png  https://example.com 2>&1 >> log.txt 

The log shows some of the files are not downloaded due to "known ETag"

"Not scanning 'https://example.com/page/' (known ETag)"

Is this an issue/limitation with wget2 or is there a way for me to download those files as well?

user98765446 avatar Apr 15 '24 19:04 user98765446

The ETag is a unique ID for the response content. If 2 different URLs have the same ETag, it means the response is identical. In that case, parsing or downloading the second URL would not add any value.
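
To illustrate, the skip amounts to something like this (a rough Python sketch of the idea, not wget2's actual code; the URLs and ETag values are made up):

```python
def crawl(urls, fetch):
    """Fetch each URL, skipping any whose ETag was already seen.

    `fetch` is a hypothetical callable returning (etag, body) for a URL.
    """
    seen_etags = set()
    downloaded = []
    for url in urls:
        etag, body = fetch(url)
        if etag is not None and etag in seen_etags:
            # Same ETag => the server says the content is identical,
            # so parsing/saving it again would add nothing.
            print(f"Not scanning '{url}' (known ETag)")
            continue
        if etag is not None:
            seen_etags.add(etag)
        downloaded.append((url, body))
    return downloaded


# Two URLs that our pretend server answers with the same ETag:
responses = {
    "https://example.com/page/": ('"abc123"', "<html>page</html>"),
    "https://example.com/page/index.html": ('"abc123"', "<html>page</html>"),
}
result = crawl(list(responses), responses.__getitem__)
# Only the first URL is kept; the second is skipped as a known ETag.
```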

But I see your point that sometimes, you just want to switch this behavior off. This is currently not possible.

rockdaboot avatar Apr 16 '24 17:04 rockdaboot

You can of course use the good old wget - it ignores ETags completely. I am curious if that makes any difference to the number of files you are interested in.

rockdaboot avatar Apr 16 '24 17:04 rockdaboot

> The ETag is a unique ID for the response content. If 2 different URLs have the same ETag, it means the response is identical. In that case, parsing or downloading the second URL would not add any value.
>
> But I see your point that sometimes, you just want to switch this behavior off. This is currently not possible.

That makes sense for skipping files that really are identical. What confuses me in this case is why it is happening at all: the pages are not duplicates, yet nothing is downloaded for those pages.

Could I ask, why would two pages share an ETag?

Is there some way I can view these ETags to see which pages have duplicate ETags?

user98765446 avatar Apr 16 '24 17:04 user98765446

Also, I updated my script to use wget2 over wget as wget2 is much faster.

user98765446 avatar Apr 16 '24 17:04 user98765446

> Also, I updated my script to use wget2 over wget as wget2 is much faster.

Sure. But maybe you can run your script once with wget, just to see whether the ETag handling makes a difference or not.

rockdaboot avatar Apr 20 '24 16:04 rockdaboot

> That makes sense for skipping files that really are identical. What confuses me in this case is why it is happening at all: the pages are not duplicates, yet nothing is downloaded for those pages.
>
> Could I ask, why would two pages share an ETag?

Let's assume that the server isn't buggy :)

A relatively common case where the same file is behind two different URLs is

  • www.example.com/directory/
  • www.example.com/directory/index.html

rockdaboot avatar Apr 20 '24 16:04 rockdaboot
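
You can see this situation in miniature without guessing about the real site: stand up a toy HTTP server that answers both URL forms with the same ETag, then read the header back. This is a self-contained Python sketch (the paths, ETag value, and port are invented for the demo; real servers compute ETags themselves):

```python
import http.server
import threading
import urllib.request


class Handler(http.server.BaseHTTPRequestHandler):
    """Toy server: '/directory/' and '/directory/index.html' serve the
    same document, so both responses carry the same ETag."""

    BODY = b"<html>same content</html>"

    def do_GET(self):
        if self.path in ("/directory/", "/directory/index.html"):
            self.send_response(200)
            self.send_header("ETag", '"abc123"')
            self.send_header("Content-Length", str(len(self.BODY)))
            self.end_headers()
            self.wfile.write(self.BODY)
        else:
            self.send_error(404)

    def log_message(self, *args):
        # Silence per-request logging for the demo.
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Fetch both URL forms and record the ETag each response carries.
etags = {}
for path in ("/directory/", "/directory/index.html"):
    with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}") as resp:
        etags[path] = resp.headers["ETag"]

print(etags)
server.shutdown()
```

Both paths report the same ETag, which is exactly the situation where a crawler like wget2 decides the second URL is a "known ETag" and skips it. With `--server-response` you already see these headers in the log, so you can grep for the `ETag:` lines to find which of your pages collide.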