httrack2warc
Handle image errors renamed to .html
Requests for URLs with an image file extension (e.g. foo.gif) might return an HTML 404 error page. In this case HTTrack appears to write the error page to a file named foo.html, but still refers to it as foo.gif in the cache and in new.txt.
I've worked around this for now by allowing missing files to be skipped when their record has an HTTP error status code. Is there a way we can detect and handle this case properly? Maybe we can implement the same conditions HTTrack uses for renaming files and probe for the renamed file's existence.
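To make the "probe for existence" idea concrete, here is a rough sketch in Python (not httrack2warc code). The function name `resolve_saved_file` and its parameters are hypothetical. The rename condition is only a guess based on the behaviour described above (an error status or a `text/html` content type); HTTrack's real conditions would need to be checked against its source.

```python
from pathlib import Path
from typing import Optional


def resolve_saved_file(expected: Path, status: int, content_type: str) -> Optional[Path]:
    """Return the file that seems to hold the response body, or None if none is found.

    `expected` is the path recorded in the cache / new.txt (e.g. .../foo.gif).
    """
    # Normal case: the file exists under the recorded name.
    if expected.is_file():
        return expected

    # Assumption (not confirmed against HTTrack): the body was saved with an
    # .html extension because the response was an error or an HTML page.
    looks_renamed = status >= 400 or content_type.lower().startswith("text/html")
    if looks_renamed:
        candidate = expected.with_suffix(".html")
        if candidate.is_file():
            return candidate

    # Nothing found; the caller can fall back to skipping the record.
    return None
```

With this approach the existing skip-on-error workaround would remain as the fallback for the `None` case, so records are only dropped when neither the recorded name nor the probed name exists.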