wget-2-zim wget: Cannot write to page (Is a directory).

Open ilyaigpetrov opened this issue 1 year ago • 1 comment

I've noticed that some index.html files were missing after scraping a site with your script. The problem seems to be that once wget has downloaded some ~binary~ files into a directory, an HTML page served at that directory's own path can't be saved as index.html, because wget tries to write the page to a file with the same name as the directory. See the example below.

I suggest adding the --trust-server-names option to the wget call, but I haven't had enough time to test it yet.

$ tree
example.com
├── index.html
└── main
    ├── index.html
    └── logo.png
$ cat example.com/index.html
<!DOCTYPE html>
<a href="./main/logo.png">MAIN LOGO</a>
<a href="./main">MAIN PAGE</a>
$ cd example.com && python3 -m http.server
$ wget -r http://localhost:8000
‘localhost:8000/index.html’ saved
‘localhost:8000/main/logo.png’ saved
Cannot write to ‘localhost:8000/main’ (Is a directory).
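
The collision itself can be reproduced without wget or a server, which may help confirm the cause. This is a minimal sketch, assuming a POSIX shell with `mktemp`. The `workdir` and `result` names are illustrative only, and the directory layout simply mirrors the session above. The commented-out wget call at the end is the untested suggestion from this report, not a verified fix.

```shell
# Recreate the state wget is in after it has saved main/logo.png:
# 'main' already exists as a directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/localhost:8000/main"
: > "$workdir/localhost:8000/main/logo.png"

# By default wget names a file after the original URL (/main), so the page
# is written to the path 'main', which is already a directory. The write fails.
if { echo '<!DOCTYPE html>' > "$workdir/localhost:8000/main"; } 2>/dev/null; then
    result="written"
else
    result="is a directory"
fi
echo "$result"

# Untested suggestion: let wget name the file after the redirect target
# (presumably /main/), so the page could be saved as main/index.html.
# wget -r --trust-server-names http://localhost:8000
```

Running the sketch prints `is a directory`, the same condition wget reports in the session above.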

example.com.zip

Oct 07 '23 00:10 ilyaigpetrov

Was able to reproduce with non-binary files too.

non-binary.zip

Feb 18 '24 05:02 ilyaigpetrov
