zim-tools
zimwriterfs: Weird segfault / error due to <meta http-equiv="refresh" content="0; URL=https://example.com" />
Reproduce:
mkdir test
cd test
echo "" > index.html
wget -p covid.cdc.gov
cd ..
zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./test test.zim
zimwriterfs: Target path doesn't exists
However, when the same covid.cdc.gov directory was part of a bigger zim file it just caused a segfault. After removing the directory, it worked as normal. I used gdb to isolate the problem to the file covid.cdc.gov/index.html. It contains nothing but this:
<!doctype html>
<html>
<head>
<title></title>
<meta http-equiv="refresh" content="0; URL=https://covid.cdc.gov/covid-data-tracker/#datatracker-home" />
</head>
<body>
</body>
</html>
@ballerburg9005 Thanks for your bug report, but this bug is complicated to reproduce considering the size of the website/directory. Do you have a simpler reproduction case?
Actually this is not a recursive download. It literally results in just this directory and index.html to be created and exactly this produces the error.
I have just created the same "covid.cdc.gov/index.html" directory+file inside other directories from other websites for zim files. Every one of them now produces a segfault due to this file being present. So it is not just a fluke with the one website where I initially noticed the problem. They are all over 100 MB.
@ballerburg9005 So far I'm not able to reproduce the bug. Please upload here the ZIP with the exact content of the test directory... out of the box it cannot work; for example, nothing creates zim_favicon.png.
convert -size 48x48 xc:white test/zim_favicon.png
There is nothing more to it. I also used the latest zim-tools-git.
I still cannot reproduce the segfault, see:
$ unzip test.zip
Archive: test.zip
creating: test/
creating: test/covid.cdc.gov/
inflating: test/covid.cdc.gov/index.html
inflating: test/index.html
inflating: test/zim_favicon.png
$ zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./test test.zim
zimwriterfs: 'covid.cdc.gov/index.html' HTML redirection target path './test/covid.cdc.gov/https:/covid.cdc.gov/covid-data-tracker/#datatracker-home' doesn't exists.
$ zimwriterfs --version
zim-tools 3.1.1
libzim 7.2.1
+ libzstd 1.4.4
+ liblzma 5.2.4
+ libxapian 1.4.18
+ libicu 66.1.0
With the latest zimwriterfs release (done a few days ago), you should not get a segfault, but an error and a managed stop (though not with the exact message shown in my log).
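The odd target path in the log above ("https:/" with a single slash, appended to the local directory) is what one would expect if the refresh URL were treated as a path relative to the HTML file. The following Python sketch only illustrates that effect; it is an assumption about the behaviour, not zimwriterfs's actual code, and `naive_redirect_target` is a hypothetical helper name.

```python
import posixpath


def naive_redirect_target(root: str, html_path: str, refresh_url: str) -> str:
    """Treat refresh_url as a path relative to the HTML file's directory.

    Hypothetical helper: it mimics the path seen in the error message,
    it is not taken from the zim-tools source.
    """
    base_dir = posixpath.dirname(posixpath.join(root, html_path))
    # normpath collapses the "//" after "https:" into a single "/"
    # and drops the leading "./".
    return posixpath.normpath(posixpath.join(base_dir, refresh_url))


target = naive_redirect_target(
    "./test",
    "covid.cdc.gov/index.html",
    "https://covid.cdc.gov/covid-data-tracker/#datatracker-home",
)
print(target)
```

Apart from the leading "./", the printed path has the same shape as the one in the error message, which suggests the absolute URL is being looked up as a local file.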
Yes, this is exactly how I described the bug, except that your error message is more verbose than mine (Target path doesn't exists).
In order for the segfault to happen the covid.cdc.gov directory needs to be part of some larger (+100MB?) directory from any other website. Try my blog for example:
wget -rp --wait=1 --tries=1 http://ballerburg.us.to
cd ballerburg.us.to
wget -p covid.cdc.gov
convert -size 48x48 xc:white zim_favicon.png
cd ..
zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
This is not a segfault. This is a normal error. You have an HTML redirection to an external resource. This is wrong. It should be a redirection to a local resource.
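To find such files before running zimwriterfs, a directory can be scanned for meta refresh tags whose URL points outside the local tree. This is a minimal sketch using only the Python standard library; the names `RefreshFinder`, `is_external` and `find_external_refreshes` are hypothetical, and treating any URL with a scheme or host as "external" is an assumption, not a rule taken from zim-tools.

```python
import os
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class RefreshFinder(HTMLParser):
    """Collect the URL of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "meta" or (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; URL=https://example.com".
        match = re.search(r"url\s*=\s*(.+)", attrs.get("content") or "", re.I)
        if match:
            self.urls.append(match.group(1).strip().strip("'\""))


def is_external(url):
    """Assumption: a URL with a scheme or a host is not a local resource."""
    parts = urlsplit(url)
    return bool(parts.scheme or parts.netloc)


def find_external_refreshes(root):
    """Return (file path, URL) pairs for external meta refresh redirects."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                finder = RefreshFinder()
                finder.feed(fh.read())
            hits += [(path, url) for url in finder.urls if is_external(url)]
    return hits
```

Calling `find_external_refreshes("./test")` on the directory from this report should list covid.cdc.gov/index.html together with its redirect URL, so the file can be fixed or removed before building the ZIM file.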
Please read again what I said.
- if you make a zim from just the covid.cdc.gov directory it produces a strange error
- if the covid.cdc.gov directory is part of a larger website's directory, it produces a segfault
% zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
[1] 1151879 segmentation fault zimwriterfs --welcome="index.html" --illustration=zim_favicon.png "asdf"
This is a segfault. It happens if you follow my example and bug description.
Also I do not see how it is desirable to have zimwriterfs fail, just because there is some broken meta tag in some html file. When archiving websites, it just cannot be avoided that there are thousands and tens of thousands of broken references present and most of them really don't matter. If some of them caused failures in zimwriterfs, that's kind of broken. The default should be to ignore them.
Please provide the reproduction steps. Since these are two different problems, please open separate tickets.
As for whether zimwriterfs should stop or not, I have no strong opinion...
https://github.com/openzim/zim-tools/issues/300#issuecomment-1120474425
In order for the segfault to happen the covid.cdc.gov directory needs to be part of some larger (+100MB?) directory from any other website. Try my blog for example:
wget -rp --wait=1 --tries=1 http://ballerburg.us.to
cd ballerburg.us.to
wget -p covid.cdc.gov
convert -size 48x48 xc:white zim_favicon.png
cd ..
zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
I cannot reproduce the segfault. I've followed the reproduction steps you provided (wget ballerburg.us.to and covid.cdc.gov) and got no segfault:
With the latest release (http://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.1.1-1.tar.gz):
./zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
zimwriterfs: Target path doesn't exists
With the latest nightly (http://download.openzim.org/nightly/2022-08-10/zim-tools_linux-x86_64-2022-08-10.tar.gz):
./zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
zimwriterfs: 'covid.cdc.gov/index.html' HTML redirection target path './ballerburg.us.to/covid.cdc.gov/https:/covid.cdc.gov/covid-data-tracker/#datatracker-home' doesn't exist.
No valid reproduction steps.