zimwriterfs: Weird segfault / error due to <meta http-equiv="refresh" content="0; URL=https://example.com" />

Open ballerburg9005 opened this issue 2 years ago • 11 comments

Reproduce:

mkdir test
cd test
echo "" > index.html
wget -p covid.cdc.gov
cd ..
zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./test test.zim

zimwriterfs: Target path doesn't exists

However, when the same covid.cdc.gov directory was part of a bigger zim file it just caused a segfault. After removing the directory, it worked as normal. I used gdb to isolate the problem to the file covid.cdc.gov/index.html. It contains nothing but this:

<!doctype html>
<html>
<head>
<title></title>
<meta http-equiv="refresh" content="0; URL=https://covid.cdc.gov/covid-data-tracker/#datatracker-home" />
</head>
<body>
</body>
</html>

ballerburg9005 avatar May 06 '22 20:05 ballerburg9005

@ballerburg9005 Thanks for your bug report, but this bug is complicated to reproduce considering the size of the website/directory. Do you have a simpler reproduction case?

kelson42 avatar May 07 '22 04:05 kelson42

Actually this is not a recursive download. It literally creates just this directory and index.html, and exactly this produces the error.

I have just created the same "covid.cdc.gov/index.html" directory+file inside other directories from other websites for zim files. Every one of them now produces a segfault due to this file being present. So it is not just a fluke with the one website where I initially noticed the problem. They are all over 100MB.

ballerburg9005 avatar May 07 '22 11:05 ballerburg9005

@ballerburg9005 So far I'm not able to reproduce the bug. Please upload here a ZIP with the exact content of the test directory. Out of the box it cannot work: for example, the steps say nothing about zim_favicon.png!

kelson42 avatar May 07 '22 19:05 kelson42

convert -size 48x48 xc:white test/zim_favicon.png

There is nothing more to it. I also used the latest zim-tools-git.

test.zip

ballerburg9005 avatar May 08 '22 01:05 ballerburg9005

I still cannot reproduce the segfault, see:

$ unzip test.zip 
Archive:  test.zip
   creating: test/
   creating: test/covid.cdc.gov/
  inflating: test/covid.cdc.gov/index.html  
  inflating: test/index.html         
  inflating: test/zim_favicon.png    
$ zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./test test.zim
zimwriterfs: 'covid.cdc.gov/index.html' HTML redirection target path './test/covid.cdc.gov/https:/covid.cdc.gov/covid-data-tracker/#datatracker-home' doesn't exists.
$ zimwriterfs --version
zim-tools 3.1.1

libzim 7.2.1
+ libzstd 1.4.4
+ liblzma 5.2.4
+ libxapian 1.4.18
+ libicu 66.1.0

With the latest zimwriterfs release (made a few days ago), you should get an error and a managed stop rather than a segfault (though not with the exact message shown in my log).

kelson42 avatar May 08 '22 06:05 kelson42

Yes, this is exactly how I described the bug, except that your error message is more verbose than mine (Target path doesn't exists).

In order for the segfault to happen the covid.cdc.gov directory needs to be part of some larger (+100MB?) directory from any other website. Try my blog for example:

wget -rp --wait=1 --tries=1 http://ballerburg.us.to
cd ballerburg.us.to
wget -p covid.cdc.gov
convert -size 48x48 xc:white zim_favicon.png
cd ..
zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim

ballerburg9005 avatar May 08 '22 19:05 ballerburg9005

This is not a segfault. This is a normal error. You have an HTML redirection to an external resource. This is wrong. It should be a redirection to a local resource.
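
For anyone who wants to locate such files before packaging, here is a minimal sketch. It assumes GNU grep and that the meta tag sits on a single line with http-equiv before content, as in the file quoted in the original report. The function name find_external_refresh is hypothetical and not part of zim-tools.

```shell
# Hypothetical helper, not part of zim-tools: list HTML files under a
# directory whose <meta http-equiv="refresh"> targets an http(s) URL.
# Limitation: a tag split across lines, or with content= placed before
# http-equiv=, is not detected by this simple line-based pattern.
find_external_refresh() {
    dir="$1"
    pattern="<meta[^>]*http-equiv=[\"']?refresh[^>]*url=[\"' ]*https?://"
    # -r recurse, -l print file names only, -i ignore case, -E extended regex
    grep -rliE --include='*.html' --include='*.htm' "$pattern" "$dir"
}
```

Calling find_external_refresh ./test on the test directory from this thread should list its covid.cdc.gov/index.html; those files can then be fixed or removed before running zimwriterfs.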

kelson42 avatar May 08 '22 19:05 kelson42

Please read again what I said.

  • if you make a zim from just the covid.cdc.gov directory it produces a strange error
  • if the covid.cdc.gov directory is part of a larger website's directory, it produces a segfault
% zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
[1]    1151879 segmentation fault  zimwriterfs --welcome="index.html" --illustration=zim_favicon.png    "asdf"  

This is a segfault. It happens if you follow my example and bug description.

Also I do not see how it is desirable to have zimwriterfs fail, just because there is some broken meta tag in some html file. When archiving websites, it just cannot be avoided that there are thousands and tens of thousands of broken references present and most of them really don't matter. If some of them caused failures in zimwriterfs, that's kind of broken. The default should be to ignore them.

ballerburg9005 avatar May 08 '22 20:05 ballerburg9005

Please provide the reproduction steps. Since these are two different problems, please open separate tickets.

Regarding whether zimwriterfs should stop or not, I have no strong opinion...

kelson42 avatar May 09 '22 04:05 kelson42

https://github.com/openzim/zim-tools/issues/300#issuecomment-1120474425

In order for the segfault to happen the covid.cdc.gov directory needs to be part of some larger (+100MB?) directory from any other website. Try my blog for example:

wget -rp --wait=1 --tries=1 http://ballerburg.us.to
cd ballerburg.us.to
wget -p covid.cdc.gov
convert -size 48x48 xc:white zim_favicon.png
cd ..
zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim

ballerburg9005 avatar May 09 '22 09:05 ballerburg9005

I cannot reproduce the segfault. I've followed the reproduction steps you've provided (wget ballerburg.us.to and covid.cdc.gov) and got no segfault:

With the latest release (http://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.1.1-1.tar.gz):

./zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
zimwriterfs: Target path doesn't exists

With the latest nightly (http://download.openzim.org/nightly/2022-08-10/zim-tools_linux-x86_64-2022-08-10.tar.gz):

./zimwriterfs --welcome="index.html" --illustration=zim_favicon.png --language=eng --title="asddfgsdff" --description "asdf" --creator="sdfsad" --publisher "asdfsdfa" ./ballerburg.us.to ballerburg.us.to.zim
zimwriterfs: 'covid.cdc.gov/index.html' HTML redirection target path './ballerburg.us.to/covid.cdc.gov/https:/covid.cdc.gov/covid-data-tracker/#datatracker-home' doesn't exist.
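
One possible reading of the odd "https:/" in that path, offered as a guess and not checked against the zim-tools source: the URL from the meta tag is treated as a relative path, appended to the directory of the HTML file, and repeated slashes are collapsed. A small sketch of that assumed behavior (join_and_normalize is a hypothetical name, not a zim-tools function):

```shell
# Hypothetical illustration only: join a base directory and a redirect
# target as if the target were a relative path, then collapse any run of
# slashes into a single slash.
join_and_normalize() {
    printf '%s/%s\n' "$1" "$2" | sed -E 's#/+#/#g'
}

join_and_normalize ./ballerburg.us.to/covid.cdc.gov \
    'https://covid.cdc.gov/covid-data-tracker/#datatracker-home'
```

Under that assumption the sketch yields the same path as in the nightly's error message, which would explain why an external URL is reported as a missing local file.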

mgautierfr avatar Aug 10 '22 13:08 mgautierfr

No valid reproduction steps.

kelson42 avatar Mar 23 '23 19:03 kelson42