zimdump fails on long URLs

Open rgaudin opened this issue 4 years ago • 7 comments

Here's the tail of the output of zimdump dump --dir /data/mqc somezim.zim:

Error writing file to errors dir. /data/mqc/_exceptions/H%2fs0.wp.com%2f_static%2f??-eJylUu1uwyAMfKERr9VaLT+mPQsEj9HwJTCL8vZzEk3NGi2KtD+Iw3fmfABDEl0MhIHAV5FcNTYUGFIXvSjeOhwfUNOV8gQrmXLR3IUxa6kLGBeVdMe4NHBp5Lr4Om8UK1PO9ljghpRk14sZbeg%2fXFMZKsyGKxmhba7NGVS1Tk8eZrnKMo9QaHT4%2fzb0if5Im1m1GkKOsZIw2erDTh5aZEk2mPKHfBXfNAGf+yRp~3
Exception: Error writing file to errors dir. /data/mqc/_exceptions/H%2fs0.wp.com%2f_static%2f??-eJylUu1uwyAMfKERr9VaLT+mPQsEj9HwJTCL8vZzEk3NGi2KtD+Iw3fmfABDEl0MhIHAV5FcNTYUGFIXvSjeOhwfUNOV8gQrmXLR3IUxa6kLGBeVdMe4NHBp5Lr4Om8UK1PO9ljghpRk14sZbeg%2fXFMZKsyGKxmhba7NGVS1Tk8eZrnKMo9QaHT4%2fzb0if5Im1m1GkKOsZIw2erDTh5aZEk2mPKHfBXfNAGf+yRp~3

Touching that file fails as well:

touch: /data/mqc/_exceptions/H%2fs0.wp.com%2f_static%2f??-eJylUu1uwyAMfKERr9VaLT+mPQsEj9HwJTCL8vZzEk3NGi2KtD+Iw3fmfABDEl0MhIHAV5FcNTYUGFIXvSjeOhwfUNOV8gQrmXLR3IUxa6kLGBeVdMe4NHBp5Lr4Om8UK1PO9ljghpRk14sZbeg%2fXFMZKsyGKxmhba7NGVS1Tk8eZrnKMo9QaHT4%2fzb0if5Im1m1GkKOsZIw2erDTh5aZEk2mPKHfBXfNAGf+yRp~3: File name too long

Very long URLs seem like a common use case, and I believe this calls for a design change in the way they are written to disk.

Might be related to #190

Note: this is zimdump 2.1.0

rgaudin avatar Jan 13 '21 09:01 rgaudin

I propose to:

  • Truncate exception file/directory names if they are too long for the filesystem.
  • Introduce an exceptions.log file in JSON to track them. For each exception, this file would give the original path, the exception path, and the reason for the exception (see #108).
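
To make the second point concrete, here is a minimal sketch of what such a log could look like, written as one JSON object per line. The field names, the ExceptionRecord type, and the helper functions are hypothetical; the proposal above only names the three pieces of information to record, and nothing here reflects an existing zimdump format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExceptionRecord:
    # Hypothetical field names; the proposal only lists these three items.
    original_path: str   # path of the entry inside the ZIM
    exception_path: str  # path actually written under the exceptions directory
    reason: str          # why the entry could not be written at its own path


def append_record(log_path: str, record: ExceptionRecord) -> None:
    """Append one record to the log as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def read_records(log_path: str) -> list[ExceptionRecord]:
    """Read the whole log back, skipping blank lines."""
    with open(log_path, encoding="utf-8") as log:
        return [ExceptionRecord(**json.loads(line)) for line in log if line.strip()]
```

Appending one line per exception means the log stays valid even if the dump is interrupted, which a single large JSON array would not guarantee.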

kelson42 avatar Mar 02 '21 06:03 kelson42

Truncate exception file/directory names if they are too long for the filesystem.

Care must be taken when truncating directories. We may have 3 entries:

  • "long_directory_xxxxxxxxxxyyyyyy/foo.html"
  • "long_directory_xxxxxxxxxxyyyyyy/bar.html"
  • "long_directory_xxxxxxxxxxzzzzzz/foo.html"

The first two entries share a directory, so that directory must be truncated to the same "short name" ("long_directory~1") both times, but the directory of the third entry must get a different one ("long_directory~2").
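
A rough sketch of that requirement, assuming a simple in-memory mapping kept for the duration of one dump. ShortNameAllocator and its parameters are hypothetical names used for illustration only, not anything that exists in zimdump. It also does not guard against a short name colliding with a real entry that happens to be called "long_directory~1" already.

```python
class ShortNameAllocator:
    """Map over-long path components to stable, distinct short names."""

    def __init__(self, max_bytes: int = 255, prefix_bytes: int = 240):
        if prefix_bytes + 2 > max_bytes:
            raise ValueError("prefix_bytes must leave room for a '~N' suffix")
        self.max_bytes = max_bytes
        self.prefix_bytes = prefix_bytes
        self._assigned: dict[str, str] = {}  # long name -> short name
        self._counters: dict[str, int] = {}  # prefix -> last number handed out

    def shorten(self, name: str) -> str:
        raw = name.encode("utf-8")
        if len(raw) <= self.max_bytes:
            return name  # short enough, keep as-is
        if name in self._assigned:
            return self._assigned[name]  # same long name, same short name
        # Cut on a byte budget, dropping any partially cut UTF-8 character.
        prefix = raw[: self.prefix_bytes].decode("utf-8", errors="ignore")
        number = self._counters.get(prefix, 0) + 1
        self._counters[prefix] = number
        short = f"{prefix}~{number}"
        if len(short.encode("utf-8")) > self.max_bytes:
            raise ValueError("suffix no longer fits; lower prefix_bytes")
        self._assigned[name] = short
        return short

    def shorten_path(self, path: str) -> str:
        """Shorten every '/'-separated component independently."""
        return "/".join(self.shorten(part) for part in path.split("/"))
```

With max_bytes=20 and prefix_bytes=14 (artificially small limits chosen to match the example), the three entries above map to "long_directory~1/foo.html", "long_directory~1/bar.html" and "long_directory~2/foo.html".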

mgautierfr avatar Mar 02 '21 09:03 mgautierfr

Is there an example zim file available that has this problem?

adamlamar avatar Dec 21 '22 00:12 adamlamar

Is there an example zim file available that has this problem?

Please go ahead with this one that is 3GB: https://www.transfernow.net/dl/20231009UhHnE3Sy

2600box avatar Oct 11 '23 09:10 2600box

At the same time, we should probably consider ignoring or replacing all characters that are not allowed, or are interpreted differently, on the target filesystem. This is causing many files not to be dumped.
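
As an illustration, a minimal sketch that percent-encodes the characters rejected by the Windows file APIs (< > : " / \ | ? *) and ASCII control characters, in the same spirit as the %2f already visible in the error output above. The function name and the choice of percent-encoding are assumptions for the example. Reserved device names (CON, NUL, ...) and trailing dots or spaces are not handled here.

```python
# Characters rejected by the Windows file APIs, plus ASCII control characters.
# POSIX filesystems only forbid "/" and NUL, so this is the stricter set.
FORBIDDEN = set('<>:"/\\|?*') | {chr(code) for code in range(32)}


def sanitize_component(name: str) -> str:
    """Percent-encode characters that are unsafe in a single path component.

    "%" itself is encoded too, so the original name can be recovered with
    a standard percent-decoder.
    """
    out = []
    for ch in name:
        if ch in FORBIDDEN or ch == "%":
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)
```

For instance, the "??" sequence in the failing path above would become "%3F%3F". Encoding can lengthen a name, so it would have to run before any length check or truncation.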

benoit74 avatar Apr 04 '24 07:04 benoit74

Building a very small ZIM with many "strange" ZIM paths is probably the way to go; this is quite easy to do with python-libzim or python-scraperlib. It would make testing the change on many filesystems much easier.
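
A possible starting list of such paths, generated with the standard library only. The selection and the strange_paths helper are hypothetical, and writing the entries into an actual ZIM with python-libzim or python-scraperlib is deliberately left out of the sketch.

```python
def strange_paths(name_max: int = 255) -> list[tuple[str, str]]:
    """Return (label, path) pairs that are likely to trip up a naive dump.

    name_max is the assumed per-component byte limit of the target
    filesystem (255 on most common Linux filesystems).
    """
    long_dir = "d" * name_max
    return [
        ("component one byte over the limit", "a" * (name_max + 1)),
        ("component exactly at the limit", "b" * name_max),
        ("short in characters, long in bytes", "é" * name_max),
        ("characters reserved on Windows", 'q?x=1&y="2"|3*.js'),
        ("control character", "tab\there.txt"),
        ("trailing dot and space", "dir. /file "),
        ("long directory, first variant", long_dir + "yyyyyy/foo.html"),
        ("long directory, second file", long_dir + "yyyyyy/bar.html"),
        ("long directory, second variant", long_dir + "zzzzzz/foo.html"),
    ]
```

The last three mirror the truncation case discussed earlier in this thread: two files under one over-long directory, and a third under a different directory sharing the same long prefix.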

benoit74 avatar Apr 04 '24 07:04 benoit74

Indeed, but I'd like to mention that filesystem limitations are all properly documented. The change should be designed with those limitations in mind, as testing on various filesystems is cumbersome.

rgaudin avatar Apr 04 '24 07:04 rgaudin