zim-tools
zimdump - Bypassing excessively long filenames.
Hello,
The filenames in a ZIM file I'm trying to recover are too long. Is it possible to add a flag or similar to bypass this or truncate them?
The error is "Exception: Error writing file to errors dir. " with an excessively long file name following.
Thanks
It should already be the case: we truncate filenames longer than 255 characters. What is the file name you want to extract? Which command are you using?
I'm going to replace the site details with an equal number of x's for privacy; I hope that's okay.
ZIMtools Version: 3.1.1-2
Command /home/localadmin/Downloads/zim-tools_linux-x86_64-3.1.1-2/zimdump dump --dir='/mnt/2TB' ./xxxxxxxxxxxxx.com-May2022_2022-05.zim
Filename
Wrote /mnt/2TB/H/xxxxxxxxxxxxx.com/category/xxxxxxxxxxxxxxxxx/xxxxxxxxxxxxxxx/page/43/ to /mnt/2TB/_exceptions/H%2fxxxxxxxxxxxxx.com%2fxxxxxxxx%2fxxxxxxxxxxxxxxxxx%2fxxxxxxxxxxxxxxx%2fpage%2f43%2f
Wrote /mnt/2TB/H/www.bitchute.com/embed/xxxxxxxxxxx/view/ to /mnt/2TB/_exceptions/H%2fwww.bitchute.com%2fembed%xxxxxxxxxxx%2fview%2f
Error writing file to errors dir. /mnt/2TB/_exceptions/A%2fhtml5-player.libsyn.com%2fembed%2fepisode%2xxx%2xxxxxxxx%2fheight%2f90%2fwidth%2f640%2ftheme%2fcustom%2fautonext%2fno%2fthumbnail%2fyes%2fautoplay%2fno%2fpreload%2fno%2fno_addthis%2fno%2fdirection%2fbackward%2frender-playlist%2fno%2fcustom-color%2f0009aa%2f
Exception: Error writing file to errors dir. /mnt/2TB/_exceptions/A%2fhtml5-player.libsyn.com%2fembed%2fepisode%2xxx%2xxxxxxxx%2fheight%2f90%2fwidth%2f640%2ftheme%2fcustom%2fautonext%2fno%2fthumbnail%2fyes%2fautoplay%2fno%2fpreload%2fno%2fno_addthis%2fno%2fdirection%2fbackward%2frender-playlist%2fno%2fcustom-color%2f0009aa%2f
{end of program output}
This is 262 characters which is over the 255 limit for file names.
What is failing is the writing of the errored file.
zimdump tries to write the file in your output directory (/mnt/2TB). If that fails for some reason, it writes the file in the exception directory (/mnt/2TB/_exceptions) instead. When doing so, it replaces every / with %2f (so there are no subdirectories) and it doesn't try to truncate the filename.
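To make the mechanism concrete, here is a minimal Python sketch of the flattening described above. It is not zimdump's code (zimdump is written in C++); the function names and the `NAME_MAX` constant are assumptions for illustration, with 255 bytes being the usual per-name limit on common Linux filesystems. It shows how a path whose individual directory names are all short can still become one over-long file name once every / is replaced.

```python
NAME_MAX = 255  # assumed per-component limit in bytes (typical on Linux filesystems)


def flatten_for_exceptions(entry_path: str) -> str:
    """Replace every '/' with '%2f' so the entry becomes one flat file name."""
    return entry_path.replace("/", "%2f")


def fits_in_one_name(name: str) -> bool:
    """True if the name can be used as a single path component."""
    return len(name.encode("utf-8")) <= NAME_MAX


# Hypothetical entry path: 40 short directory levels, each well under the limit.
entry = "A/" + "/".join(["segment"] * 40)
flat = flatten_for_exceptions(entry)

print(all(fits_in_one_name(part) for part in entry.split("/")))  # every level fits
print(len(flat), fits_in_one_name(flat))                         # the flat name may not
```

Each / grows by two characters when it becomes %2f, so deep paths are the ones most likely to cross the limit in the exception directory.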
The question is why it fails to write the file in the first place. (Sadly, zimdump doesn't report the error information.)
- What is the content of mnt/2TB/A/html5-player.libsyn.com/embed/episode/xx/xxxxxxx/height/90/width/640/theme/custom/autonext/no/thumbnail/yes/autoplay/no/preload/no/no_addthis/no/direction/backward/render-playlist/no/custom-color/0009aa/?
- Is your directory full? Or with some quota?
- ...
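Since zimdump does not report the reason, questions like the ones above can be checked by hand. The following is a rough Python sketch of such checks, not part of zim-tools; `find_write_obstacles` and `NAME_MAX` are hypothetical names. One observation from the log: the three reported paths all end with a /, which may or may not be related to the failure, so the sketch checks for that too.

```python
import os
import shutil
from typing import List

NAME_MAX = 255  # assumed per-component limit in bytes


def find_write_obstacles(out_dir: str, entry_path: str) -> List[str]:
    """Return human-readable reasons why writing entry_path under out_dir may fail."""
    reasons = []
    if entry_path.endswith("/"):
        reasons.append("path ends with '/', so it names a directory, not a file")
    parts = [p for p in entry_path.split("/") if p]
    for part in parts:
        if len(part.encode("utf-8")) > NAME_MAX:
            reasons.append(f"component longer than {NAME_MAX} bytes: {part[:40]}...")
    # An ancestor that already exists as a regular file blocks directory creation.
    current = out_dir
    for part in parts[:-1]:
        current = os.path.join(current, part)
        if os.path.isfile(current):
            reasons.append(f"ancestor exists as a regular file: {current}")
            break
    if parts and os.path.isdir(os.path.join(out_dir, *parts)):
        reasons.append("target already exists as a directory")
    if shutil.disk_usage(out_dir).free == 0:
        reasons.append("no free space left in the output directory")
    return reasons


# Hypothetical usage against the current directory.
print(find_write_obstacles(".", "A/example.com/page/43/"))
```

An empty list only means none of these particular checks fired; quotas and filesystem-specific restrictions are not covered.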
Hello sir,
Firstly, thank you for your time.
"What is the content of mnt/2T" The content is likely related to an MP3 player applet. Yes, I know this is not supported, and I'm sorry to use this outside its scope.
"Is your directory full? Or with some quota?" That was the first thing I checked: no sir, almost a TB is free. I set the directory permissions with 'chmod -R ./* 777', so it's not permissions.
My proposed solution: Provide a switch to bypass errors and just continue extracting. This would likely be simple and easy, in my limited knowledge.
Thank you.
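For illustration, the "bypass errors and continue" behaviour proposed above could look roughly like the Python sketch below. This is not zimdump code and does not describe an existing flag; `dump_all`, `write_entry` and `fake_writer` are hypothetical names. The point is that each failure is recorded with its reason instead of aborting the whole extraction.

```python
from typing import Callable, Iterable, List, Tuple


def dump_all(
    paths: Iterable[str],
    write_entry: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Write every entry; collect failures instead of stopping at the first one."""
    failures = []
    for path in paths:
        try:
            write_entry(path)
        except OSError as err:
            # Keep the reason so the user can see why each entry failed.
            failures.append((path, str(err)))
    return failures


def fake_writer(path: str) -> None:
    # Stand-in for real file writing: rejects paths over 255 bytes.
    if len(path.encode("utf-8")) > 255:
        raise OSError("File name too long")


failed = dump_all(["A/index.html", "A/" + "x" * 300, "A/about.html"], fake_writer)
for path, reason in failed:
    print(reason, path[:30])
```

Reporting the collected reasons at the end would also address the missing error information mentioned earlier in this thread.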
I agree with your proposal. But it would be even better to also know why it fails. If you can provide me the ZIM file (even privately if you don't want to publish it), I could investigate the root cause and fix it.
No way to build anyt
"My proposed solution: Provide a switch to bypass errors and just continue extracting. This would likely be simple and easy, in my limited knowledge."
We should first 100% understand the root cause.
I'm also having this issue and can provide a ZIM file for tests. It is around 3 GB in size, though, and I'm not sure where to upload it.
You can use any file-sharing service, for example wetransfer.com or file.io. They are limited to 2 GB files, but you can split the ZIM file.
I will try to limit the crawl scope so the file is less than 2 GB in size, as soon as possible.
I made some simple modifications to ignore errors: https://github.com/openzim/zim-tools/pull/375
It now creates a more complete extraction. This could be incorporated into improved feedback to the user, options to ignore the errors, or hash the invalid names...
https://github.com/openzim/zim-tools/issues/373 In my case, invalid characters and long names cause the dump to error out and stop; with my modifications it keeps going and ignores the invalid and long names.
For example:
❯ ./zim-tools/build/src/zimdump dump --dir=./dump3 archive.zim
Error writing file to errors dir. ./dump3/_exceptions/H%2fplay.google.com%2flog?format=json&hasfast=true&authuser=0&__wb_method=POST&[[1,null,null,null,null,null,null,null,null,null,[null,null,null,null,"en",null,"17",null,null,[1,0,0,0,0]]],1654,[["1696854400954",null,[],null,null,null,null,"[[[\"%2fclient_streamz%2fpo%2fw%2fel\",null,[\"en\",\"rk\"],[[[[\"c\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1]],[[[\"c\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,0]],[[[\"q\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,0.8000030517578125]],[[[\"S\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,3.5999984741210938]],[[[\"b\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,199.0999984741211]],[[[\"i\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1.5]],[[[\"r\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,3440.6000061035156]],[[[\"C\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,2.4000015258789062]],[[[\"x\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,0]],[[[\"m\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,3.100006103515625]]],null,[]],[\"%2fclient_streamz%2fpo%2fw%2frl\",null,[\"mn\",\"ac\",\"sc\",\"rk\"],[[[[\"c\"],[null,1],[null,0],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1887.900001525879]],[[[\"g\"],[null,1],[null,0],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1331.400001525879]]],null,[]],[\"%2fclient_streamz%2fpo%2fw%2fcsc\",null,[\"cs\",\"rk\"],[[[[null,3],[\"O43z0dpjhgX20SCx4KAo\"]],[1]]],null,[]]]]",null,null,null,null,null,null,0,[null,[],null,"[[],[],[],[]]"],null,null,null,[],1,null,null,null,null,null,[]]],"1696854400955",[]]
Error writing file to errors dir. ./dump3/_exceptions/A%2fplay.google.com%2flog?format=json&hasfast=true&authuser=0&__wb_method=POST&[[1,null,null,null,null,null,null,null,null,null,[null,null,null,null,"en",null,"17",null,null,[1,0,0,0,0]]],1654,[["1696854400954",null,[],null,null,null,null,"[[[\"%2fclient_streamz%2fpo%2fw%2fel\",null,[\"en\",\"rk\"],[[[[\"c\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1]],[[[\"c\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,0]],[[[\"q\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,0.8000030517578125]],[[[\"S\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,3.5999984741210938]],[[[\"b\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,199.0999984741211]],[[[\"i\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1.5]],[[[\"r\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,3440.6000061035156]],[[[\"C\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,2.4000015258789062]],[[[\"x\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,0]],[[[\"m\"],[\"O43z0dpjhgX20SCx4KAo\"]],[null,3.100006103515625]]],null,[]],[\"%2fclient_streamz%2fpo%2fw%2frl\",null,[\"mn\",\"ac\",\"sc\",\"rk\"],[[[[\"c\"],[null,1],[null,0],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1887.900001525879]],[[[\"g\"],[null,1],[null,0],[\"O43z0dpjhgX20SCx4KAo\"]],[null,1331.400001525879]]],null,[]],[\"%2fclient_streamz%2fpo%2fw%2fcsc\",null,[\"cs\",\"rk\"],[[[[null,3],[\"O43z0dpjhgX20SCx4KAo\"]],[1]]],null,[]]]]",null,null,null,null,null,null,0,[null,[],null,"[[],[],[],[]]"],null,null,null,[],1,null,null,null,null,null,[]]],"1696854400955",[]]
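As a rough illustration of the "hash the invalid names" idea mentioned above, here is a minimal Python sketch. It is not what the linked pull request does; `shorten_name` and the 255-byte `NAME_MAX` are assumptions. A name that fits is kept as-is, and a name that is too long is cut to a prefix and given a short hash of the full name, so that two long names sharing the same prefix still end up in different files.

```python
import hashlib

NAME_MAX = 255  # assumed per-name limit in bytes


def shorten_name(name: str, limit: int = NAME_MAX) -> str:
    """Return name unchanged if it fits, else a truncated prefix plus a hash suffix."""
    raw = name.encode("utf-8")
    if len(raw) <= limit:
        return name
    suffix = "~" + hashlib.sha256(raw).hexdigest()[:16]
    keep = limit - len(suffix)
    # Cut on a byte budget, dropping any partial multi-byte character at the end.
    prefix = raw[:keep].decode("utf-8", errors="ignore")
    return prefix + suffix


# Hypothetical over-long names that differ only near the end.
long_a = "H%2fplay.google.com%2flog?format=json&" + "x" * 1000 + "a"
long_b = "H%2fplay.google.com%2flog?format=json&" + "x" * 1000 + "b"

print(len(shorten_name(long_a).encode("utf-8")))
print(shorten_name(long_a) != shorten_name(long_b))
```

This only deals with length. Characters that are invalid in file names on a given filesystem would need separate handling.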
Sample file for tests: https://www.swisstransfer.com/d/306ae305-8b17-455f-862f-13c15ca93121
This is basically a duplicate of #213