
[Suggestion] Big zip file handling

Open typhoon71 opened this issue 7 years ago • 6 comments

Since there are limits on the size of the zip file generated from the downloaded images, would it be possible to automatically do partial zips?

Say there are 500 images (around 600 MB); the plugin could get 100 images, zip them, save, do another 100, and so on.

The idea would be to have split archives, or just more than one archive with a counter appended to the name, so viewers would "find" the parts easily.
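
Just to illustrate, a minimal sketch of that fixed-count splitting might look like the following, assuming JSZip for packaging and a hypothetical `saveBlob` helper for triggering the download (the actual script may do this differently):

```js
// Hypothetical sketch: pack already-downloaded images into zips of 100 images each,
// with a counter in the file name so viewers can find each part easily.
// `downloadedImages` is assumed to be an array of { name, blob }; saveBlob is a placeholder.
async function saveInParts(downloadedImages, imagesPerPart = 100) {
  const partCount = Math.ceil(downloadedImages.length / imagesPerPart);
  for (let part = 0; part < partCount; part++) {
    const zip = new JSZip();
    const slice = downloadedImages.slice(part * imagesPerPart, (part + 1) * imagesPerPart);
    for (const { name, blob } of slice) zip.file(name, blob);
    const archive = await zip.generateAsync({ type: 'blob' });
    // Counter in the name, e.g. gallery.part01.zip, gallery.part02.zip, ...
    saveBlob(archive, `gallery.part${String(part + 1).padStart(2, '0')}.zip`);
  }
}
```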

typhoon71 avatar Jan 14 '17 21:01 typhoon71

This is already planned, but I'm not working on it yet. Anyway, thanks for the reminder.

ccloli avatar Jan 24 '17 04:01 ccloli

Oh, I didn't notice there was a plan/todo; nice to know this is planned, thanks.

typhoon71 avatar Jan 24 '17 07:01 typhoon71

Hi, sorry for the late reply. Though I'm still not working on it, I have some questions about split archives and I need your feedback (also from everyone else who wants this feature).

I think I need to refactor the code someday for version 2 of this project; then I can add this feature while designing the code structure. But I'm not sure when to start (really too lazy to do anything, and getting ready for a job m(_ _)m), or whether it should also be implemented in Node.js (no need to worry about memory usage there, but it may be harder to use) or somewhere else, but at least there will be a user-script version.

First, since we don't know how large each image file is, I think using a maximum split size per archive is better, e.g. storing each part as no more than 100 MB.

  1. Is the correct order of images / archives needed?

    As you may know, the images are downloaded asynchronously, which means later images may finish before earlier ones, like this:

    #1: 1.jpg - 800KB - 100% succeed
    #2: 2.jpg - 500KB - 100% succeed
    #3: 3.jpg - 500KB - 44.8% downloading
    #4: 4.jpg - 500KB - 100% succeed
    #5: 5.jpg - 800KB - 100% succeed
    

    So if we set a split size of 2 MB and we need to keep the order, then it will wait until 3.jpg is finished and produce 2 archives: one with 1.jpg, 2.jpg, 3.jpg and another with 4.jpg, 5.jpg. But 3.jpg may take a long time, so we can't finish the next archive either. Even though we can know in advance which part each image should go into, each part may also have one or two images that freeze its progress. That is to say, splitting the archive becomes meaningless, because all (or most of) the archive parts are waiting for those slow images.

    As an extreme case, take a gallery with 1000 images, each 1 MB, and a split size of 100 MB. If the images at positions 100n + 2 (2.jpg, 102.jpg, 202.jpg...) are still downloading, no part can be saved at all, because every part is waiting for THAT one image.


    If we use a dynamic split that prefers to store the images that have already succeeded -- just like the operating system scheduling policy called FCFS (first come, first served) -- the order may not be correct. In this example, it will produce the archive with 1.jpg, 2.jpg, 4.jpg first, then once 3.jpg is finished the later archive will include 3.jpg and 5.jpg. That is to say, you can't open just one part to view a specific range, because those images may be spread across other parts.

    For the same extreme case -- a gallery with 1000 images, each 1 MB, and a 100 MB split size -- if the images at positions 100n + 2 (2.jpg, 102.jpg, 202.jpg...) are still downloading, every part except the last one can be saved, and after those 10 images are finished they will all end up in the last part (912.jpg-1000.jpg, 2.jpg, 102.jpg, 202.jpg ... 902.jpg).

  2. ~~When should the parts of the archive be delivered? In real time, or after everything is finished?~~

    Here are the two ways to deliver the parts, compared:

    • Real-time
      • Get each archive part as soon as it's ready, so the saved images and that part can be dropped immediately to save memory
      • The user has to confirm that each part saved correctly in case of a broken save, which may be annoying if you are doing something else
    • After all finished
      • Get all archive parts at once, so you don't need to keep checking; just check all of them when everything is finished
      • All the archive parts have to be kept at least until every one of them is saved, so it still takes some memory to hold the archives

    Since this feature is aimed at fixing the memory usage problem, you may prefer the first one. But confirming every save may be really annoying. Then again, if you don't confirm the saves in real-time mode, it keeps all the archives anyway, which is just like the after-all-finished mode. Hmmmm...... I think I've found the answer: I'll choose the first one, real-time mode, so this question doesn't need an answer. But if you have another suggestion, feel free to talk.

  3. What if the preset split size is not enough?

    For example, you set the split size to 100 MB, but the out-of-memory problem or the no-file problem still happens, and you've tried several times but it still doesn't work. I think it may be possible to split that archive into two or more archives, but all the original images have already been dropped (in the end we only keep the data of the archives). That means we need to read the archive again to get all the files inside it, then package them again. But then the problem comes: if you have already saved some archives, the archive order will become incorrect. Say you've saved part1.zip and part3.zip, but part2.zip needs to be split. Will the results be part2.zip and part4.zip, or part2.part1.zip and part2.part2.zip, or something else? That comes back to the first question: is the order really needed? Hmm... maybe we'll hit other problems too, but it's a serious question; it takes us back to the beginning and could even become an infinite loop.

Maybe those are the only questions for now? But... it seems I have solved some of them already, so the only open question is the first one: do we need the correct order of images / archives? (A rough sketch of the dynamic-split idea is shown below.)
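
To make question 1 concrete, here is a rough sketch of the dynamic (FCFS) split: images go into the current part in the order they finish downloading, and a part is handed off as soon as the next image would push it over the size limit, so its memory can be released right away. All names here are made up for illustration; this isn't the script's actual code.

```js
// Hypothetical sketch of dynamic (FCFS) packing with a maximum part size.
class PartPacker {
  constructor(maxBytes, onPart) {
    this.maxBytes = maxBytes;   // e.g. 100 * 1024 * 1024 for 100 MB parts
    this.onPart = onPart;       // called with { index, entries } when a part is full
    this.entries = [];          // images collected for the current part
    this.size = 0;
    this.index = 0;
  }
  add(name, blob) {
    // Flush the current part first if this image would push it over the limit.
    if (this.size > 0 && this.size + blob.size > this.maxBytes) this.flush();
    this.entries.push({ name, blob });
    this.size += blob.size;
  }
  flush() {
    if (this.entries.length === 0) return;
    this.onPart({ index: ++this.index, entries: this.entries });
    this.entries = [];          // drop references so the memory can be reclaimed
    this.size = 0;
  }
}

// Usage: feed images in completion order. Note that part boundaries then depend on
// download order, so "part 2" may not contain images 101-200.
const packer = new PartPacker(100 * 1024 * 1024, ({ index, entries }) => {
  console.log(`part ${index}: ${entries.map(e => e.name).join(', ')}`);
  // Here the entries could be zipped (e.g. with JSZip) and saved as gallery.part{index}.zip.
});
// packer.add('1.jpg', imageBlob); ...; packer.flush(); // flush once after the last image
```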

If you have any other questions about this feature (except when it will be done... QAQ), just reply here and let's discuss it :-)

ccloli avatar May 23 '17 12:05 ccloli

Alternatively, can't we just skip the entire zip part and download the files into a folder one by one, skipping the RAM part too? I mean, just optionally.

Clydefrosch22 avatar Aug 05 '17 07:08 Clydefrosch22

@Clydefrosch22 It can't be done in the browser environment; it doesn't have write access to your local disks, for security reasons (maybe THAT DEAD IE6, which supports VBS, could do it). It might work if you set your browser to download files without asking where to save them, but that means ALL the images would be saved directly to the default download folder (without a sub-folder). If you want to choose where to save, you have to confirm them one by one. Both of those are ugly, right? That's why it has to package them and then download the zip file; it's just a user script running in your browser. If it could run on your desktop like regular software, that wouldn't be a problem. Maybe you should try this forked version: https://github.com/8qwe24657913/E-Hentai-Downloader-NW.js/tree/v0.13+
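
For reference, the only saving mechanism a page script really has is roughly the following: build a Blob in memory and then trigger a download of it, which is why the whole archive has to fit in memory before it can be saved. A minimal sketch (the `saveBlob` name is made up for illustration):

```js
// Hypothetical helper: trigger a browser download of an in-memory Blob.
function saveBlob(blob, filename) {
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = filename;        // suggested file name; the browser decides the folder
  document.body.appendChild(a);
  a.click();
  a.remove();
  URL.revokeObjectURL(url);     // release the object URL afterwards
}

// e.g. saveBlob(zipBlob, 'gallery.zip');
```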

ccloli avatar Aug 19 '17 14:08 ccloli

@ccloli While websites still don't have direct access to local disks, they can write files to the disk by downloading them. With Blobs, all the data is kept in memory, but with newer features such as the Streams API the data can be handled as a stream, saving memory. There is a library, StreamSaver.js, that helps with handling it.
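
As a rough sketch of what that could look like (assuming StreamSaver.js is loaded and exposes its documented `streamSaver.createWriteStream`; the async chunk source here is hypothetical):

```js
// Hypothetical sketch: write archive chunks to disk as they are produced,
// instead of holding the whole archive as one Blob in memory.
async function saveAsStream(filename, chunkIterator) {
  const fileStream = streamSaver.createWriteStream(filename);
  const writer = fileStream.getWriter();
  for await (const chunk of chunkIterator) {
    await writer.write(chunk);  // chunk is a Uint8Array; memory is freed as it goes
  }
  await writer.close();
}
```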

qgustavor avatar Jul 28 '19 19:07 qgustavor