[BUG] Download order of submissions from id-file is non-predictable

Open • thomas694 opened this issue 1 year ago • 2 comments

  • [x] I am reporting a bug/flaw.
  • [x] I am running the latest version of BDfR
  • [x] I have read "Opening an issue"

Description

I'm downloading submissions with the --include-id-file parameter. When I see an ID in the output and look it up in the file, I expect its position to tell me how far along the download is, so that I can estimate the remaining IDs and time. Unfortunately, the submissions aren't downloaded in the order they are given in the file.

I see no reason why they shouldn't be. For exclude files the order is irrelevant, but for include files the user naturally expects it to be preserved.

I can provide a PR for this, but I would need someone to write the test code, if tests are required.

thomas694 • Apr 30 '23 20:04

The reason for this is that we use sets to determine what should and shouldn't be downloaded. Sets are, by their nature, unordered, but they make deduplication and membership checks much faster than the alternatives, especially when the number of IDs is large. I'm not sure that giving this up would be worthwhile just to download IDs from a file in order.
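For illustration only (this is not BDfR's actual code): a minimal Python sketch of how duplicates can be dropped while keeping file order, with a set still used for the exclusion check. The function names `unique_in_order` and `read_include_ids` are hypothetical. It relies on the fact that `dict` keys are unique and, since Python 3.7, keep insertion order.

```python
def unique_in_order(ids):
    """Drop duplicate IDs while keeping the first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(ids))


def read_include_ids(lines, excluded=frozenset()):
    """Hypothetical helper: clean, deduplicate, and filter IDs in file order."""
    cleaned = (line.strip() for line in lines)
    # Membership in `excluded` is still a fast set lookup
    return [i for i in unique_in_order(cleaned) if i and i not in excluded]


if __name__ == "__main__":
    sample = ["abc\n", "xyz\n", "abc\n", "\n", "def\n"]
    print(read_include_ids(sample, excluded={"xyz"}))
```

The exclusion side stays a set, so only the include side changes from "unordered" to "ordered and unique".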

Serene-Arc • Jun 01 '23 07:06

Yes, I saw that sets are used for IDs, but only in one place: to store the union of the excluded IDs (connector.py#L83) specified by the exclude file and the command line. The included IDs specified by file are added to a list (connector.py#L88) anyway.

Sets with the in operator give a performance gain, but at the cost of extra memory. Since this is a downloader and not a math app, I don't think the user will notice any difference in performance, though possibly in memory usage. Even with hundreds of thousands of IDs, a list of included IDs caused no problems here.
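As a rough way to check the memory side of this trade-off, here is a small Python sketch. It assumes CPython, the function name `container_sizes` and the ID format are made up for the example, and `sys.getsizeof` measures only the container itself, not the ID strings it refers to.

```python
import sys


def container_sizes(n):
    """Return (list_bytes, set_bytes) for n distinct placeholder IDs."""
    ids = [f"id{i:07d}" for i in range(n)]  # made-up IDs, not real submission IDs
    as_list = list(ids)
    as_set = set(ids)
    # getsizeof counts the container's own storage only (CPython-specific)
    return sys.getsizeof(as_list), sys.getsizeof(as_set)


if __name__ == "__main__":
    list_bytes, set_bytes = container_sizes(100_000)
    print(f"list: {list_bytes} bytes, set: {set_bytes} bytes")
```

Exact numbers depend on the Python version and platform, so none are quoted here.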

thomas694 • Jun 02 '23 21:06