bulk-downloader-for-reddit
[BUG] Download order of submissions from id-file is non-predictable
- [x] I am reporting a bug/flaw.
- [x] I am running the latest version of BDfR
- [x] I have read the Opening an issue
Description
I'm downloading submissions with the parameter `--include-id-file`. When I see an ID in the output and look it up in the file, I expect its position in the file to tell me how far the download has progressed, so that I can estimate the remaining IDs and time.
Unfortunately, the submissions aren't downloaded in the order they are given in the file.
I see no reason why they shouldn't be. For exclude files the order is irrelevant, but for include files the user naturally expects it to be preserved.
I can provide a PR for this, but I would need someone to write the test code, if tests are required.
The reason for this is that we use sets to determine what should and shouldn't be downloaded. Sets are, by their nature, unordered, but they make not having duplicates and checking inclusion much faster than the other options, especially where the numbers of IDs to work with are quite large. I'm not sure that changing this would be an improvement for the benefit of downloading IDs from a file in order.
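For illustration, duplicates can be removed without losing file order while still using a set for fast exclusion checks. This is a minimal sketch and not BDfR's actual code: the function names `ordered_unique` and `filter_excluded` and the sample IDs are hypothetical.

```python
def ordered_unique(ids):
    """Drop duplicates while keeping first-seen order.

    Relies on dicts preserving insertion order (guaranteed since Python 3.7).
    """
    return list(dict.fromkeys(ids))


def filter_excluded(include_ids, exclude_ids):
    """Return the included IDs in their original order, minus excluded ones."""
    excluded = set(exclude_ids)  # a set keeps the membership test fast
    return [i for i in ordered_unique(include_ids) if i not in excluded]


# Hypothetical IDs, as they might be read line by line from an include file.
ids_from_file = ["abc123", "zzz999", "abc123", "def456"]
result = filter_excluded(ids_from_file, {"zzz999"})
print(result)
```

The excluded IDs stay in a set, so only the included IDs pay the cost of keeping their order.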
Yes, I saw that a set is used for IDs, but only in one place: to store the union of the excluded IDs given by the exclude file and the command line (connector.py#L83). The included IDs read from the file are added to a list anyway (connector.py#L88).
With sets, the `in` operator is faster, but there is a memory penalty. Since this is a downloader and not a math app, I don't think the user will notice any difference in performance, though they might in memory usage. Even with hundreds of thousands of IDs, a list of included IDs caused no problems here.
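The memory side of this trade-off can be checked roughly with the sketch below. It assumes CPython, where `sys.getsizeof` reports only the container's own size and not the strings it holds; the generated IDs are made up for the measurement.

```python
import sys

# Made-up IDs, only used to give both containers the same contents.
ids = [f"id{n:06d}" for n in range(10_000)]

id_list = list(ids)
id_set = set(ids)

# Size of the container itself (the ID strings are shared by both).
list_bytes = sys.getsizeof(id_list)
set_bytes = sys.getsizeof(id_set)

print(f"list: {list_bytes} bytes, set: {set_bytes} bytes")
```

On CPython the set's hash table takes more space than the list's pointer array for the same elements, while `in` on the list has to scan it linearly.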