
Add --start-idx=<n>, --end-idx=<m> option to enable ranged downloads

Open Russtopia opened this issue 2 years ago • 3 comments

Addition of --start-idx=<n> and --end-idx=<m> options for cli download. This allows downloading specific ranges of items within a collection. It also greatly speeds up resuming an incomplete collection download: with --start-idx, the iteration loop skips the first (n-1) entries without checking them against locally downloaded items.
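A minimal sketch of the skipping behaviour described above, not the code from the actual patch. The function name `select_range` is hypothetical, and treating both indices as 1-based with an inclusive end index is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def select_range(
    items: Iterable[T],
    start_idx: int = 1,
    end_idx: Optional[int] = None,
) -> Iterator[T]:
    """Yield items start_idx..end_idx (1-based, end inclusive; an assumption).

    Entries before start_idx are consumed from the iterator but never
    inspected, so no per-item check against local files is needed for them.
    """
    if start_idx < 1:
        raise ValueError("start_idx must be >= 1")
    if end_idx is not None and end_idx < start_idx:
        raise ValueError("end_idx must be >= start_idx")
    # islice uses 0-based, end-exclusive bounds, so shift only the start.
    return islice(items, start_idx - 1, end_idx)
```

For example, `list(select_range(results, 101, 200))` would keep only the 101st through 200th results of whatever iterable `results` is.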

Russtopia • Aug 20 '23 23:08

OK, I will look at making all of those changes and re-submit.

With regard to marking the beginning and end: which identifier do you think would be best to use? I know little to nothing about how the IA uses metadata and so on; perhaps there's a specific one that stays constant even as a collection is updated?

(I have already noticed a collection I was testing this on has grown since I began; but so far the older items hadn't moved around so resuming at a specific item number seemed to result in the same one each time.)

Russtopia • Aug 24 '23 04:08

As far as I know, there is no marker that always remains constant with regard to the sequence. I also think it is close to impossible to cover all the edge cases here (e.g. an existing item could simply be replaced).

I would think that using the item's identifier (i.e. the "name" of the item) would at least give a somewhat more reliable outcome when splitting a sequence. But every approach has its own issues. For example, you could say --start-at=item-abc-1 --end-at=item-xyz-555. While the start and end items would remain constant with this scheme, you could still miss items, so it's not a perfect solution either.
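To make the identifier-based idea concrete, here is a rough sketch. The options --start-at and --end-at are only the suggestion above, and `select_between` is a hypothetical helper, not part of the `ia` tool.

```python
from typing import Iterable, Iterator, Optional


def select_between(
    identifiers: Iterable[str],
    start_at: Optional[str] = None,
    end_at: Optional[str] = None,
) -> Iterator[str]:
    """Yield identifiers from start_at through end_at, both inclusive.

    Hypothetical helper: if start_at never appears in the sequence
    (for example because that item was removed or replaced), nothing
    is yielded at all.
    """
    started = start_at is None
    for identifier in identifiers:
        if not started:
            if identifier != start_at:
                continue  # still before the start marker
            started = True
        yield identifier
        if end_at is not None and identifier == end_at:
            return  # stop right after the end marker
```

The boundaries stay fixed by name, but the result still depends on the order in which the search returns items, which is why items can be missed between runs.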

I guess your current approach is easier to use for your use case.

maxz • Aug 24 '23 13:08

What if you just used --search-parameters to set a sort like so:

ia download --search 'frogs' --search-parameters='sorts=publicdate desc'

And then when you resumed, you could do something like:

ia download --search 'frogs AND publicdate:[YYYY-MM-DD TO *]' --search-parameters='sorts=publicdate desc'

Alternatively, you could use --itemlist (or an itemlist used with GNU Parallel and ia download) and modify your itemlist as necessary on resume.
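As a sketch of the itemlist approach, the snippet below trims an itemlist on resume. It assumes the itemlist holds one identifier per line and that each item was downloaded into a directory named after its identifier. The helper `remaining_identifiers` is hypothetical, and an existing directory does not prove that the item finished downloading.

```python
from pathlib import Path
from typing import Iterable, List


def remaining_identifiers(itemlist_lines: Iterable[str], download_dir: str) -> List[str]:
    """Return identifiers that have no directory under download_dir yet.

    Assumption: one identifier per line, and one directory per item
    named after the identifier. Partially downloaded items are treated
    as done, so re-check the most recent item by hand.
    """
    root = Path(download_dir)
    remaining = []
    for line in itemlist_lines:
        identifier = line.strip()
        if not identifier:
            continue  # ignore blank lines
        if (root / identifier).is_dir():
            continue  # a directory exists, so assume it was fetched
        remaining.append(identifier)
    return remaining
```

The returned list could be written to a new file and passed to --itemlist on the next run.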

jjjake • Aug 28 '23 18:08