
Retrieving objects for a set or list of URLs in parallel

Open vikas95 opened this issue 4 years ago • 3 comments

Hi,

Thanks for sharing the programming example - https://github.com/cocrawler/cdx_toolkit#programming-example. I wanted to ask if there is a way to feed in a list of URLs and retrieve their objects. The example above feeds URLs one by one, and looping over a few thousand (or even a few hundred) of them seems to be time-consuming.
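One way to sketch this is to fan the URL list out over a thread pool with `concurrent.futures`. This is only an illustration, not something from cdx_toolkit itself: the `fetch_one` and `fetch_many` names are made up here, and the real `fetch_one` body (shown as a comment) is an untested assumption about how one might wrap `cdx.iter`; a stub stands in so the parallel plumbing can be shown on its own.

```python
# Hypothetical sketch: parallelize per-URL lookups with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def fetch_one(url):
    # In real use, something like (untested assumption):
    #   import cdx_toolkit
    #   cdx = cdx_toolkit.CDXFetcher(source='cc')
    #   return next(cdx.iter(url, limit=1), None)
    return {'url': url, 'status': '200'}  # stub record for illustration

def fetch_many(urls, workers=10):
    # pool.map preserves input order, so results line up with urls
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))

records = fetch_many(['example.com/a', 'example.com/b'])
```

Note that even with a thread pool, throughput is ultimately bounded by how fast the index servers answer, so this only hides per-request latency rather than removing it.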

Thanks.

vikas95 avatar Dec 10 '21 18:12 vikas95

The loop could be one iteration... in fact, the example you're looking at just loops once (limit=1).

wumpus avatar Dec 13 '21 20:12 wumpus

@wumpus - thanks for the response :D I am trying to retrieve metadata for nearly 10k webpages, feeding the URL of each webpage one by one to the cdx.iter function. I am measuring the retrieval time for sets of 20 webpages: some sets take nearly 30 minutes, while other sets of the same size are retrieved within 5 minutes.

I read your explanation on another issue in this repo (https://github.com/cocrawler/cdx_toolkit/issues/8). I wanted to ask whether the retrieval time depends on how many requests are sent to Common Crawl at a given time? It would also be helpful if you could suggest any changes that might speed up retrieval.

Thanks.

vikas95 avatar Mar 17 '22 05:03 vikas95

Turn up the verbose level and you'll see what's going on: if you are not limiting your time span, the cdx code has to talk to every Common Crawl index individually, whereas for the Internet Archive there's just one query.
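A hedged sketch of what limiting the time span might look like, assuming the `from_ts`/`to` parameters to `cdx.iter` shown in the cdx_toolkit README (the `fetch_recent` name and the specific timestamps are made up for illustration; the import is done lazily so the sketch can be read and checked without the package installed):

```python
import logging

# Higher verbosity shows which indexes are actually being queried.
logging.basicConfig(level=logging.DEBUG)

def fetch_recent(url):
    # cdx_toolkit imported lazily; parameter names taken from its README.
    import cdx_toolkit
    cdx = cdx_toolkit.CDXFetcher(source='cc')
    # Restricting from_ts/to means only the Common Crawl indexes covering
    # this window need to be consulted, instead of every index ever made.
    return list(cdx.iter(url, from_ts='202201', to='202203', limit=10))
```

The design point is that Common Crawl publishes one index per crawl, so narrowing the window directly reduces the number of index servers the library must query per URL.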

wumpus avatar Mar 26 '22 18:03 wumpus