
Async Support

fasihhussain00 opened this issue 9 months ago

I’ve been using this library for scraping tasks and encountered a performance issue that could be resolved with the integration of aiohttp for asynchronous network calls.

Currently, synchronous network requests block while waiting on I/O, so CPU-bound work in the scraping pipeline sits idle, leading to inefficiencies. Using aiohttp would let the network calls run asynchronously, preventing CPU-bound tasks from being delayed or blocked while requests are in flight.

e.g.

import aiohttp

async def fetch(url: str, headers: dict) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers) as response:
            response.raise_for_status()
            return await response.text()

Moreover, using aiohttp would avoid the need to create multiple threads to execute the scraping jobs quickly; a single event loop can run them concurrently, which would reduce the complexity and improve the overall performance of the application.
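For illustration, a rough sketch of that idea; the `scrape_page` helper and the example URLs are hypothetical placeholders, not part of the library:

import asyncio
import aiohttp

async def scrape_page(session: aiohttp.ClientSession, url: str) -> str:
    # While this coroutine waits on the network, the event loop is free
    # to run other coroutines -- no extra threads involved.
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.text()

async def scrape_all(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        # Schedule every page fetch on the same event loop and collect results.
        return await asyncio.gather(*(scrape_page(session, u) for u in urls))

# asyncio.run(scrape_all([f"https://example.com/jobs?page={i}" for i in range(5)]))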

It would be great if the library could consider switching to or adding an option for aiohttp for the network requests.

Looking forward to hearing from you guys.

fasihhussain00 · Feb 08 '25

yep I agree

cullenwatson · Feb 09 '25

Following up on this thread, I have made initial attempts at implementing the AsyncLinkedIn scraper and found several points worth discussing before a long-term plan can be made, given the issue was labelled as a priority:

  1. [This LinkedIn API](https://gist.github.com/Diegiwg/51c22fa7ec9d92ed9b5d1f537b9e1107) appears to have a rate limit of roughly 10 requests per 10-second window per session (a rough estimate; correct me if you have a better figure). I would guess this is why a random delay is added between pages (results_wanted > 10). 10 requests per 10 s is barely enough for synchronous requests, and asynchronous requests would exhaust it almost immediately. I would say this rate limit, not the synchronous I/O itself, is the true performance bottleneck. We could add a retry and back-off mechanism, but waiting out the limit partly defeats the purpose of going async and could make things slower overall; some form of client-side rate limiting would be needed either way (see the first sketch after this list). Rotating multiple proxies would gain the most here, but no proxies is the default. That said, it would also be useful to gather rate-limit figures for the other scrapers.
  2. It would make sense to work on a dedicated branch if we agree to move this forward.
  3. For the async version, adding an option to search multiple cities concurrently would be a benefit, either via an async queue/priority queue (see the second sketch below) or via threads, similar to the existing approach for handling multiple platform searches.
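Regarding point 1, a minimal sketch of the kind of client-side limiter the async version would need, assuming the rough 10-requests-per-10-seconds figure above; the class name and `fetch_page` helper are illustrative, not existing code:

import asyncio
import time
import aiohttp

class SlidingWindowLimiter:
    """Allow at most `max_calls` requests within any `window` seconds."""
    def __init__(self, max_calls: int = 10, window: float = 10.0):
        self.max_calls = max_calls
        self.window = window
        self.calls: list[float] = []
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # Drop timestamps that have fallen out of the window.
            self.calls = [t for t in self.calls if now - t < self.window]
            if len(self.calls) >= self.max_calls:
                # Sleep until the oldest call ages out of the window.
                await asyncio.sleep(self.window - (now - self.calls[0]))
            self.calls.append(time.monotonic())

async def fetch_page(session: aiohttp.ClientSession,
                     limiter: SlidingWindowLimiter, url: str) -> str:
    await limiter.acquire()  # stay under the estimated limit
    async with session.get(url) as response:
        response.raise_for_status()
        return await response.text()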
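And for point 3, a minimal sketch of the queue-based option; `scrape_city` is a hypothetical stand-in for the real per-city scrape, and a priority queue would just swap in asyncio.PriorityQueue:

import asyncio

async def worker(queue: asyncio.Queue) -> None:
    # Each worker pulls cities off the shared queue until it is drained.
    while True:
        city = await queue.get()
        try:
            await asyncio.sleep(0)  # hypothetical scrape_city(city) call
        finally:
            queue.task_done()

async def search_cities(cities: list[str], concurrency: int = 3) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for city in cities:
        queue.put_nowait(city)
    workers = [asyncio.create_task(worker(queue)) for _ in range(concurrency)]
    await queue.join()  # wait until every city has been processed
    for w in workers:
        w.cancel()      # workers loop forever; cancel once the queue is drained

# asyncio.run(search_cities(["Berlin", "Munich", "Hamburg"]))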

Looking forward to hearing from you.

lixianphys · Oct 03 '25