Evaluation of brozzler's scalability?

Open goelayu opened this issue 1 year ago • 4 comments

I am curious whether there is any data on how well brozzler scales as the number of parallel browsers increases. In my current (very limited) test bed, brozzler takes extremely long to crawl web pages and store the corresponding resources.

Attaching some results from crawling 20 random web pages with brozzler with headless Chrome enabled. Scalability results: [figure: scale]

I also track all system resource usage (CPU, network, disk). I am currently running this experiment on a 32-core Linux server with a 1 Gbps NIC, storing data on an underlying HDD with read/write throughput of 150-200 MB/s. [figures: nw, disk, cpu]

As you can see, none of the resources is saturated, and yet brozzler takes on average ~40-50 s to crawl and store a single page. Furthermore, the low CPU usage is extremely concerning, since in my experience increasing the number of parallel browsers linearly increases the overall CPU usage of the system. Could this be due to the proxy server used by brozzler?

Also, when I crawl the same corpus of pages using an extremely lightweight, custom, nodejs based crawler (written on top of puppeteer), it can do so about 10x faster than the above observed timings.

goelayu avatar Aug 11 '22 19:08 goelayu

  1. You are using warcprox for archiving, right? Have you checked its /status endpoint to see the state of the queues? This is probably the bottleneck.
  2. You need to show us the exact Python code you are running for your experiment. Maybe you are doing something in a sub-optimal way.
  3. Is your "custom nodejs based crawler" also using warcprox for archiving? If not, the comparison is not fair.
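
For anyone wanting to try the check in point 1, here is a minimal sketch of polling the /status endpoint and looking at how full the queue is. The base URL `http://localhost:8000` and the JSON field names `queued_urls` and `queue_max_size` are assumptions, not confirmed in this thread; adjust them to whatever your warcprox instance actually returns.

```python
import json
import urllib.request


def fetch_status(base_url, timeout=5.0):
    """Fetch and decode the JSON document served at <base_url>/status."""
    with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as resp:
        return json.load(resp)


def summarize_queue(status):
    """Reduce a status document to queue depth, capacity, and fill ratio.

    The keys 'queued_urls' and 'queue_max_size' are assumed field names;
    missing keys are reported as None instead of raising.
    """
    queued = status.get("queued_urls")
    capacity = status.get("queue_max_size")
    fill_ratio = None
    if queued is not None and capacity:
        fill_ratio = queued / capacity
    return {"queued": queued, "capacity": capacity, "fill_ratio": fill_ratio}


if __name__ == "__main__":
    # Hypothetical address; point this at your own warcprox instance.
    print(summarize_queue(fetch_status("http://localhost:8000")))
```

A fill ratio that stays close to 1.0 while the crawl runs would suggest the proxy is not keeping up with the browsers.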

vbanos avatar Aug 12 '22 12:08 vbanos

Hi, I'm new to Brozzler/warcprox and am noticing something that could cause a slowdown at scale. This networked, pipelined system does a lot of reads and writes: HTTP, WARC file I/O, and RethinkDB TCP. This setup will eventually become I/O-limited at scale and hit GIL thrashing pretty quickly. Digging into the code, I'm seeing some use of thread pool executors for parallelization, but it probably won't help much when scaling and in the worst case could cause race conditions or unexpected behavior. There is limited use of asyncio and the modern concurrency features of later Python versions (in fact, only the warcprox benchmark script uses them fully).

IMAO this framework needs to step up with modern concurrent Python patterns and replace these I/O-blocking touchpoints:

  • urllib requests -> httpx or requests-asyncio
  • builtin file io open -> aiofiles/anyio
  • sqlite3 -> aiosqlite
  • subprocess -> asyncio.subprocess
  • rethinkdb connections -> ... (I'm sure there's something out there)

I realize this is a big effort across multiple repos, but it would be fairly straightforward to add. Adding full async support to Python codebases is a big lift compared to Node.js, which was built for these use cases.
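
To illustrate the pattern rather than any actual brozzler code, here is a minimal stdlib-only sketch of overlapping I/O waits with asyncio. `fetch_page` is a hypothetical stand-in that sleeps instead of doing real network or disk I/O, and `max_concurrent` plays the role of a browser pool size; none of these names come from brozzler or warcprox.

```python
import asyncio
import time


async def fetch_page(url, delay):
    """Hypothetical stand-in for a non-blocking I/O call (HTTP, file, DB)."""
    await asyncio.sleep(delay)  # yields to the event loop while "waiting"
    return f"fetched {url}"


async def crawl_all(urls, delay=0.1, max_concurrent=4):
    """Run fetches concurrently, with at most max_concurrent in flight."""
    limit = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with limit:
            return await fetch_page(url, delay)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def timed_crawl(urls, delay=0.1, max_concurrent=4):
    """Return (results, elapsed_seconds) for one crawl of the given URLs."""
    start = time.perf_counter()
    results = asyncio.run(crawl_all(urls, delay, max_concurrent))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_crawl(["a", "b", "c", "d"])
    print(results, f"{elapsed:.2f}s")
```

With four simulated fetches of 0.1 s each and a limit of four, the waits overlap, so the total should be close to one delay instead of the sum of all four.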

justquick avatar Jul 05 '23 20:07 justquick

RethinkDB supports asyncio out of the box:

https://github.com/rethinkdb/rethinkdb-python#asyncio-mode

TheTechRobo avatar Jul 05 '23 21:07 TheTechRobo

@goelayu are you running one Brozzler worker and trying to scale up the browser pool? Or multiple Brozzler workers? I don't work on Brozzler, but my understanding is that to scale things up, you are supposed to run multiple workers.

anjackson avatar Jul 06 '23 13:07 anjackson