Evaluation of brozzler's scalability?

Open goelayu opened this issue 1 year ago • 4 comments

I am curious whether there is any data on how well brozzler scales as the number of parallel browsers increases. In my current (very limited) test bed, brozzler takes extremely long to crawl web pages and store the corresponding resources.

I am attaching some results from crawling 20 random web pages with brozzler using headless Chrome. [Figure: scalability results]

I also track system resource usage (CPU, network, disk). I am currently running this experiment on a 32-core Linux server with a 1 Gbps NIC, storing data on an HDD with read/write throughput of 150-200 MBps. [Figures: disk usage, CPU usage]

As you can see, none of these resources is saturated, and yet brozzler takes on average ~40-50s to crawl and store a single page. The low CPU usage is especially concerning, since in my experience increasing the number of parallel browsers linearly increases the overall CPU usage of the system. Could this be due to the proxy server used by brozzler?

Also, when I crawl the same corpus of pages using an extremely lightweight, custom, Node.js-based crawler (written on top of puppeteer), it finishes about 10x faster than the timings observed above.

Aug 11 '22 19:08 goelayu

You are using warcprox for archiving, right? Have you checked its /status endpoint to see the state of the queues? This is probably the bottleneck.
You need to show us the exact Python code you are running for your experiment. Maybe you are doing something in a sub-optimal way.
Is your "custom nodejs based crawler" also using warcprox for archiving? If not, the comparison is not fair.
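
For the first check, here is a minimal sketch, assuming warcprox is listening on localhost:8000 (adjust the address to your setup). The field names in the /status JSON can differ between warcprox versions, so the hypothetical helper `queue_fields` only picks out top-level keys whose name mentions "queue" instead of assuming specific names.

```python
import json
import urllib.request


def fetch_status(base_url="http://localhost:8000", timeout=5.0):
    """GET <base_url>/status and decode the JSON body.

    The base URL is an assumption; point it at your own warcprox instance.
    """
    with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def queue_fields(status):
    """Return the top-level entries of the status dict whose key mentions 'queue'."""
    return {key: value for key, value in status.items() if "queue" in key.lower()}


# Usage while a crawl is running (needs a reachable warcprox):
#   print(json.dumps(queue_fields(fetch_status()), indent=2))
```

If the queue-related numbers keep growing during a crawl, that would suggest warcprox is writing records more slowly than the browsers produce them.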

Aug 12 '22 12:08 vbanos

Hi, I'm new to brozzler/warcprox and am noticing something that could cause a slowdown at scale. This networked, pipelined system does a lot of reads and writes: HTTP, WARC file I/O, and RethinkDB over TCP. This setup will eventually become I/O-limited at scale and hit GIL thrashing pretty quickly. Digging into the code, I see some use of thread pool executors for parallelism, but that probably won't help much when scaling and in the worst case could cause race conditions or unexpected behavior. There is limited use of asyncio and the modern concurrency features of later Python versions (in fact, only the warcprox benchmark script uses them fully).

IMAO this framework needs to step up with modern concurrent Python patterns and replace these blocking I/O touchpoints:

urllib requests -> httpx or requests-asyncio
builtin file io open -> aiofiles/anyio
sqlite3 -> aiosqlite
subprocess -> asyncio.subprocess
rethinkdb connections -> ... (im sure theres something out there)

I realize this is a big effort across multiple repos, but it would be fairly straightforward to add. Adding full async support to Python codebases is a big lift compared to Node.js, which was built for these use cases.
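
To illustrate the pattern I mean, here is a minimal sketch using only the standard library. It is not brozzler or warcprox code: `fake_fetch` is a hypothetical stand-in that simulates a non-blocking request with `asyncio.sleep`, and the semaphore limit of 10 is an arbitrary choice.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    # Hypothetical stand-in for a non-blocking HTTP request.
    # asyncio.sleep yields to the event loop, like awaiting real network I/O.
    await asyncio.sleep(delay)
    return url, 200


async def crawl(urls, limit=10):
    # The semaphore caps how many requests are in flight at once.
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fake_fetch(url)

    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo(n=20):
    urls = [f"https://example.com/page{i}" for i in range(n)]
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    return results, time.perf_counter() - start
```

With 20 simulated requests of 0.1s each, a serial loop would need about 2s, while this version waits on up to 10 at a time in a single thread. The gain only materializes if every blocking call on the path is replaced; one synchronous read inside a coroutine stalls the whole event loop.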

Jul 05 '23 20:07 justquick

Rethinkdb supports asyncio OOTB:

https://github.com/rethinkdb/rethinkdb-python#asyncio-mode

Jul 05 '23 21:07 TheTechRobo

@goelayu are you running one Brozzler worker and trying to scale up the browser pool? Or multiple Brozzler workers? I don't work on Brozzler, but my understanding is that to scale things up, you are supposed to run multiple workers.

Jul 06 '23 13:07 anjackson
