
zombie processes

Open Kaiden0001 opened this issue 1 year ago • 18 comments

Hi! My requirements.txt:

botasaurus==4.0.14
botasaurus_server==4.0.19
cchardet==2.1.7

zombie processes remain after execution

# Imports assumed for the old botasaurus 4.0.x API used in this snippet
from botasaurus import request, browser, bt, AntiDetectRequests, AntiDetectDriver
from botasaurus.create_stealth_driver import create_stealth_driver


@request
def scrape_heading_task(requests: AntiDetectRequests, botasaurus_request: dict):
    @browser(
        user_agent=botasaurus_request.get("user_agent") or bt.UserAgent.RANDOM,
        window_size=botasaurus_request.get("window_size") or bt.WindowSize.RANDOM,
        max_retry=botasaurus_request.get("max_retry"),
        add_arguments=["--disable-dev-shm-usage", "--no-sandbox", "--headless=new"],
        output=None,
        proxy=botasaurus_request.get("proxy") or None,
        create_driver=create_stealth_driver(
            start_url=botasaurus_request.get("url"),
            raise_exception=True,
            wait=botasaurus_request.get("wait"),
        ),
    )
    def scrape(driver: AntiDetectDriver, data):
        return {"text": driver.page_source, "cookies": driver.get_cookies()}

    return scrape()  # invoke the inner task so its result is returned (missing in the original snippet)

After about 800 requests, I get this error:

[Errno 11] Resource temporarily unavailable

or

 ('launch', 'Error: spawnSync /bin/sh EAGAIN\n    at Object.spawnSync (node:internal/child_process:1117:20)\n    at spawnSync (node:child_process:876:24)\n    at execSync (node:child_process:957:15)\n    at findChromeExecutables (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-finder.js:217:25)\n    at file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-finder.js:103:46\n    at Array.forEach (<anonymous>)\n    at Module.linux (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-finder.js:102:32)\n    at Launcher.getFirstInstallation (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-launcher.js:122:43)\n    at Launcher.launch (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-launcher.js:190:43)\n    at Module.launch (file:///usr/local/lib/python3.9/site-packages/javascript_fixes/js/node_modules/chrome-launcher/dist/chrome-launcher.js:33:20)')

How can I make them get cleaned up after each request?

Kaiden0001 avatar May 29 '24 14:05 Kaiden0001

Dockerfile

FROM chetan1111/botasaurus:latest

ENV PYTHONUNBUFFERED=1

COPY requirements.txt .

RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof

RUN mkdir app
WORKDIR /app
COPY . /app

CMD ["python", "run.py", "backend"]

Kaiden0001 avatar May 29 '24 14:05 Kaiden0001

The only solution is to upgrade to the latest version; with it, this error will not occur. Upgrade with:

python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade

Chetan11-dev avatar May 30 '24 17:05 Chetan11-dev

Is there no way to fix it on the old version?

Kaiden0001 avatar May 30 '24 18:05 Kaiden0001

You need to use the new version to resolve it.

Chetan11-dev avatar May 30 '24 20:05 Chetan11-dev

The same problem occurs on the new version.

requirements.txt

cchardet==2.1.7
botasaurus-requests==4.0.16
bota==4.0.62
botasaurus==4.0.34
botasaurus_api==4.0.4
botasaurus_driver==4.0.30
botasaurus-proxy-authentication==1.0.16
botasaurus_server==4.0.23
deprecated==1.2.14

After every request:

root@s# ps -A -ostat,pid,ppid | grep -e '[zZ]'
Z    3388440 3388338
Z    3388441 3388338
Z    3388443 3388338
Z    3388445 3388338
Z    3388450 3388338
Z    3388451 3388338
Z    3388452 3388338
Z    3388630 3388338

And they accumulate with each request.
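For reference, a quick way to watch the zombies accumulate (assumes a typical procps `ps`, as used in the listing above):

```shell
# List zombie (defunct) processes with their PIDs and parent PIDs.
ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/ {print $2, $3, $4}'

# Just the count, handy for watching it grow per request:
ps -eo stat | awk '$1 ~ /^Z/' | wc -l
```

Watching the parent PID column is what points at the reaping problem: all the zombies share the same parent, which is never calling `wait()` on them.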

Kaiden0001 avatar May 31 '24 06:05 Kaiden0001

  • Code to reproduce it?
  • Which OS are you using?

Chetan11-dev avatar May 31 '24 06:05 Chetan11-dev

code

from botasaurus.browser import browser, Driver
from botasaurus.request import request
from botasaurus_driver.user_agent import UserAgent
from botasaurus_driver.window_size import WindowSize


@request
def scrape_heading_task(requests, botasaurus_request: dict):
    @browser(
        block_images_and_css=True,
        user_agent=botasaurus_request.get("user_agent") or UserAgent.RANDOM,
        window_size=botasaurus_request.get("window_size") or WindowSize.RANDOM,
        max_retry=botasaurus_request.get("max_retry"),
        output=None,
        add_arguments=["--disable-dev-shm-usage", "--no-sandbox"],
        proxy=botasaurus_request.get("proxy") or None,
    )
    def scrape(driver: Driver, data):
        driver.google_get(
            link=botasaurus_request.get("url"),
            bypass_cloudflare=bool(botasaurus_request.get("bypass_cloudflare")),
            wait=botasaurus_request.get("wait"),
        )
        return {"text": driver.page_html, "cookies": driver.get_cookies()}

    try:
        return scrape()
    except Exception as e:
        return {"error": str(e)}

Dockerfile

FROM chetan1111/botasaurus:latest

ENV PYTHONUNBUFFERED=1

COPY requirements.txt .

RUN python -m pip install -r requirements.txt
RUN apt-get update && apt-get install -y lsof xvfb

RUN mkdir app
WORKDIR /app
COPY . /app

CMD ["python", "run.py", "backend"]

OS: Ubuntu 22.04 LTS x86_64

Kaiden0001 avatar May 31 '24 06:05 Kaiden0001

  • Are you running it in docker or ubuntu,
  • Also kindly share a sample call to function

Chetan11-dev avatar May 31 '24 07:05 Chetan11-dev

running in docker

scrapers.py

import os

from botasaurus_server.server import Server
from src.scrape_heading_task import scrape_heading_task

Server.rate_limit["browser"] = int(os.getenv("MAX_BROWSERS", 3))  # getenv returns a string when the env var is set, so cast to int
Server.add_scraper(scrape_heading_task)

scrape_heading_task.js

/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */

/**
 * @param {Controls} controls
 */
function getInput(controls) {
    controls.link('url', {isRequired: true})
    controls.text('user_agent', {isRequired: false})
    controls.listOfTexts('window_size', {isRequired: false})
    controls.text('proxy', {isRequired: false})
    controls.number('max_retry', {isRequired: false, defaultValue: 2})
    controls.number('bypass_cloudflare', {isRequired: false, defaultValue: 0})
    controls.number('wait', {isRequired: false, defaultValue: 5})
}

call

        api = Api(server_url)

        data = self.get_data(botasaurus_request)

        task = api.create_async_task(
            data=data,
            scraper_name="scrape_heading_task",
        )
        result = self.get_task_result(
            api,
            task.get("id"),
            botasaurus_request.timeout,
            botasaurus_request.wait,
        )

Kaiden0001 avatar May 31 '24 07:05 Kaiden0001

This issue occurs only in Docker. To resolve it, run:

python -m pip install bota botasaurus botasaurus_api botasaurus_driver botasaurus-proxy-authentication botasaurus_server --upgrade

With this, the zombie processes will be periodically purged and won't exceed 10 at any point.

Chetan11-dev avatar May 31 '24 13:05 Chetan11-dev

Unfortunately, I have the same problem. Upgrading to the latest version doesn't really help.

Could someone please provide some information on why this is happening and how to debug this error?

karazonanas avatar Jul 31 '24 00:07 karazonanas

Please run python -m pip install bota botasaurus botasaurus-api botasaurus-requests botasaurus-driver botasaurus-proxy-authentication botasaurus-server --upgrade. If that does not work, please share steps to reproduce the error.

Chetan11-dev avatar Jul 31 '24 05:07 Chetan11-dev

Well, I think I was able to reproduce the bug and figure out how to fix it:

To reproduce the bug, you can use the official botasaurus starter project. Run the project in a Docker container using the docker-compose.yml file, then start some scraping tasks from the web interface. Finally, run the top command inside your container. You'll probably see some zombie processes from Chrome.

The problem exists in all Docker/Podman containers due to the PID 1 zombie reaping problem.
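To illustrate what an init process has to do (a minimal sketch of the reaping concept, not botasaurus code): when a child exits, it stays a zombie until its parent calls `wait()` on it. Orphaned children get re-parented to PID 1, so if PID 1 in the container never reaps, Chrome's dead helpers pile up exactly as in the `ps` output above.

```python
import os
import time

def reap_children():
    """Reap all terminated children without blocking; return their PIDs.

    This is the loop an init process (PID 1) runs, typically on SIGCHLD,
    so exited children do not linger as zombies.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:  # no children left at all
            break
        if pid == 0:  # children exist, but none have exited yet
            break
        reaped.append(pid)
    return reaped

if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(0)  # child exits immediately and becomes a zombie
    time.sleep(0.2)  # give the child time to terminate
    print(reap_children())  # the zombie is reaped here
```

An init like tini does essentially this in a loop, which is why adding one to the container makes the zombies disappear.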

To solve this problem, you could use base images designed to handle this, as suggested in the article above.

Another solution (which I actually prefer) is to use the --init flag in your docker-compose.yml file; see the Docker documentation.
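For a plain `docker run` (without Compose), the equivalent is the `--init` flag, which injects tini as PID 1 to reap zombie children (the image name below is illustrative):

```shell
# --init makes Docker run its bundled init (tini) as PID 1,
# which reaps any zombie children left behind by Chrome.
docker run --init -p 3000:3000 -p 8000:8000 my-botasaurus-image
```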

karazonanas avatar Sep 02 '24 14:09 karazonanas

How do I use the --init flag?

Chetan11-dev avatar Sep 03 '24 09:09 Chetan11-dev

If you're using a docker-compose.yml file, just add init: true to the service that runs botasaurus.

karazonanas avatar Sep 03 '24 11:09 karazonanas

Like this:

services:
  bot-1:
    init: true
    restart: "no"
    shm_size: 800m
    build:
      dockerfile: Dockerfile
      context: .
    volumes:
      - .:/app
    ports:
      - "3000:3000"
      - "8000:8000"

Chetan11-dev avatar Sep 03 '24 11:09 Chetan11-dev

exactly

karazonanas avatar Sep 03 '24 11:09 karazonanas

thanks

Chetan11-dev avatar Sep 03 '24 13:09 Chetan11-dev