[BUG]: Bulk Link Scraper not working.

Open amitshuklabag opened this issue 1 year ago • 2 comments

How are you running AnythingLLM?

Docker (remote machine)

What happened?

When I use the Bulk Link Scraper, it shows the error below:

app-anything-llm-1 | [collector] error: Failed to get page links from https://melbadigital.fi/. Error: Could not find Chrome (ver. 119.0.6045.105). This can occur if either
app-anything-llm-1 |  1. you did not perform an installation before running the script (e.g. npm install) or
app-anything-llm-1 |  2. your cache path is incorrectly configured (which is: /root/.cache/puppeteer).
app-anything-llm-1 | For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
app-anything-llm-1 |     at ChromeLauncher.resolveExecutablePath (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:260:27)
app-anything-llm-1 |     at ChromeLauncher.executablePath (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ChromeLauncher.js:197:25)
app-anything-llm-1 |     at ChromeLauncher.computeLaunchArguments (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ChromeLauncher.js:91:37)
app-anything-llm-1 |     at async ChromeLauncher.launch (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:53:28)
app-anything-llm-1 |     at async PuppeteerWebBaseLoader._scrape (/app/collector/node_modules/langchain/dist/document_loaders/web/puppeteer.cjs:42:25)
app-anything-llm-1 |     at async PuppeteerWebBaseLoader.load (/app/collector/node_modules/langchain/dist/document_loaders/web/puppeteer.cjs:74:22)
app-anything-llm-1 |     at async getPageLinks (/app/collector/utils/extensions/WebsiteDepth/index.js:51:18)
app-anything-llm-1 |     at async discoverLinks (/app/collector/utils/extensions/WebsiteDepth/index.js:22:22)
app-anything-llm-1 |     at async websiteScraper (/app/collector/utils/extensions/WebsiteDepth/index.js:151:25)
app-anything-llm-1 |     at async /app/collector/extensions/index.js:124:29

Are there known steps to reproduce?

Deploy the AnythingLLM using your docker image mintplexlabs/anythingllm with the latest or 1.1.1 tag.

Go to Data connection -> Bulk Link Scraper -> Website URL -> pass it https://melbadigital.fi/ -> click the button "Submit"

You will get the same error.

Here is my docker-compose:

version: "3.3"

services:
  anything-llm:
    user: 0:0
    image: mintplexlabs/anythingllm:latest
    restart: always
    cap_add:
      - SYS_ADMIN
    volumes:
      - "./docker.env:/app/server/.env"
      - "./storage:/app/server/storage"
      - "./collector/hotdir/:/app/collector/hotdir"
      - "./collector/outputs/:/app/collector/outputs"
    ports:
      - "172.17.0.1:52840:3001"
    env_file:
      - ./docker.env
    extra_hosts:
      - "xxxxxx:xxxxxxxxxxxx"

volumes:
  storage:
    driver: local
    driver_opts:
      type: none
      device: ${PWD}/storage
      o: bind

amitshuklabag avatar Oct 01 '24 07:10 amitshuklabag

  • "./collector/hotdir/:/app/collector/hotdir"
  • "./collector/outputs/:/app/collector/outputs"

Those mounts are not needed and could be messing with how the Puppeteer cache location is found.

Attempting to replicate:

  • docker pull mintplexlabs/anythingllm
export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm

Open a workspace, go to the data connectors, and run the Bulk Link Scraper against the website. I get the following:

[collector] info: Discovering links...
[collector] info: Found 20 links to scrape.
[collector] info: Starting bulk scraping...
[collector] info: Scraping 1/20: https://melbadigital.fi
[collector] info: Successfully scraped https://melbadigital.fi.
[collector] info: Scraping 2/20: https://melbadigital.fi/
.....

timothycarambat avatar Oct 01 '24 16:10 timothycarambat

Based on your suggestion, I tried it with docker-compose, but now I get a different error.

version: "3.8"
services:
  anything-llm:
    image: mintplexlabs/anythingllm
    restart: always
    cap_add:
      - SYS_ADMIN
    user: "0:0"   # Run as root to avoid permission issues
    volumes:
      - "./storage:/app/server/storage"     
      - "./docker.env:/app/server/.env"     
    environment:
      - STORAGE_DIR="/app/server/storage"   
    ports:
      - "172.17.0.1:3001:3001"

It returns the error: "Error: Response could not be completed"

But when I try it with

docker run -d -p 172.17.0.1:3001:3001 \
--cap-add SYS_ADMIN \
-v ./storage:/app/server/storage \
-v ./docker.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
--restart always \
mintplexlabs/anythingllm

it works fine.

Could you please check what is wrong in my docker-compose file, or try it with docker-compose yourself and send me a version that works?

amitshuklabag avatar Oct 03 '24 10:10 amitshuklabag

It is likely just this line in your compose:

user: "0:0"

You should not run the whole image as root. Chromium treats running as root as insecure, so Puppeteer fails to launch it and no websites can be scraped.
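
For reference, here is a minimal sketch of the compose file from the earlier comment with the `user: "0:0"` line removed. Everything else is copied from that file, except that `STORAGE_DIR` is written without inner quotes; that change is an assumption of this sketch and is not something discussed in this thread.

```yaml
version: "3.8"
services:
  anything-llm:
    image: mintplexlabs/anythingllm
    restart: always
    cap_add:
      - SYS_ADMIN
    # No `user:` override here, so the container runs as the image's default user
    volumes:
      - "./storage:/app/server/storage"
      - "./docker.env:/app/server/.env"
    environment:
      # Written without inner quotes in this sketch (an assumption, not from the thread),
      # since the list form passes quote characters through as part of the value
      - STORAGE_DIR=/app/server/storage
    ports:
      - "172.17.0.1:3001:3001"
```

If `./storage` or `./docker.env` were created on the host while the container ran as root, the default user may not be able to write to them. That is worth checking before concluding that the scraper is still broken.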

timothycarambat avatar Oct 03 '24 18:10 timothycarambat

Thanks for all of your help, it's working now.

amitshuklabag avatar Oct 04 '24 16:10 amitshuklabag