[BUG]: Bulk Link Scraper not working.
How are you running AnythingLLM?
Docker (remote machine)
What happened?
While using the Bulk Link Scraper, it shows the error below:
```
app-anything-llm-1 | [collector] error: Failed to get page links from https://melbadigital.fi/. Error: Could not find Chrome (ver. 119.0.6045.105). This can occur if either
app-anything-llm-1 | 1. you did not perform an installation before running the script (e.g. npm install) or
app-anything-llm-1 | 2. your cache path is incorrectly configured (which is: /root/.cache/puppeteer).
app-anything-llm-1 | For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
app-anything-llm-1 |     at ChromeLauncher.resolveExecutablePath (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:260:27)
app-anything-llm-1 |     at ChromeLauncher.executablePath (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ChromeLauncher.js:197:25)
app-anything-llm-1 |     at ChromeLauncher.computeLaunchArguments (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ChromeLauncher.js:91:37)
app-anything-llm-1 |     at async ChromeLauncher.launch (file:///app/collector/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:53:28)
app-anything-llm-1 |     at async PuppeteerWebBaseLoader._scrape (/app/collector/node_modules/langchain/dist/document_loaders/web/puppeteer.cjs:42:25)
app-anything-llm-1 |     at async PuppeteerWebBaseLoader.load (/app/collector/node_modules/langchain/dist/document_loaders/web/puppeteer.cjs:74:22)
app-anything-llm-1 |     at async getPageLinks (/app/collector/utils/extensions/WebsiteDepth/index.js:51:18)
app-anything-llm-1 |     at async discoverLinks (/app/collector/utils/extensions/WebsiteDepth/index.js:22:22)
app-anything-llm-1 |     at async websiteScraper (/app/collector/utils/extensions/WebsiteDepth/index.js:151:25)
app-anything-llm-1 |     at async /app/collector/extensions/index.js:124:29
```
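One quick check (a suggested diagnostic only, not part of the app; the container name is taken from the log prefix above) is to confirm whether any Chrome build actually exists at the cache path Puppeteer reports:

```sh
# Suggested diagnostic only: list Puppeteer's cache inside the running container
# to see whether the Chrome build it expects (119.0.6045.105) was ever installed there.
docker exec -it app-anything-llm-1 ls -la /root/.cache/puppeteer
```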
Are there known steps to reproduce?
1. Deploy AnythingLLM using the Docker image mintplexlabs/anythingllm with the latest or 1.1.1 tag.
2. Go to Data Connectors -> Bulk Link Scraper, enter https://melbadigital.fi/ as the Website URL, and click the "Submit" button.
3. You will get the same error.
Here is my docker-compose:

```yaml
version: "3.3"
services:
  anything-llm:
    user: 0:0
    image: mintplexlabs/anythingllm:latest
    restart: always
    cap_add:
      - SYS_ADMIN
    volumes:
      - "./docker.env:/app/server/.env"
      - "./storage:/app/server/storage"
      - "./collector/hotdir/:/app/collector/hotdir"
      - "./collector/outputs/:/app/collector/outputs"
    ports:
      - "172.17.0.1:52840:3001"
    env_file:
      - ./docker.env
    extra_hosts:
      - "xxxxxx:xxxxxxxxxxxx"
volumes:
  storage:
    driver: local
    driver_opts:
      type: none
      device: ${PWD}/storage
      o: bind
```
- "./collector/hotdir/:/app/collector/hotdir"
- "./collector/outputs/:/app/collector/outputs"
Those mounts are not needed and could be messing with how the Puppeteer cache location is found.
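As a rough sketch (paths copied from the compose above, everything else left unchanged), the service's volumes section could be trimmed to just:

```yaml
    # Sketch only: keep the server .env and storage mounts, drop the collector
    # hotdir/outputs mounts so the image's built-in collector paths stay intact.
    volumes:
      - "./docker.env:/app/server/.env"
      - "./storage:/app/server/storage"
```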
Attempting to replicate:
```sh
docker pull mintplexlabs/anythingllm

export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
  --cap-add SYS_ADMIN \
  -v ${STORAGE_LOCATION}:/app/server/storage \
  -v ${STORAGE_LOCATION}/.env:/app/server/.env \
  -e STORAGE_DIR="/app/server/storage" \
  mintplexlabs/anythingllm
```
Open a workspace, go to the data connector, run a bulk link scrape on the website, and get the following:
```
[collector] info: Discovering links...
[collector] info: Found 20 links to scrape.
[collector] info: Starting bulk scraping...
[collector] info: Scraping 1/20: https://melbadigital.fi
[collector] info: Successfully scraped https://melbadigital.fi.
[collector] info: Scraping 2/20: https://melbadigital.fi/
.....
```
Based on your suggestion, I tried it with docker-compose, but the issue is now different.
version: "3.8"
services:
anything-llm:
image: mintplexlabs/anythingllm
restart: always
cap_add:
- SYS_ADMIN
user: "0:0" # Run as root to avoid permission issues
volumes:
- "./storage:/app/server/storage"
- "./docker.env:/app/server/.env"
environment:
- STORAGE_DIR="/app/server/storage"
ports:
- "172.17.0.1:3001:3001"
It returns the error: "Error: Response could not be completed"
But when I try it with
```sh
docker run -d -p 172.17.0.1:3001:3001 \
  --cap-add SYS_ADMIN \
  -v ./storage:/app/server/storage \
  -v ./docker.env:/app/server/.env \
  -e STORAGE_DIR="/app/server/storage" \
  --restart always \
  mintplexlabs/anythingllm
```
it works fine.
So could you please check and let me know what was wrong with my docker-compose, or try it with a docker-compose file yourself and send me one that works?
It is likely just this line in your compose:
user: "0:0"
You should not be running the whole image as root, since this is insecure for chromium, thus fails to scrape any websites via puppeteer
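A minimal sketch of the same compose with that line removed (all other values taken from the file you shared above):

```yaml
# Sketch only, based on the compose shared above. Without `user: "0:0"` the
# container runs as the image's default user, so Chromium can launch for Puppeteer.
version: "3.8"
services:
  anything-llm:
    image: mintplexlabs/anythingllm
    restart: always
    cap_add:
      - SYS_ADMIN
    volumes:
      - "./storage:/app/server/storage"
      - "./docker.env:/app/server/.env"
    environment:
      - STORAGE_DIR="/app/server/storage"
    ports:
      - "172.17.0.1:3001:3001"
```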
Thanks for all of your help, it's working now.