website-evidence-collector
Increasing RAM usage and the tool never finishes.
Steps to reproduce:
Spawn a fresh Ubuntu 20.04 server (no GUI) VPS, install all the tools:
sudo apt update
sudo apt install nodejs -y
sudo apt install npm -y
sudo apt install jq -y
sudo apt install chromium-browser -y
export PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium-browser" # Fix the "browser not installed" bug, "stolen" from the Dockerfile
npm install --global https://github.com/EU-EDPS/website-evidence-collector/tarball/master
mkdir output_dir
website-evidence-collector --output output_dir/vincentcox.com --json --max 3 https://vincentcox.com --overwrite -- --no-sandbox # Fix the chrome sandbox issue, found somewhere in the issue tracker
It keeps running and it keeps eating resources:
(rip memory)
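While debugging a run that never finishes, it can help to cap its runtime so it cannot exhaust the VPS. The following is a minimal sketch, assuming GNU coreutils `timeout` is available (it is on a stock Ubuntu install); the helper name `run_bounded` is made up for illustration and is not part of website-evidence-collector.

```shell
# Hypothetical helper: run a command, but stop it after a time limit in seconds.
# `timeout` sends SIGTERM when the limit is reached and, because of `-k 10`,
# SIGKILL ten seconds later if the command is still alive.
# An exit status of 124 means the limit was hit.
run_bounded() {
  local limit="$1"
  shift
  timeout -k 10 "$limit" "$@"
}

# Example with the arguments from the report, capped at 5 minutes:
# run_bounded 300 website-evidence-collector --output output_dir/vincentcox.com \
#   --json --max 3 https://vincentcox.com --overwrite -- --no-sandbox
```

Checking `$?` afterwards distinguishes a run that finished on its own from one that was cut off. Leftover Chromium processes may still need to be checked by hand.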
Note that I am using the latest version from GitHub, so a recent change there might have broken it. But as explained in this issue (https://github.com/EU-EDPS/website-evidence-collector/issues/41), I cannot access the official download link of the stable version.
Do you have the same behavior when using chromium bundled with the puppeteer node package?
How can I use the puppeteer node package? (Sorry, I have little experience with Node.js.)
I installed the latest stable version (mentioned in your reply in my previous issue), it's the same issue.
I have the same in docker, which is using the puppeteer node package.
I removed the versions in the Dockerfile to get it working:
RUN apk add --no-cache \
chromium \
nss \
freetype \
freetype-dev \
harfbuzz \
ca-certificates \
ttf-freefont \
nodejs \
yarn \
It works for you now? Could you prepare a pull request then to help others?
Could you find out which versions you are using instead? I think I decided to pin the version numbers to have a more reproducible setup, which is important for auditing.
Sorry, my previous answer was not clear. I am trying things out, but they all break when I test them on my website (it also happens with a client's website, but I don't want to share that one, and my own website is a good "test" example). What I meant is that I tried Docker (using a modified Dockerfile to get it to build), but got the same bug.
In your example you have included --max 3, hence you also scan some other random pages of the same website. Can you please check whether you still see the same behaviour with only one page? I would then try to reproduce your problem.
It's unfortunately the same (when using the installed version in my initial post but with --max 1). I gave up on docker because I get this error:
error An unexpected error occurred: "EACCES: permission denied, scandir '/opt/website-evidence-collector/output/browser-profile'".
@rriemann-eu if you need more info to debug let me know!
So when I execute the following two commands, I do not get any error.
website-evidence-collector --output output_dir/vincentcox.com --json --max 1 https://vincentcox.com
website-evidence-collector --output output_dir/vincentcox.com2 --json --max 1 https://vincentcox.com -- --no-sandbox
I am using the latest version from master on openSUSE. From the inspection.yml:
script:
  host: mars.fritz.box
  version:
    npm: 0.4.0
    commit: v0.4.0-70-ga956e2d
  cmd_args: '--output output_dir/vincentcox.com --json --max 1 https://vincentcox.com'
  environment: {}
  node_version: v10.22.1
browser:
  name: Chromium
  version: HeadlessChrome/80.0.3987.0
  user_agent: >-
    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)
    Chrome/72.0.3617.0 Safari/537.36
  platform:
    name: Linux
    version: 5.8.14-1-default
  extra_headers: {}
  preset_cookies: {}
start_time: 2020-11-24T11:30:47.957Z
end_time: 2020-11-24T11:31:00.650Z
Does your problem occur with all websites?
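Since the browser version is one of the few visible differences between the working and the failing setups, it may be useful to read it from each run's inspection.yml. This is a rough sketch: the function name `browser_version` is made up, and the file path in the usage comment assumes inspection.yml sits in the directory given to --output.

```shell
# Hypothetical helper: print the headless browser version recorded in an
# inspection.yml that is read from standard input.
# It matches the "HeadlessChrome/<digits and dots>" token and prints the first hit.
browser_version() {
  grep -o 'HeadlessChrome/[0-9.]*' | head -n 1
}

# Usage (path is an assumption):
# browser_version < output_dir/vincentcox.com/inspection.yml
```

Running this on a working and a non-working machine gives two version strings that can be compared directly.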
Hmmm, it might be something with my installation then. I'll go with Docker to avoid further mistakes and debugging time on your side. However, the Dockerfile in the repo doesn't build anymore.
When I try to build it, I get this error:
root@client-testvm:~/test/website-evidence-collector# docker build -t website-evidence-collector .
Sending build context to Docker daemon 3.995MB
Step 1/16 : FROM alpine:edge
---> 003bcf045729
Step 2/16 : LABEL maintainer="Robert Riemann <[email protected]>"
---> Using cache
---> f5d20c7a4860
Step 3/16 : LABEL org.label-schema.description="Website Evidence Collector running in a tiny Alpine Docker container" org.label-schema.name="website-evidence-collector" org.label-schema.usage="https://github.com/EU-EDPS/website-evidence-collector/blob/master/README.md" org.label-schema.vcs-url="https://github.com/EU-EDPS/website-evidence-collector" org.label-schema.vendor="European Data Protection Supervisor (EDPS)" org.label-schema.license="EUPL-1.2"
---> Using cache
---> 16ece18d66c6
Step 4/16 : RUN apk add --no-cache chromium~=80.0.3987 nss freetype freetype-dev harfbuzz ca-certificates ttf-freefont nodejs yarn~=1.22.4 bash procps drill coreutils libidn curl parallel jq grep aha
---> Running in 5ca2fe0d3cde
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
ERROR: unsatisfiable constraints:
chromium-86.0.4240.111-r0:
breaks: world[chromium~80.0.3987]
yarn-1.22.10-r0:
breaks: world[yarn~1.22.4]
The command '/bin/sh -c apk add --no-cache chromium~=80.0.3987 nss freetype freetype-dev harfbuzz ca-certificates ttf-freefont nodejs yarn~=1.22.4 bash procps drill coreutils libidn curl parallel jq grep aha' returned a non-zero code: 2
I think this error is caused by the behaviour described in https://superuser.com/a/1486407/1039133:
Unfortunately, Alpine-Linux Package Management drops older packages when there are newer versions available. This makes it hard to use Alpine Linux with docker since you want a reproducible image with exact versions.
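The build error above can be read as follows: the repository now only offers chromium 86.0.4240.111-r0 and yarn 1.22.10-r0, and neither falls under the pinned prefixes 80.0.3987 and 1.22.4. The sketch below illustrates that check with a made-up helper, `matches_pin`; it only approximates apk's fuzzy `~=` constraint as a prefix comparison and is not how apk itself is implemented.

```shell
# Hypothetical helper: succeed (exit 0) if an available package version falls
# under a prefix-style pin, fail (exit 1) otherwise.
# Approximation for illustration only: the version must equal the pin or
# continue it with "." or "-" (for example a patch level or an -r0 suffix).
matches_pin() {
  local available="$1" pin="$2"
  case "$available" in
    "$pin" | "$pin".* | "$pin"-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Versions taken from the build log above:
# matches_pin 86.0.4240.111-r0 80.0.3987   # fails: pin cannot be satisfied
# matches_pin 1.22.10-r0 1.22.4            # fails: pin cannot be satisfied
```

Because the old versions are no longer in the index, no choice of mirror flags makes these pins resolvable; the pins have to be updated or dropped.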
OK, so I will close this one until we know how to reproduce your problem on other systems. I will open a new issue on the docker problem, which deserves a solution.
Good idea, feel free to tag me in this!
I can confirm this on docker:
It takes a lot of time and keeps using more and more RAM.
docker run --rm -it --cap-add=SYS_ADMIN -v $(pwd)/output:/output website-evidence-collector https://vincentcox.com --overwrite
Output of top:
top - 14:06:18 up 24 days, 1:59, 2 users, load average: 2.61, 1.76, 0.79
Tasks: 121 total, 1 running, 119 sleeping, 0 stopped, 1 zombie
%Cpu(s): 60.1 us, 30.6 sy, 0.0 ni, 7.7 id, 0.0 wa, 0.0 hi, 0.0 si, 1.7 st
MiB Mem : 1994.0 total, 109.3 free, 1602.3 used, 282.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 169.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4006 ubuntu 20 0 445648 43844 28176 S 95.3 2.1 4:29.16 chrome
4052 ubuntu 20 0 5504516 892576 50024 S 78.7 43.7 4:05.24 chrome
4047 ubuntu 20 0 358264 52200 26516 S 8.3 2.6 0:27.17 chrome
316 root 20 0 14804 4364 1408 S 0.3 0.2 114:20.20 docker-gen
411 root 20 0 10988 3396 2880 R 0.3 0.2 0:00.37 top
1 root 20 0 169324 10212 5544 S 0.0 0.5 1:44.17 systemd
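To track the growth over time instead of reading single snapshots, the resident memory of all chrome processes can be added up. A minimal sketch, assuming the procps `ps` from the Ubuntu host and a process name of `chrome` as in the listing above; `sum_rss_kib` is a made-up helper name.

```shell
# Hypothetical helper: add up resident set sizes (one KiB value per line on
# standard input) and print the total. Prints 0 for empty input.
sum_rss_kib() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Usage: sample the total every 5 seconds while the collector is running.
# while sleep 5; do
#   ps -C chrome -o rss= | sum_rss_kib
# done
```

For the three chrome processes in the snapshot above (43844, 892576 and 52200 KiB) the total is 988620 KiB, which is close to half of the machine's 1994 MiB.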
As I do not have this problem on my local computer without Docker, I can imagine that it somehow depends on the Chromium version that is used. Maybe newer Chromium versions behave differently from the version HeadlessChrome/80.0.3987.0 that I use on my local system.
Yeah, the thing is: if it happened only on my machine and not in Docker, it would be something on my side. But even in Docker I get the same issue.
With chromium 77.0.3865 (as used in this working dockerfile), it works for me.
Maybe this issue is not even in the scope of this project, but a Chromium issue itself. For me it's okay if you guys close it, but keep in mind that other people might face the same issue (in Docker or with a system-wide installation). Maybe my website is quite heavy to parse, but it's a standard WordPress website, so I think chances are high that other people will run into the same situation.