
Out of disk space despite having enough disk space

pato-pan opened this issue 7 months ago · 2 comments

docker run -p 9037:9037 -v /drives/HD/Games/Browsertrix/:/crawls/ webrecorder/browsertrix-crawler crawl --url "https://html-classic.itch.zone/html/8940552/index.html" --generateWACZ --text --collection Games
docker run -p 9037:9037 -v /drives/HD/Games/Browsertrix/:/crawls/ webrecorder/browsertrix-crawler crawl --url "https://html-classic.itch.zone/html/8940552/index.html" --generateWACZ --text --collection Games --diskUtilization 0
docker run -p 9037:9037 -v /drives/HD/Games/Browsertrix/:/crawls/ webrecorder/browsertrix-crawler crawl --url "https://html-classic.itch.zone/html/8940552/index.html"
{"timestamp":"2025-07-07T13:18:17.072Z","logLevel":"fatal","context":"general","message":"Out of disk space, exiting. Quitting","details":{}}

I don't know how to get more information on this. I tried

--logging stats,jserrors,debug and --logLevel debug, along with --context set to each of "general", "worker", "recorder", "recorderNetwork", "writer", "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behaviorScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"

But that doesn't show any additional info in the terminal, and it sends output to a log file that doesn't exist on my computer. A lot of my disk is in use, but I still have plenty of free space:

df -h
Filesystem              Size  Used Avail Use%
/dev/sda1                13T   13T  178G  99%

How can I override this disk space limit? --diskUtilization 0 didn't work

pato-pan avatar Jul 07 '25 13:07 pato-pan

Hi @pato-pan , on the latest Browsertrix Crawler releases (since 1.6.3), the disk utilization check should be disabled by default.

It looks like you're hitting a related but separate check during the crawler's bootstrap phase, which verifies that the disk has space before starting. That method currently fails the crawl if the disk utilization reported by df is at 99% or above, which is why it is failing in your case.

That implementation may be a bit too naive for a case like yours, where the remaining 1% of the disk still amounts to a substantial amount of free storage.
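To illustrate why a percentage threshold behaves this way, here is a rough sketch of a df-based check (this is an illustration, not the actual crawler source; parseDfUsePercent and diskLooksFull are hypothetical names):

```typescript
// Hypothetical sketch of a df-based bootstrap check, not the crawler's code:
// parse the Use% column from `df` output and refuse to crawl at >= 99%.
function parseDfUsePercent(dfOutput: string): number {
  // Take the last data line and pull the field ending in '%'
  const lines = dfOutput.trim().split("\n");
  const fields = lines[lines.length - 1].trim().split(/\s+/);
  const useField = fields.find((f) => f.endsWith("%"));
  if (!useField) {
    throw new Error("could not parse df output");
  }
  return parseInt(useField, 10);
}

function diskLooksFull(dfOutput: string): boolean {
  return parseDfUsePercent(dfOutput) >= 99;
}

// The df output from the issue above:
const sample = `Filesystem              Size  Used Avail Use%
/dev/sda1                13T   13T  178G  99%`;
console.log(diskLooksFull(sample)); // true: the crawl would be refused
```

With 178 GB still free, the disk reads as 99% used and trips the threshold, even though there is more than enough room for a crawl.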

tw4l avatar Jul 07 '25 14:07 tw4l

> Hi @pato-pan , on the latest Browsertrix Crawler releases (since 1.6.3), the disk utilization check should be disabled by default.
>
> It looks like you're hitting a related but different check during the crawler's bootstrap phase, where it checks to ensure the disk has space before starting. It looks like that method currently checks to see if disk utilization reported by df is at 99% or above and fails the crawl if so, which is why it is failing in your case.
>
> That implementation may be a bit too naive given the possibility of a case like yours where the remaining 1% of space actually still contains quite a lot of storage to use.

Thanks for figuring out why this is happening. Hopefully a better implementation for determining that the disk is full can be used. I don't know anything about TypeScript, so I don't have code suggestions, but I believe it should instead check how many bytes are actually free, say requiring 1 GB of free disk space. It would also help to include the amount of free space (or the limit) in the error message. I was struggling earlier because I thought Docker or Browsertrix believed my disk was full, when it actually didn't have permissions on my folder. Perhaps something like "if a write fails, then treat the disk as full."
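The bytes-based check suggested above could be sketched roughly like this, using Node's statfsSync (available since Node 18.15). The function names and the 1 GiB threshold are assumptions for illustration, not a patch that exists upstream:

```typescript
import { statfsSync } from "node:fs";

// Hypothetical alternative check: require a minimum number of free bytes
// on the crawls volume instead of failing at a percentage threshold.
const MIN_FREE_BYTES = 1024 ** 3; // 1 GiB, an assumed threshold

function freeBytes(path: string): number {
  const stats = statfsSync(path);
  // bavail = blocks available to unprivileged users; bsize = block size
  return stats.bavail * stats.bsize;
}

function checkDiskSpace(path: string, minFree: number = MIN_FREE_BYTES): void {
  const free = freeBytes(path);
  if (free < minFree) {
    // Include the actual numbers in the error, as requested in the issue
    throw new Error(
      `Out of disk space: ${free} bytes free on ${path}, ` +
        `need at least ${minFree}`,
    );
  }
}
```

Under this scheme, the disk in the issue (178 GB free at 99% utilization) would pass easily, and when the check does fail, the error message reports both the free space and the limit.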

pato-pan avatar Jul 07 '25 16:07 pato-pan