
Running out of disk space

Open alsakhaev opened this issue 4 years ago • 57 comments

Summary

We have run into a number of problems running our bee node, which serves the Swarm downloader (our hackathon project).

  1. We are running a bee node at https://swarm.dapplets.org
  2. The node takes up all available space on the HDD and then starts rejecting the files we upload: waiting for the swarm hash either fails immediately or takes far too long. We set db-capacity: 2621440 chunks (approx. 10 GB), leaving about 5 GB of free space, but the disk still gets fully consumed.
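
For reference, the back-of-the-envelope math behind that figure, assuming the nominal 4 KiB Swarm chunk size (actual on-disk usage will be higher because of database and index overhead):

# db-capacity is counted in chunks; a chunk payload is at most 4096 bytes
echo $((2621440 * 4096))   # 10737418240 bytes, i.e. roughly 10 GiB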

Steps to reproduce

  1. Created a VPS in Hetzner (CX11: 1 vCPU, 2 GB RAM, 20 GB disk) with Ubuntu 20.04.2 LTS
  2. Installed Bee via wget https://github.com/ethersphere/bee/releases/download/v0.5.0/bee_0.5.0_amd64.deb && sudo dpkg -i bee_0.5.0_amd64.deb
  3. Configured it as in the config below
  4. Installed the nginx web server and configured a reverse proxy from https://swarm.dapplets.org to http://localhost:1633 with a Let's Encrypt SSL certificate
  5. Uploaded files to the node via POST https://swarm.dapplets.org/files/ (see the example after this list)
  6. After a while, disk space runs out
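
A minimal example of step 5 against the bee 0.5.x HTTP API (the filename is a placeholder; later bee versions use a different endpoint and require a postage batch):

curl -X POST \
  -H "Content-Type: application/octet-stream" \
  --data-binary @example.bin \
  https://swarm.dapplets.org/files/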

Expected behavior

I expect to see 5 GB of free space :)

Actual behavior

  1. Disk space runs out
  2. The log fills with errors about it
  3. Files can no longer be uploaded; the node responds with HTTP 500 Internal Server Error

Config /etc/bee/bee.yaml

Uncommented lines from the config file:

api-addr: 127.0.0.1:1633
clef-signer-endpoint: /var/lib/bee-clef/clef.ipc
config: /etc/bee/bee.yaml
data-dir: /var/lib/bee
db-capacity: 2621440
gateway-mode: true
password-file: /var/lib/bee/password
swap-enable: true
swap-endpoint: https://rpc.slock.it/goerli

alsakhaev avatar Feb 15 '21 16:02 alsakhaev

Thank you for reporting the bug! We will have a look into it shortly

Eknir avatar Feb 20 '21 12:02 Eknir

Tangential suggestion: it should be fairly easy to compute a rough estimate of the maximum possible disk usage from db-capacity and compare it to the disk space actually available. If the two are grossly out of whack, bee should log a suggested db-capacity value that would not exceed the available disk space.
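
Something like the following, as a rough userspace sketch of that check (assuming the nominal 4 KiB chunk size and the default data-dir; a real implementation inside bee would also account for index overhead):

DB_CAPACITY=2621440    # chunks, taken from bee.yaml
CHUNK_SIZE=4096        # bytes, nominal chunk payload size
DATA_DIR=/var/lib/bee

estimated=$((DB_CAPACITY * CHUNK_SIZE))
available=$(df -B1 --output=avail "$DATA_DIR" | tail -n1 | tr -d ' ')

if [ "$estimated" -gt "$available" ]; then
    # suggest a db-capacity that would fit into the space that is actually free
    echo "db-capacity too large for this disk; try about $((available / CHUNK_SIZE)) chunks"
fi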

jpritikin avatar Feb 21 '21 13:02 jpritikin

I can reproduce this problem. It looks like disk-space accounting does not include uploaded files. When I restart the node, a ton of disk space is immediately freed up as db-capacity is re-applied.

jpritikin avatar Feb 22 '21 11:02 jpritikin

Hm, bee isn't releasing all of the disk space even after a restart:

root@salvia /o/bee# grep db-cap /etc/bee/bee.yaml
db-capacity: 5000000
root@salvia /o/bee# ls
keys/  localstore/  password  statestore/
root@salvia /o/bee# du -h -s .
111G	.

jpritikin avatar Feb 22 '21 14:02 jpritikin

  • Can you say if at any point your db capacity was set above 5mil?
  • Did you play around with the size? Bee will not garbage collect your uploaded content before it is fully synced. You can track progress of your uploads with the tags API.
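
Roughly, that flow looks like this against the local API (header and field names have changed between bee versions, so treat this as a sketch rather than the exact calls):

# create a tag, attach it to an upload, then poll it until syncing finishes
TAG=$(curl -s -X POST http://localhost:1633/tags | jq .uid)
curl -s -X POST -H "swarm-tag: $TAG" -H "Content-Type: application/octet-stream" \
     --data-binary @example.bin http://localhost:1633/files/
curl -s http://localhost:1633/tags/$TAG | jq   # repeat until the synced count stops growing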

Please try to give as much information as to what you have done prior to this problem surfacing. I'm trying to reproduce this but so far no luck.

acud avatar Feb 25 '21 16:02 acud

Can you say if at any point your db capacity was set above 5mil?

Yes, I tried 10mil. Once I realized that disk space management wasn't working then I reduced back to 5mil.

Did you play around with the size?

On one node, I probably uploaded faster than syncing could keep up. For example, maybe I uploaded 30 GB of data to the node very quickly and then waited for it to sync.

I'm trying to reproduce this but so far no luck.

If you can provide some guidance about how to avoid triggering the issue, that would also help. I gather that I shouldn't mess with the db-capacity setting. Also, I should not upload too fast?

I was trying to find where the limits were, to help with testing, but I am content to play within expected user behavior too.

I'm curious to hear from @alsakhaev too

jpritikin avatar Feb 25 '21 16:02 jpritikin

@Eknir @acud

message from bee-support

mfw78: I've found that none of the 3 containers I've run respect the db-capacity limit.

sig: are you uploading any data to them?

mfw78: No

significance avatar Mar 16 '21 09:03 significance

+1: started a node on a Raspberry Pi with a 32 GB SD card; it ran out of disk space after 10 hours

RealEpikur avatar Mar 16 '21 20:03 RealEpikur

+1: I have set up Docker-based nodes, and all of their localstores have easily surpassed the db-capacity limit; they now use between 30 GB and 40 GB each

ronald72-gh avatar Mar 16 '21 21:03 ronald72-gh

+1: Running multiple bees in Kubernetes containers. Each bee exhausts its disk space allocation (doubling the db capacity has no effect besides consuming more space, which it then also exceeds).

mfw78 avatar Mar 16 '21 23:03 mfw78

Thanks, all, for the comments and reports. We are releasing soon and have included several improvements that aim to address this issue. We would greatly appreciate it if you could try it out and report back here.

Eknir avatar Mar 23 '21 08:03 Eknir

I can confirm that with 0.5.3 the db-capacity seems to be respected more; the 6 nodes I'm running show the following disk usage: 28G / 21G / 28G / 28G / 29G / 27G

mfw78 avatar Mar 29 '21 03:03 mfw78

This issue can be reliably reproduced on a Raspberry Pi.

Eknir avatar Apr 06 '21 07:04 Eknir

(Screenshot attached, 2021-04-08.) I am running 0.5.3 with the default db-capacity. I can see that bee is garbage collecting at the same time as it is consuming more space. Once garbage collection falls behind and disk usage reaches 100%, nothing works anymore: the log keeps reporting "no space left" and garbage collection also stops working.

luowenw avatar Apr 08 '21 05:04 luowenw

@zelig @acud you guys are working on this as part of the postage stamps? Shall I assign this issue to the current sprint?

Eknir avatar Apr 13 '21 10:04 Eknir

The bug has a severe impact on the entire network, because people are simply purging their nodes' localstores, causing data loss. There is no way to release bee without killing this bug.

ethernian avatar Apr 18 '21 18:04 ethernian

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Nov 30 '21 01:11 github-actions[bot]

Any news on this? The issue is still there. I'm using the default disk-space configuration (BEE_CACHE_CAPACITY=1000000), which should be ~4 GB, but this is my disk-usage graph.

(Screenshot: disk-usage graph, 2022-01-22.)

I didn't perform any uploads on the node. This is a VERY important issue to fix.

tmm360 avatar Jan 22 '22 20:01 tmm360

It should be resolved with the latest release. However, the problem is multi-tiered, so shipping a database migration to fix a problem that is already exacerbated on some nodes was not trivial. If you db nuke your node and allow it to resync, the problem should be resolved.
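
For anyone unsure what that involves in practice, roughly this (assuming a systemd packaged install with the default data-dir; the db nuke subcommand exists in recent bee releases and should leave your keys in place, but back them up first):

sudo systemctl stop bee
bee db nuke --data-dir /var/lib/bee   # wipes the local chunk and state stores
sudo systemctl start bee              # the node then resyncs from the network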

acud avatar Feb 16 '22 16:02 acud

Any plans to publish guidance on this? In particular, how to detect if the issue exists within a node so that we don't just start nuking everything and dropping retrievability on chunks already stored in the swarm.

ldeffenb avatar Feb 16 '22 17:02 ldeffenb

I've db nuked two of my nodes; let's see how it evolves.

tmm360 avatar Feb 17 '22 00:02 tmm360

@ldeffenb @tmm360 do you still experience this issue?

agazso avatar Mar 22 '22 15:03 agazso

Disk usage seems stable and not growing. Yesterday I installed bee 1.5.0-dda5606e. The sharky migration finished, but disk consumption doubled. How can I delete the old database and expire the blocks that shouldn't be stored?

jpritikin avatar Mar 22 '22 16:03 jpritikin

I deleted the old database using bee nuke. Two weeks ago, disk usage was back to zero. As I write, disk usage is back up to 30GiB.

jpritikin avatar Apr 06 '22 19:04 jpritikin

Good amount of traffic today. Disk usage is up to 34.8GiB.

jpritikin avatar Apr 06 '22 21:04 jpritikin

I upgraded to 1.5.1 today. Disk usage is up to 47.9GiB. At this rate, I'll have to nuke my db again in a few weeks.

jpritikin avatar Apr 07 '22 20:04 jpritikin

If you run the following command, substituting the proper IP and debug port, what value is displayed? It should be 2 or 3 on testnet and 8 or 9 on mainnet.

curl http://127.0.0.1:1635/topology | jq .depth

ldeffenb avatar Apr 07 '22 20:04 ldeffenb

I'm on mainnet. Currently it says 6

jpritikin avatar Apr 07 '22 20:04 jpritikin

Are you sure you have inbound connections open and forwarded to your p2p-addr (default 1634)? With a depth of only 6, it seems that you may not be receiving inbound connections. A shallower depth may cause your node to believe it needs to store more chunks as the neighborhood is larger.
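
Two quick checks, assuming the default ports (the peer list lives on the debug API, and its exact shape may vary between versions):

# from another machine: is the p2p port reachable from outside?
nc -vz your.public.ip.here 1634

# on the node: how many peers is bee currently connected to?
curl -s http://127.0.0.1:1635/peers | jq '.peers | length'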

ldeffenb avatar Apr 07 '22 20:04 ldeffenb