Allow setting a database size limit, suspend DHT crawler if exceeded
- [x] I have checked the existing issues to avoid duplicates
- [x] I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue
Is your feature request related to a problem? Please describe
Since the database size will tend to grow indefinitely as more and more torrents are indexed (does it ever stop? once you've indexed the whole DHT? :thinking:), I would like to set a limit in the configuration file, after which bitmagnet would stop crawling and just emit a message in the logs on start/every few minutes.
Describe the solution you'd like
```yaml
dht_crawler:
  db_size_limit: 53687091200 # 50GiB in bytes, could also use a human-friendly format like 50G
```
```
[INFO] DHT crawler suspended, configured maximum database size of 53687091200 bytes reached
```
Describe alternatives you've considered
Manually disabling the DHT crawler component (`--keys=http_server`) in my systemd `.service` file once I get low on disk space. However, a configuration setting would be more "set-and-forget" and could prevent running low on space in the first place (after which manual trimming of the database would be needed to reclaim some disk space, I guess).
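Concretely, that means something like this in the unit file (sketch only; the binary path is just an example):

```ini
# /etc/systemd/system/bitmagnet.service.d/override.conf (example path)
[Service]
ExecStart=
ExecStart=/usr/local/bin/bitmagnet worker run --keys=http_server
```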
Additional context
Related to https://github.com/bitmagnet-io/bitmagnet/issues/186 and https://github.com/bitmagnet-io/bitmagnet/issues/70
You should get an aggressive notification somewhere that you've hit such a catastrophic failure; I guess that could be done in Grafana as a default.
> catastrophic failure

It's not; it's normal operation (respecting configured limits). This deserves an INFO-level log message, at worst a WARNING.
It should absolutely be a warning, as that's a serious problem! If you run out of space, there's potential for logging to fail, among other "catastrophic" things.
We misunderstand each other.
Hitting the maximum configured database size is fine; it needs an INFO message so that the user/admin is not left wondering "why is it not indexing anymore?".
Running out of disk space is bad, but it's not bitmagnet's job to warn you about that; that's a job for your monitoring software. Anyway, setting a DB size limit in bitmagnet would help prevent this.
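For what it's worth, the monitoring side is just a generic low-disk alert; e.g. a Prometheus rule along these lines (assuming node_exporter; the mountpoint and threshold are just examples):

```yaml
# Sketch of a generic low-disk alert; the metric names are node_exporter's,
# the mountpoint and threshold are just examples
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"} < 0.10
  for: 15m
  labels:
    severity: warning
```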
Rather than just suspending the DHT crawler, it would be nice to instead start purging old (by some definition of "old") resources from the collection. Assuming this is technically feasible, it would be similar to a rolling log, which can only grow so large before old messages/files are removed. Not to mention, I would expect there to be an inverse relationship between the age of an item in the DHT and the health of the resource.
Just suspending the DHT crawler is fine. Let's not overcomplicate this
> old messages/files are removed

I don't expect the application to start losing data without manual intervention. As I said:
> manual trimming of the database would be needed to reclaim some disk space, I guess

Database cleanup possibilities can be discussed in another issue.
Could this be done by separating the dht_crawler worker into a separate Docker service, with a custom healthcheck that fails if the size limit is exceeded?
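Something along these lines maybe (untested sketch; the image, worker key and credentials handling are guesses, psql isn't in the bitmagnet image by default, and a failing healthcheck would still need autoheal or an external script acting on it, since Compose won't stop the service by itself):

```yaml
# docker-compose sketch (untested)
bitmagnet-dht-crawler:
  image: ghcr.io/bitmagnet-io/bitmagnet:latest
  command: worker run --keys=dht_crawler
  healthcheck:
    # reports unhealthy once the bitmagnet database exceeds 50 GiB
    test: >-
      psql -h postgres -U postgres -tAc
      "SELECT pg_database_size('bitmagnet') < 53687091200" | grep -q t
    interval: 5m
```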
> separate Docker service

I don't use Docker, I run the binary as a systemd service (see the ansible role here). I could hack together a script that checks the db size and restarts bitmagnet without the dht_crawler if a certain size is exceeded, but... I think the check should be in the application rather than depend on an external mechanism, which feels like a hack.
I guess my concern is that any internal behaviour around this could get complex. It isn't currently supported to start/stop/restart individual workers without stopping the whole process - and I don't know if it needs to be, given that some external tool (Docker, Ansible, whatever) would be capable of doing this. Database size can go down as well as up, and having this behaviour trigger during something like a migration, where disk usage spikes and then is recovered, could have unintended effects.
> a configuration setting would be more "set-and-forget"

I don't know that this could ever be a "set and forget" thing, as the worker won't be able to either resolve the disk space issue or restart itself, so it will require intervention unless you intend it to be stopped forever once a threshold is reached.
> Running out of disk space is bad, but it's not bitmagnet's job to warn you about that; that's a job for your monitoring software.

I agree with this - I think some monitoring software (or script) could also capably handle the stopping of the process?
> I could hack together a script that checks the db size and restarts bitmagnet without the dht_crawler if a certain size is exceeded

If going down this route I'd probably separate the crawler to its own process that can be started/stopped independently.
I think I'd need convincing that this use case is sufficiently in demand, and that external orchestration would be sub-optimal, to justify implementing something internally (even after other upcoming disk space mitigations have been implemented, see below).
As an aside, I'm currently building a custom workflow feature that (among other things) would allow you to auto-delete any torrent based on custom rules. It's not the feature you've described here but it will certainly help with keeping the DB size more manageable by deleting torrents that are stale or that you're not interested in.
> It isn't currently supported to start/stop/restart individual workers without stopping the whole process

Thanks, that limits our possibilities indeed.
> I think some monitoring software (or script) could also capably handle the stopping of the process? If going down this route I'd probably separate the crawler to its own process that can be started/stopped independently.

I will try to get this working and post the results.
Although I still argue that there should be some way to suspend the DHT crawler (possibly without actually stopping the worker, just having it idle and polling the db size every 60s...), given bitmagnet's tendency to eat up disk space indefinitely (other services with a similar behavior, such as Elasticsearch, will actively avoid storing new data once a certain disk usage threshold is reached).
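For the record, the external watchdog I have in mind is roughly this (untested sketch; the database name, unit names and limit are just examples, and the second unit would run bitmagnet with `--keys=http_server`):

```bash
#!/usr/bin/env bash
# Untested sketch: run from a systemd timer or cron, suspend crawling when the DB grows too big.
set -euo pipefail

LIMIT=53687091200  # 50 GiB
SIZE=$(psql -U bitmagnet -d bitmagnet -tAc "SELECT pg_database_size('bitmagnet')")

if [ "$SIZE" -ge "$LIMIT" ]; then
  echo "bitmagnet DB is ${SIZE} bytes, over the ${LIMIT} byte limit; restarting without the DHT crawler"
  systemctl stop bitmagnet.service
  systemctl start bitmagnet-no-crawler.service  # hypothetical unit using --keys=http_server
fi
```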
> > It isn't currently supported to start/stop/restart individual workers without stopping the whole process
>
> Thanks, that limits our possibilities indeed.

Not to say it never could; I just think we'd need clear use cases and well-defined behaviour. I like the model of running individual workers that can simply be terminated if required for any externally determined reason, unless there's a good reason not to do it this way (partly because, for the moment at least, it keeps things simple).