torrust-tracker
Tracker restarted when memory is full
Relates to: https://github.com/torrust/torrust-tracker/issues/567
We are running the live demo on a droplet with 1 GB of RAM.
The tracker's health check endpoint is https://tracker.torrust-demo.com/health_check.
On the 29th of December I changed the tracker configuration from
remove_peerless_torrents = true
To:
remove_peerless_torrents = false
With this change, the tracker does not clean up torrents without peers. That makes the data structure containing the torrent data grow indefinitely.
That makes the process run out of memory approximately every 5 hours (see the memory graphs below).
I guess the tracker container crashes and is restarted; otherwise, I suppose the process would simply die.
We should limit the memory usage. This could happen even with the option remove_peerless_torrents enabled.
cc @WarmBeer @da2ce7
[Memory usage graphs from the droplet: last 24 hours and last 7 days]
Hi @WarmBeer, I've been thinking about this problem and I want to share some thoughts.
Is the tracker actually crashing?
The tracker was restarted in the demo environment, but I suppose it was because of the container healthcheck. We are using this HEALTHCHECK instruction:
HEALTHCHECK --interval=5s --timeout=5s --start-period=3s --retries=3 \
CMD /usr/bin/http_health_check http://localhost:${HEALTH_CHECK_API_PORT}/health_check \
|| exit 1
And this compose configuration:
tracker:
  image: torrust/tracker:develop
  container_name: tracker
  tty: true
  restart: unless-stopped
  environment:
    - USER_ID=${USER_ID}
    - TORRUST_TRACKER_DATABASE=${TORRUST_TRACKER_DATABASE:-sqlite3}
    - TORRUST_TRACKER_DATABASE_DRIVER=${TORRUST_TRACKER_DATABASE_DRIVER:-sqlite3}
    - TORRUST_TRACKER_API_ADMIN_TOKEN=${TORRUST_TRACKER_API_ADMIN_TOKEN:-MyAccessToken}
  networks:
    - backend_network
  ports:
    - 6969:6969/udp
    - 7070:7070
    - 1212:1212
  volumes:
    - ./storage/tracker/lib:/var/lib/torrust/tracker:Z
    - ./storage/tracker/log:/var/log/torrust/tracker:Z
    - ./storage/tracker/etc:/etc/torrust/tracker:Z
  logging:
    options:
      max-size: "10m"
      max-file: "10"
Notice the restart attribute. See https://github.com/compose-spec/compose-spec/blob/master/spec.md#restart
It might be that the tracker keeps allocating more memory (using swap) without panicking, as you mentioned yesterday in the meeting. It could be the case that the container becomes unhealthy and is then restarted due to the compose configuration.
I have not checked it, but you can verify it with Docker. You can run the tracker with a memory limit:
docker run -it -m 500m torrust/tracker:develop
docker exec -it CONTAINER_ID /bin/sh
# free
              total        used        free      shared  buff/cache   available
Mem:       64929416    10815272    10446528     1200488    43667616    52189600
Swap:       8388604           0     8388604
NOTE: I do not know why the free command shows 64929416 KiB (~61.9 GiB) instead of the 500m limit; probably because free reads the host's memory from /proc/meminfo rather than the container's cgroup limit.
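If you want to see the limit that actually applies to the container, one option (a minimal sketch, assuming cgroup v2; on cgroup v1 the path is different) is to read the memory limit from the cgroup filesystem instead of relying on free:
use std::fs;

fn main() {
    // On cgroup v2 the container's memory limit is exposed in this file
    // ("max" means unlimited). On cgroup v1 the equivalent file is
    // /sys/fs/cgroup/memory/memory.limit_in_bytes.
    match fs::read_to_string("/sys/fs/cgroup/memory.max") {
        Ok(limit) => println!("container memory limit: {}", limit.trim()),
        Err(e) => eprintln!("could not read the cgroup memory limit: {e}"),
    }
}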
You can set the option remove_peerless_torrents = false and make hundreds of announce requests until you use the 500m.
Anyway, it seems (from some quick reading) that Rust aborts the process when it can't allocate more memory. I've also seen a way to "capture" that event, using the unstable alloc_error_hook feature:
#![feature(alloc_error_hook)] // nightly-only: set_alloc_error_hook is an unstable API
use std::alloc::{set_alloc_error_hook, Layout};

fn main() {
    set_alloc_error_hook(|_layout: Layout| {
        // Custom action on allocation error, like logging the failed allocation
        std::process::abort();
    });
    // Rest of your code
}
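Since set_alloc_error_hook is nightly-only, a possible alternative on stable Rust (just a sketch, not something the tracker does today) is to use fallible allocation on the hot path, e.g. Vec::try_reserve, so inserting a peer can fail gracefully instead of aborting the process. The try_add_peer helper and PeerId type below are made up for the example:
use std::collections::TryReserveError;

// Stand-in type for the example.
type PeerId = u64;

// Hypothetical helper: try to grow a peer list without aborting on OOM.
fn try_add_peer(peers: &mut Vec<PeerId>, peer: PeerId) -> Result<(), TryReserveError> {
    // try_reserve returns an error instead of aborting when the allocation fails,
    // so the caller can reject the request and keep the process alive.
    peers.try_reserve(1)?;
    peers.push(peer);
    Ok(())
}

fn main() {
    let mut peers: Vec<PeerId> = Vec::new();
    match try_add_peer(&mut peers, 42) {
        Ok(()) => println!("peer added, {} peers in the list", peers.len()),
        Err(e) => eprintln!("could not allocate memory for a new peer: {e}"),
    }
}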
Limit memory consumption by limiting concurrent requests
You are now working on this:
- https://github.com/torrust/torrust-tracker/issues/567
There, you are controlling the amount of memory used and deleting torrents when you reach the limit.
I've been thinking about an alternative. Instead of directly controlling memory consumption, we could control concurrent requests. In theory, if we limit the number of concurrent requests, that would indirectly limit the amount of memory used.
We could check the processing time for each request and set a maximum time. When we go over the maximum response time, we can start rejecting new requests. It could be something similar to what @da2ce7 did by limiting active requests to 50, but setting that limit dynamically. In theory, if we stop accepting new requests, the memory consumption should not increase.
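As a rough illustration of the idea (not the tracker's current code), here is a minimal sketch using a tokio Semaphore: requests are rejected when no permits are available, and a permit is "forgotten" when a request takes too long, which lowers the effective concurrency. The names handle_announce, MAX_CONCURRENT_REQUESTS and MAX_RESPONSE_TIME are made up for the example:
// Requires the tokio crate.
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Semaphore;

const MAX_CONCURRENT_REQUESTS: usize = 50;
const MAX_RESPONSE_TIME: Duration = Duration::from_millis(100);

// Placeholder for the real announce handling.
async fn handle_announce() {
    tokio::time::sleep(Duration::from_millis(5)).await;
}

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_REQUESTS));

    // In the real service this loop would be the HTTP/UDP request handler.
    for _ in 0..1_000 {
        let semaphore = Arc::clone(&semaphore);

        // Reject the request when all permits are taken instead of queueing it.
        let Ok(permit) = semaphore.try_acquire_owned() else {
            // Here we would return a "try again later" response to the peer.
            continue;
        };

        tokio::spawn(async move {
            let start = Instant::now();
            handle_announce().await;

            // If responses get slow, shrink the concurrency limit by not
            // returning this permit to the pool; permits could be added back
            // with Semaphore::add_permits once the service recovers.
            if start.elapsed() > MAX_RESPONSE_TIME {
                permit.forget();
            }
        });
    }

    // Give the spawned tasks a moment to finish (demo only).
    tokio::time::sleep(Duration::from_secs(1)).await;
}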
This proposal has other advantages:
- It also works for CPU consumption or any other resource, because the metric is the response time.
- We can handle overload gracefully (graceful degradation).
- We could implement many policies like prioritising the peers that came first to the swarm.
- We could also implement other measures to avoid DoS attacks.
- We could reject requests from peers that announce themselves too often (less than the min interval).
- We don't need to calculate the memory consumption; doing that calculation could decrease performance.
Disadvantages:
- When the service degrades, it affects all peers in the same way, no matter how long they have been using the service or how well they have been behaving. For example, peers announcing too often can cause problems for peers behaving well.
- We remove peerless torrents. If we limit the number of requests, torrents will be removed somewhat randomly, depending on which requests we accept. It might be the case that we always reject requests for the most popular torrents. However, I suppose requests would be rejected randomly, so the most popular torrents would have more peers anyway and would be more likely to stay in the repository.
- I suppose response time would be measured in the service layer (the HTTP or UDP handler), not in the repository or domain service, because we need to adjust the "demand" to the "offer". But maybe that's not a bad thing. We can have multiple HTTP and UDP trackers, and they would adjust their request load depending on the global resource consumption of all services. I suppose all services would be balanced automatically.
But this would work only if reducing the load means reducing memory consumption.
The question is: can memory consumption still increase if we limit the number of requests?
In theory, peers should announce themselves every 2 minutes. We can consider different scenarios for the normal case (that is, peers behaving well), like:
- Many peers announcing the same torrent
- Many peers announcing different torrents
If we limit concurrent requests to 1 per second and peers re-announce every 120 seconds, we would have in the torrent repository either:
- One torrent with 120 peers, or
- 120 torrents with one peer each
Assuming:
- peer_size = memory used by one peer entry
- torrent_size = stats_size + (peer_size * number_of_peers)
With 1 request per second (type 1, one torrent with 120 peers):
- size = (stats_size + (peer_size * 120)) * 1
With 1 request per second (type 2, 120 torrents with one peer each):
- size = (stats_size + (peer_size * 1)) * 120
With N requests per second (type 1, one torrent with 120 * N peers):
- size = (stats_size + (peer_size * 120 * N)) * 1
With N requests per second (type 2, 120 * N torrents with one peer each):
- size = (stats_size + (peer_size * 1)) * 120 * N
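To make the comparison concrete, here is a small sketch that evaluates both formulas. The byte sizes are assumptions for the example, not measurements of the real torrent repository:
// Back-of-the-envelope estimate with assumed sizes.
const PEER_SIZE: u64 = 200; // assumed bytes per peer entry
const STATS_SIZE: u64 = 64; // assumed bytes per torrent stats entry
const ANNOUNCE_INTERVAL_SECS: u64 = 120;

/// Type 1: all announces go to the same torrent.
fn size_same_torrent(requests_per_second: u64) -> u64 {
    STATS_SIZE + PEER_SIZE * ANNOUNCE_INTERVAL_SECS * requests_per_second
}

/// Type 2: every announce is for a different torrent (one peer each).
fn size_different_torrents(requests_per_second: u64) -> u64 {
    (STATS_SIZE + PEER_SIZE) * ANNOUNCE_INTERVAL_SECS * requests_per_second
}

fn main() {
    for n in [1, 10, 100, 1_000] {
        println!(
            "{n} req/s -> same torrent: {} bytes, different torrents: {} bytes",
            size_same_torrent(n),
            size_different_torrents(n)
        );
    }
}
With these made-up sizes, the "different torrents" case comes out larger, because the stats overhead is paid once per peer.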
If we work out the worst-case scenario (the one that consumes the most memory), we can limit the number of concurrent requests directly, or limit them indirectly when the response time is high.
This approach assumes:
- Peers only announce themselves once every 120 seconds (adjusted to the min interval in the settings).
- We remove peerless torrents; otherwise, memory consumption will grow anyway. This solution relies on cleaning up torrents without peers: if we limit the number of requests, only some torrents will keep being updated, and the ones that are not updated will become peerless and be removed.
Does the option remove_peerless_torrents = false make sense?
Since we are now trying to limit memory consumption, maybe that option no longer makes sense, because with it memory consumption grows indefinitely.
Does it even make sense? Why do we want to keep torrents without peers, @WarmBeer? If you want to keep a list of torrents, we already have the persisted stats.
What do you think @WarmBeer @da2ce7?
I'm going to close this issue. I assume the tracker was restarted after the Docker healthcheck failed.
We could apply memory consumption limits in the future if we consider it useful, but for other reasons.
The system should only degrade gracefully as more memory is consumed.