Expose TcpPortAutoRange in metrics endpoint
Hello folks!
We have a use case in which we'd like to monitor the allocation of available TCP ports to DRBD resources. Since LINSTOR exposes the TcpPortAutoRange controller property to control how to allocate new ports, we'd like to use that information in our Grafana dashboards to determine how close we are to the limit of available TCP ports. With that information, we would be able to anticipate TCP port exhaustion and take the actions described in this guide before errors start happening.
Is adding a new metric to LINSTOR that exposes the upper and lower ranges of the TcpPortAutoRange property feasible? I'm imagining that the end result could look something like this:
linstor_controller_tcp_port_auto_range_start: The start of the range set on the TcpPortAutoRange controller property
linstor_controller_tcp_port_auto_range_end: The end of the range set on the TcpPortAutoRange controller property
On the implementation side, these metrics should probably be gauges. I took a quick look at the implementation of LINSTOR's Metrics handler, and it seems like we can already fetch those properties from the CtrlApiCallHandler.
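To make the proposal concrete, here is a rough Python sketch of what the two gauges could look like in the Prometheus text exposition format. It is not LINSTOR code: the function names are hypothetical, and it assumes the property value is a string of the form "start-end" (for example "7000-7999").

```python
from typing import Tuple


def parse_port_range(value: str) -> Tuple[int, int]:
    """Split a 'start-end' string into two integers.

    Assumption: TcpPortAutoRange is stored as 'start-end', e.g. '7000-7999'.
    """
    start_text, end_text = value.strip().split("-", 1)
    start, end = int(start_text), int(end_text)
    if not (0 < start <= end <= 65535):
        raise ValueError(f"invalid TCP port range: {value!r}")
    return start, end


def render_gauges(value: str) -> str:
    """Render the two proposed gauges in Prometheus text exposition format."""
    start, end = parse_port_range(value)
    metrics = (
        ("linstor_controller_tcp_port_auto_range_start",
         "The start of the range set on the TcpPortAutoRange controller property",
         start),
        ("linstor_controller_tcp_port_auto_range_end",
         "The end of the range set on the TcpPortAutoRange controller property",
         end),
    )
    lines = []
    for name, help_text, metric_value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {metric_value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gauges("7000-7999"), end="")
```

Both values are gauges rather than counters because the range can be moved in either direction whenever the property is changed.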
Do you guys think it makes sense to have those two new metrics?
I don't see how these metrics would help, as they would be static?
I think what you'd rather want is a metric that tells you the count of "free" TCP ports in the autorange.
The metric wouldn't be static because it can change depending on how we set the TcpPortAutoRange in the controller. I guess that depends on what we mean by static, but I would consider it dynamic.
Yes, the end goal is to be able to view how many free TCP ports we have from the autorange. From what I understand, this can be computed from linstor_controller_tcp_port_auto_range_end - linstor_controller_tcp_port_auto_range_start - linstor_resource_definition_count. Do you think that assumption is correct?
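As a quick sanity check of that arithmetic, here is a small Python sketch of the client-side calculation. It rests on two assumptions that are not confirmed in this thread: that each resource definition consumes exactly one port from the range, and whether the end of the range is itself assignable. If the end is inclusive, the range holds one more port than end - start, so the sketch makes that a parameter. The function name is hypothetical.

```python
def free_ports(range_start: int, range_end: int, used_ports: int,
               inclusive_end: bool = True) -> int:
    """Estimate how many ports in the auto range are still unassigned.

    Assumptions: one port per resource definition, and `inclusive_end`
    says whether `range_end` itself can be handed out.
    """
    if range_end < range_start:
        raise ValueError("range_end must not be smaller than range_start")
    size = range_end - range_start + (1 if inclusive_end else 0)
    # Clamp at zero so an over-committed range never reports negative capacity.
    return max(size - used_ports, 0)


if __name__ == "__main__":
    # Range 7000-7999 with 250 resource definitions.
    print(free_ports(7000, 7999, 250))                       # inclusive end
    print(free_ports(7000, 7999, 250, inclusive_end=False))  # formula as written above
```

In Grafana the same subtraction would be a single query expression over the three metrics, so the off-by-one question only matters when the range is nearly full.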
I went the route of just exposing the raw config values in the metrics endpoint because the computation of the final metric seems simple enough to be done on the client side (Grafana), and having the raw values might open up other use cases for those metrics. But for this specific use case, I think both styles (raw config values vs. a calculated count of free ports) work just fine.
Since you would have to be able to extend TcpPortAutoRange to avoid running out of assignable port numbers, can you explain why you would add unnecessary complexity and potential for failure in the first place by initially configuring TcpPortAutoRange to be too small, knowing that you will have to reserve more port numbers for the future extension of TcpPortAutoRange, only to actually activate that range later? This makes no sense to me. TcpPortAutoRange is supposed to be configured for the maximum number of resources that the system needs to support from the start.
I agree that the correct approach is to do proper capacity planning to find an appropriate value for TcpPortAutoRange. Still, there's a limit to what we can plan for. Things can change throughout the cluster's lifespan (addition of storage capacity, change in workload profiles, node failures, evacuation of nodes, etc.). We can plan for all those known unknowns, but there are always unknown unknowns that can impact resource usage.
I understand your point in the sense that it's very weird to have to change TcpPortAutoRange, and that it is definitely the result of either misconfiguration or a mistake in capacity planning on the user's part. Nonetheless, this port limit exists, and this scenario is still possible. LINSTOR acknowledges the possibility of such a scenario and offers means to remediate it, as described in the guide.
The way I see it today is that, much like the total storage capacity (reported by linstor_storage_pool_capacity_total_bytes), the value of TcpPortAutoRange represents an upper bound on the number of resources in the LINSTOR cluster. As such, we'd find it very valuable to have this information alongside the other metrics of the cluster. I strongly believe that this information would make this limit more visible to operators who are not yet familiar with LINSTOR. Moreover, this would also let us create alerts on port exhaustion, so that users can take preventive action before errors start to occur.
I understand that every new feature adds complexity and maintenance costs to the project. My proposal of exposing the raw upper and lower config values aims to reduce the complexity on LINSTOR's side while maximizing the flexibility for clients (they can both calculate the number of free ports as well as display the raw values on the dashboard).