PrometheusBasedResourceLimitChecks does not obey limits when the checked entry is not cached
For example, the `isMessageLimitReached` method of `PrometheusBasedResourceLimitChecks` does not apply any timeout when a cache read is needed (line 204 of its code). As a consequence, the byte-sized payload limit effectively resets at the cache invalidation interval: the retrieval request always takes the else branch and lets the payload pass through, even when it is over the limit.
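The behaviour being described can be illustrated with a minimal, self-contained sketch; the names `NonBlockingLimitCheck`, `isLimitExceeded` and `DEFAULT_ALLOW` are illustrative assumptions, not the actual Hono code:

```java
import java.util.concurrent.CompletableFuture;

public class NonBlockingLimitCheck {

    // Fallback used while the cached Prometheus query result is not yet available.
    static final boolean DEFAULT_ALLOW = true;

    /**
     * Mirrors the pattern discussed above: if the cached query result is not
     * yet complete, do not wait for it, simply accept the message.
     */
    static boolean isLimitExceeded(final CompletableFuture<Long> cachedBytes,
            final long payloadSize, final long limit) {
        if (cachedBytes.isDone() && !cachedBytes.isCompletedExceptionally()) {
            return cachedBytes.join() + payloadSize > limit;
        }
        // Cache entry still loading: fall through and let the payload pass.
        return !DEFAULT_ALLOW;
    }

    public static void main(final String[] args) {
        final CompletableFuture<Long> pending = new CompletableFuture<>();
        // While the query is still running, the limit is never reported as exceeded,
        // no matter how large the payload is.
        System.out.println(isLimitExceeded(pending, 1_000_000, 100)); // false

        pending.complete(90L);
        System.out.println(isLimitExceeded(pending, 20, 100)); // true: 90 + 20 > 100
    }
}
```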
Should this `value.isDone()` check on line 204 be changed to something with a short wait timeout as well? Or is it intentional not to wait at all for the cache entry to become available?
> Should this `value.isDone()` check on line 204 be changed to something with a short wait timeout as well? Or is it intentional not to wait at all for the cache entry to become available?
Not waiting for the query to complete is intentional: we would rather accept a few operations that exceed the limit than block (and eventually fail) otherwise appropriate operations.
We could introduce a (very small) timeout instead of immediately falling back to accepting the operation but this value would need to be adapted to the concrete setup (and the expected responsiveness of Prometheus) by the admin. IMHO this will require quite some experience and testing to find a suitable value.
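A bounded wait, as floated here, might look like the following sketch. The class name, method signature and timeout value are illustrative assumptions, not Hono's implementation:

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class BoundedWaitLimitCheck {

    /**
     * Waits at most {@code timeout} for the cached value to become available
     * and falls back to accepting the payload if it does not.
     */
    static boolean isLimitExceeded(final CompletableFuture<Long> cachedBytes,
            final long payloadSize, final long limit, final Duration timeout) {
        try {
            final long used = cachedBytes.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return used + payloadSize > limit;
        } catch (final Exception e) {
            // Timed out (or the query failed): accept rather than block further.
            return false;
        }
    }

    public static void main(final String[] args) {
        final CompletableFuture<Long> cached = CompletableFuture.completedFuture(95L);
        System.out.println(isLimitExceeded(cached, 10, 100, Duration.ofMillis(50))); // true
    }
}
```

The hard part, as noted above, is that the timeout value has to be tuned to the responsiveness of the concrete Prometheus setup.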
From our point of view this simply was not worth the effort when we created the code. WDYT?
The problem is that this now occurs every time the cache entry is not available, by default every 60 seconds. After each such minute, requests over the limit are accepted again until a new cache entry is present. For a tenant sending at a very slow pace, this limit check will effectively never trigger with the current defaults.
But if this is intentional, would it be OK to add a configuration flag to change the behaviour? I have already started implementing a short wait period for the missing cache entry to be created. I'm not sure the timeout value needs to be optimally short; a second or two would suffice in my opinion, similar to the overall Prometheus query timeout, which I would rather rely on as the backoff timeout here too: the Prometheus query triggers its own timeout, and any other errors that might arise there show up sooner.
> For a tenant sending at a very slow pace, this limit check will effectively never trigger with the current defaults.
For such a client you will most likely not be concerned about exceeding the data limit ;-) On the other hand, if you have a tenant with thousands of devices sending data at quite a high rate, you will (potentially) block hundreds to thousands of device requests, which pile up in memory waiting for the cache entry to become available.
I would be OK with adding a configuration option for the duration to wait for the cache entry to become available instead of immediately falling back to the default value. This option would need to be off by default and should be documented appropriately, i.e. describing the impact that enabling it might have (as I described above). Would that work for you?
I have to admit that, after pondering the current implementation and the rationale behind it, I've started to look at this issue from a totally different perspective. Adding a new, enforced shorter timeout for the cache get isn't necessarily the best option for either of us.
But I hadn't noticed this tiny detail of how the Caffeine cache is currently initialised:
```java
.expireAfterWrite(cacheTimeout)
.refreshAfterWrite(cacheTimeout.dividedBy(2));
```
What would you think of the idea that `refreshAfterWrite` would be set to the current `cacheTimeout` value, and `expireAfterWrite` to some non-configurable, hour(s)-long duration instead? This would extend the lifespan of cache entries for tenants exceeding their limits, but would still reload cache entries after the configured `cacheTimeout`.
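Concretely, this proposal would change the cache initialisation to something like the following sketch (the one-hour ceiling is an illustrative assumption, not a value from the codebase):

```java
// Proposed: entries are refreshed after the configured cacheTimeout (on access),
// but only evicted after a much longer, fixed upper bound, so a stale-but-known
// limit value keeps being enforced instead of falling back to "accept".
Caffeine.newBuilder()
        .refreshAfterWrite(cacheTimeout)
        .expireAfterWrite(Duration.ofHours(1));
```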
> What would you think of the idea that `refreshAfterWrite` would be set to the current `cacheTimeout` value, and `expireAfterWrite` to some non-configurable, hour(s)-long duration instead?
I'd rather not change the semantics of the `cacheTimeout` property. However, I guess it would be helpful to be able to configure the divisor to use for the refresh interval. That way, you could configure the cache timeout to a large value (e.g. hours) and still let the value be refreshed after a reasonably short amount of time (if accessed). WDYT?
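With a configurable divisor, the initialisation would stay structurally the same as today; the name `refreshDivisor` here is hypothetical:

```java
// refreshDivisor = 2 preserves today's behaviour; a large cacheTimeout combined
// with a large divisor keeps entries alive for hours while still refreshing
// them at a reasonably short interval when they are accessed.
Caffeine.newBuilder()
        .expireAfterWrite(cacheTimeout)
        .refreshAfterWrite(cacheTimeout.dividedBy(refreshDivisor));
```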
> However, I guess it would be helpful to be able to configure the divisor to use for the refresh interval. That way, you could configure the cache timeout to a large value (e.g. hours) and still let the value be refreshed after a reasonably short amount of time (if accessed). WDYT?
Yes, this suffices for my needs at least, to slightly harden against exceeding the transfer limit in particular.
I will prepare a PR for this.
@harism can this be closed?
> @harism can this be closed?
Yes, on my behalf at least; I'm happy with the suggested configurable cache timeout divisor, which has now been implemented.