rpki-validator-3

validation failure scenarios and their impact on RTR

Open lukastribus opened this issue 3 years ago • 5 comments

Hello,

I'm currently evaluating RPKI RPs and have a few questions.

I'm concerned about bugs, misconfigurations or other issues (in all RP/RTR setups) that leave obsolete VRPs on the production routers, because I believe this is the worst case in RPKI ROV deployments.

I worry about:

  • crash bugs in the validation code
  • hangs during RPKI validation (even in rsync) that block the entire validation
  • memory allocation failures (failed malloc)
  • Linux OOM-killer (probably killing the process with the largest amount of memory usage)
  • admin mistakes

and how those impact the RTR service:

The best-case scenario in my mind is that the RTR server goes down completely and all RTR sessions die, so that the production routers are aware there is a problem with that RTR endpoint and stop using it (failing over to other RTR servers, if available).

According to the wiki:

The RPKI-RTR server is a separate daemon, that allows routers to connect using the RPKI-RTR protocol. It's set up as a separate instance because not everyone needs to run this, but more importantly, if you do need to run this then a separate daemon allows one to run more than one instance for redundancy (it keeps state even when the validator is down).

So it seems the expected behavior is to keep the RTR server online as long as possible, is that correct? How would we avoid serving obsolete VRPs to production routers in this case?

I'm also thinking about monitoring (other than parsing logs):

I'd say the api/validation-runs/latest-successful API provides all of that information, so we can query it and feed the result into an external monitoring system.
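A minimal sketch of such a check, in Python, in case it helps frame the question. Only the api/validation-runs/latest-successful path comes from above. The base URL, the `completedAt` field name, the 30-minute threshold and all function names are assumptions for illustration, not something taken from the validator's documentation. The return values follow the usual monitoring-plugin convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumptions for illustration only; adjust to the real deployment.
BASE_URL = "http://localhost:8080"        # hypothetical validator address
ENDPOINT = "/api/validation-runs/latest-successful"
TIMESTAMP_FIELD = "completedAt"           # hypothetical field name
MAX_AGE = timedelta(minutes=30)           # hypothetical staleness threshold

OK, CRITICAL, UNKNOWN = 0, 2, 3


def fetch_latest(base_url=BASE_URL, timeout=10):
    """Fetch and decode the JSON document; the timeout guards against a hung validator."""
    with urllib.request.urlopen(base_url + ENDPOINT, timeout=timeout) as resp:
        return json.load(resp)


def find_field(obj, name):
    """Return the first value stored under key `name` anywhere in the document."""
    if isinstance(obj, dict):
        if name in obj:
            return obj[name]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_field(child, name)
        if found is not None:
            return found
    return None


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; naive values are treated as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def check_payload(payload, now, max_age=MAX_AGE, field=TIMESTAMP_FIELD):
    """Map the age of the last successful validation run to a plugin exit code."""
    raw = find_field(payload, field)
    if not isinstance(raw, str):
        return UNKNOWN                    # field missing or not a timestamp string
    try:
        last_success = parse_timestamp(raw)
    except ValueError:
        return UNKNOWN
    return CRITICAL if now - last_success > max_age else OK


def run_check():
    """Full check: an unreachable or broken validator counts as CRITICAL."""
    try:
        payload = fetch_latest()
    except (OSError, ValueError):
        return CRITICAL
    return check_payload(payload, datetime.now(timezone.utc))
```

A monitoring system could call `run_check()` periodically and alert on any non-zero result. Treating an unreachable API as CRITICAL is a deliberate choice here, since a crashed, hung or OOM-killed validator is exactly the case that would otherwise go unnoticed.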

lukastribus, Sep 16 '20 10:09