squbs icon indicating copy to clipboard operation
squbs copied to clipboard

Potential meter count issues in metrics flow

Open anilgursel opened this issue 7 years ago • 1 comments

The logic for response metrics with https://github.com/paypal/squbs/pull/553 is as follows:

    val outbound = Flow[RequestContext].map { rc =>
      rc.attribute[Timer.Context](requestTime) map {_.stop()}

      rc.response map {
        case Success(response) =>
          metrics.meter(responseCount).mark()
          val statusCode = response.status.intValue()

          if (statusCode >= 500) metrics.meter(count5XX).mark()
          else if (statusCode >= 400) metrics.meter(count4XX).mark()
          else if (statusCode >= 300) metrics.meter(count3XX).mark()
          else if (statusCode >= 200) metrics.meter(count2XX).mark()
        case Failure(ex) =>
          metrics.meter(responseCount).mark()
          val causeExceptionClass = Try(ex.getCause.getClass.getSimpleName) getOrElse ex.getClass.getSimpleName
          metrics.meter(MetricRegistry.name(domain, s"$name-$causeExceptionClass-count")).mark()
      }

      rc
    }

I see some potential issues in metrics collection code.

Server Side Metrics

case Failure(ex)

For the case Failure(ex) => scenario, does the the server actually send a response back to the client?

  • If yes, what is the response code?
    • If always 5xx, we should increment 5xx counter along with responseCount.
    • If not always 5xx, we either need to somehow find the response code, or have a different category of metrics, e.g., UNKNOWN.
  • If no, then metrics.meter(responseCount).mark() should be removed. responseCount is, IMO, how many responses are sent back to client. While debugging an issue, if I see this metric, I would assume server responded this many number of times (not how many times the code executed through the pipeline, at the end response count being increment in pipeline is an implementation detail).

rc.response being None

In FlowHandler.scala, I see the following:

val responseFlow = b.add(Flow[RequestContext].map { rc =>
        rc.response map {
          case Success(response) => response
          case _ => InternalServerError
        } getOrElse NotFound
      })

It looks like we would return NotFound to the client if response is None but not increment 4xx counter. I think NotFound is not true though. It might actually be None because of a bug in the code (potentially somewhere in the pipeline), so would return InternalServerError and increment 5xx.

404 scenario when a webContext does not match is not visible

I see the following in FlowHandler

// Last output port is for 404
partition.out(flowWrappers.size) ~> notFound ~> routesMerge

The pipeline is per squbs-service If no webcontext is matched to a request, we return 404; however, the pipeline is not installed in that scenario, so no metrics is incremented. Anyway, the metrics are webcontext specific, so that's fine. However, for better visibility of the system, we should probably keep another metric specifically for 404 when webcontext not matched scenario. Please note, 404 can happen in two ways:

  • A webcontext is matched but the route could not match
  • A webcontext did not match.

The second one is invisible at this point.

Client Side Metrics

case Failure(ex)

case Failure(ex) => scenario means that we did not receive a response from the server, right? Something bad happened before we received a response. Why do we increment response count then? Is there any scenario where the server actually returned a response and response is still a Failure?

rc.response being None

Having a None would most likely be because of a bug in the code. I think we should log it as a warning or error. There should not be any scenario where response is None. Please see

def pipelineAdapter(clientConnectionFlow: ClientConnectionFlow[RequestContext]):
  Flow[RequestContext, RequestContext, HostConnectionPool] = {
    val fromRc = Flow[RequestContext].map { rc => (rc.request, rc) }
    val toRc = Flow[(Try[HttpResponse], RequestContext)].map {
      case (responseTry, rc) => rc.copy(response = Option(responseTry))
    }

    fromRc.viaMat(clientConnectionFlow)(Keep.right).via(toRc)
  }

anilgursel avatar Jan 31 '18 07:01 anilgursel

@anilgursel I think you have covered the gaps well. We may identify others over time but I cannot think of anything else just yet. We need to make sure we build test cases for each of these scenarios.

akara avatar Feb 01 '18 00:02 akara