nginx-service-mesh
grpc traffic allowed when accessControlMode=deny
I deployed NSM with accessControlMode=deny, and gRPC still works perfectly.
ACTUAL RESULTS
Transactions work with gRPC, when they should fail.
EXPECTED
I expected that traffic would be forbidden. This is the case with HTTP, but not with gRPC:
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.21.6</center>
</body>
</html>
STEPS (General)
- Deploy NSM
- Deploy server application
- Deploy client application
- Redeploy NSM with accessControlMode=deny and verify:
  kubectl delete --namespace "nginx-mesh" \
    $(kubectl get pods \
        --namespace "nginx-mesh" \
        --selector "app.kubernetes.io/name=nginx-mesh-api" \
        --output name)
  # verify deny
  nginx-meshctl config | jq -r .accessControlMode
- Attempt gRPC traffic between members within the service mesh
STEPS (Specific)
These are steps from within my project, but really anything similar should work. I used grpcurl to test the gRPC traffic:
URLS=(https://docs.nginx.com/nginx-service-mesh/examples/{prometheus,grafana,otel-collector,jaeger}.yaml)
for URL in ${URLS[*]}; do curl -sOL $URL; done
for FILE in {prometheus,grafana,otel-collector,jaeger}.yaml; do kubectl apply -f $FILE; done
cat << EOF > nsm.yaml
# nsm.yaml
prometheusAddress: prometheus.nsm-monitoring.svc:9090
telemetry:
  exporters:
    otlp:
      host: otel-collector.nsm-monitoring.svc
      port: 4317
  samplerRatio: 1
tracing: null
mtls:
  mode: permissive
autoInjection:
  disable: false
  disabledNamespaces:
    - nsm-monitoring
EOF
cat << EOF > dgraph.yaml
# dgraph.yaml
image:
  tag: v21.03.2
alpha:
  configFile:
    config.yaml: |
      security:
        whitelist: "0.0.0.0/0"
EOF
cat << EOF > pydgraph.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pydgraph-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pydgraph-client
  template:
    metadata:
      labels:
        app: pydgraph-client
    spec:
      containers:
        - name: pydgraph-client
          image: darknerd/pydgraph-client:latest
          env:
            - name: DGRAPH_ALPHA_SERVER
              value: dgraph-dgraph-alpha.dgraph.svc.cluster.local
          resources:
            requests:
              memory: "64Mi"
              cpu: "80m"
            limits:
              memory: "128Mi"
              cpu: "250m"
EOF
###################
# deploy mesh
########################
helm repo add nginx-stable https://helm.nginx.com/stable
helm install nginx-mesh nginx-stable/nginx-service-mesh --values nsm.yaml
###################
# deploy server to mesh
########################
kubectl get namespace "dgraph" > /dev/null 2> /dev/null \
|| kubectl create namespace "dgraph" \
&& kubectl label namespaces "dgraph" name="dgraph"
helm repo add dgraph https://charts.dgraph.io
helm template dgraph dgraph/dgraph --values dgraph.yaml \
| nginx-meshctl inject \
--ignore-incoming-ports 5080,7080 \
--ignore-outgoing-ports 5080,7080 \
  | kubectl apply --namespace "dgraph" --filename -
###################
# deploy client to mesh
########################
kubectl get namespace "pydgraph-client" > /dev/null 2> /dev/null \
|| kubectl create namespace "pydgraph-client" \
&& kubectl label namespaces "pydgraph-client" name="pydgraph-client"
cat pydgraph.yaml \
| nginx-meshctl inject \
| kubectl apply --namespace "pydgraph-client" --filename -
###################
# set accessControlMode=deny
########################
cat << EOF > nsm_deny.yaml
# nsm_deny.yaml
prometheusAddress: prometheus.nsm-monitoring.svc:9090
telemetry:
  exporters:
    otlp:
      host: otel-collector.nsm-monitoring.svc
      port: 4317
  samplerRatio: 1
tracing: null
mtls:
  mode: permissive
autoInjection:
  disable: false
  disabledNamespaces:
    - nsm-monitoring
accessControlMode: deny
EOF
# patch configuration
helm upgrade nginx-mesh nginx-stable/nginx-service-mesh --values nsm_deny.yaml
# delete pod using old configuration
kubectl delete --namespace "nginx-mesh" \
$(kubectl get pods --namespace "nginx-mesh" --selector "app.kubernetes.io/name=nginx-mesh-api" --output name)
# few seconds later, verify configuration is now deny
nginx-meshctl config | jq -r .accessControlMode
###################
# exec into client
########################
export CLIENT_NAMESPACE="pydgraph-client"
PYDGRAPH_POD=$(kubectl get pods --namespace $CLIENT_NAMESPACE --output name)
kubectl exec -ti --container "pydgraph-client" --namespace $CLIENT_NAMESPACE ${PYDGRAPH_POD} -- bash
###################
# NOTE: this is run in the pydgraph-client container
########################
# verify HTTP doesn't work (these fail as expected)
curl --silent ${DGRAPH_ALPHA_SERVER}:8080/health
curl --silent ${DGRAPH_ALPHA_SERVER}:8080/state
# verify gRPC doesn't work (these succeed, which is NOT expected)
grpcurl -plaintext -proto api.proto ${DGRAPH_ALPHA_SERVER}:9080 api.Dgraph/CheckVersion
# GRPC using python works as well, should not work:
python3 load_data.py --plaintext --alpha ${DGRAPH_ALPHA_SERVER}:9080 \
--files ./sw.nquads.rdf --schema ./sw.schema
Our gRPC support does not have feature parity with HTTP. The TrafficTarget objects do not affect gRPC traffic. This deserves an explanation in the documentation.
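For context, a TrafficTarget is the SMI resource that restricts traffic under accessControlMode=deny; the sketch below shows the general shape, assuming the SMI Traffic Access v1alpha2 API. All names and service accounts here are hypothetical, and per the comment above such a policy governs HTTP but has no effect on gRPC:

```yaml
# Hypothetical SMI TrafficTarget: allow only the client's service
# account to reach the Dgraph alpha service account over HTTP routes.
apiVersion: access.smi-spec.io/v1alpha2
kind: TrafficTarget
metadata:
  name: dgraph-alpha-target        # hypothetical name
  namespace: dgraph
spec:
  destination:
    kind: ServiceAccount
    name: dgraph                   # assumed service account of the alpha pods
    namespace: dgraph
  sources:
    - kind: ServiceAccount
      name: default                # assumed service account of the client pod
      namespace: pydgraph-client
  rules:
    - kind: HTTPRouteGroup
      name: dgraph-routes          # defined separately
      matches:
        - all-routes
```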
We chose this path, in part, to seamlessly adopt a specific gRPC SMI traffic spec as it was developed. See here. In lieu of SMI support, we will more clearly document the lack of gRPC support.
Note that according to the docs linked:
gRPC - there should be a gRPC-specific traffic spec. As part of the first version, this has been left out as HTTPRouteGroup can be used in the interim.
Thus the HTTPRouteGroup could be used in the interim, but unfortunately NSM's implementation does not support this. Given that adding an integrated ingress controller with kubernetes-ingress (i.e. the NGINX and NGINX Plus Ingress Controllers) requires every service that needs ingress to be part of the mesh, this increases the attack surface unnecessarily.
Setting the NGINX Service Mesh accessControlMode config to deny and using an HTTPRouteGroup to restrict traffic to sensitive services on the mesh would be one way to mitigate such risk.
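An HTTPRouteGroup used that way might look like the following sketch (assuming the SMI Traffic Specs v1alpha4 API; the name, path pattern, and method list are hypothetical):

```yaml
# Hypothetical HTTPRouteGroup limiting clients to Dgraph's
# read-only health endpoint over HTTP.
apiVersion: specs.smi-spec.io/v1alpha4
kind: HTTPRouteGroup
metadata:
  name: dgraph-routes              # referenced by a TrafficTarget rule
  namespace: dgraph
spec:
  matches:
    - name: health-only
      pathRegex: "/health"
      methods:
        - GET
```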
However, given that this feature does not actually work with gRPC, the solution is difficult to recommend professionally in its current implementation. This especially affects backend databases like Dgraph that communicate over gRPC, as well as other microservices that increasingly adopt gRPC for its performance benefits; gRPC is extremely popular with microservices, distributed databases, and other stateful services.
I would like to make this a feature request for gRPC support. Since using the integrated ingress controller requires putting everything on the mesh, being able to lock gRPC traffic down would be valuable. If the full SMI gRPC traffic spec will take too long to arrive (1+ years), then something is needed in the interim, e.g. a GRPCRoute resource along the lines of https://github.com/kubernetes-sigs/gateway-api/blob/main/site-src/geps/gep-1016.md; otherwise, waiting for the official spec is fine if it lands in the next few months.
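For illustration, a gRPC-aware route in the shape proposed by GEP-1016 could express method-level rules that HTTPRouteGroup cannot. This is a sketch of an alpha API whose fields may change, and the backend name is assumed from the Dgraph helm chart:

```yaml
# Sketch of a GRPCRoute per GEP-1016 (Gateway API, alpha at the time;
# field names follow the GEP and may differ in released versions).
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: GRPCRoute
metadata:
  name: dgraph-grpc
  namespace: dgraph
spec:
  rules:
    - matches:
        - method:
            service: api.Dgraph      # gRPC service from Dgraph's api.proto
            method: CheckVersion
      backendRefs:
        - name: dgraph-dgraph-alpha  # assumed service name from the chart
          port: 9080
```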
Commenting to keep the auto-close bot from closing this issue.
SMI looks to be on hold until the Gateway API and SMI efforts merge. In the meantime, having some CRD (even if proprietary) would be ideal until SMI and the Gateway SIG come up with a solution. Other service meshes that intend to adopt the standard, such as Linkerd, are doing this as far as I can tell, and will deprecate their proprietary CRDs once the standard ships official ones.