
Accumulating new AuthRequest objects can cause a denial of service on storage.

Open pjmichael opened this issue 7 years ago • 29 comments

We are running dex in a Kubernetes cluster using Kubernetes CRDs as our storage implementation. We recently came across a scenario in our testing in which a misconfiguration resulted in the storage of a large number of authorization requests.

The implications of this scenario with our implementation were that:

  • Each authorization request resulted in the creation of a new AuthRequest object, hard-coded with an expiry of 24 hours (see: https://github.com/dexidp/dex/blob/master/server/handlers.go#L163)
  • Even though garbage collection ran every 5 minutes, AuthRequest objects continued to accumulate for 24 hours (i.e., until they expired)
  • Garbage collection lists all AuthRequest objects, without pagination or filtering, at the start of each execution
  • Eventually, the list AuthRequests call began to time out, causing garbage collection to fail
  • By the time AuthRequest objects began to expire, they were not being cleaned up, as garbage collection could no longer list AuthRequests successfully
  • The flow-on effect was a degradation of the kubernetes apiserver (i.e., the storage implementation)

Do you have any thoughts as to how this issue could be avoided?

Some possible ideas I was thinking which may help were:

  • making AuthRequest expiry configurable, to reduce the period in which requests can accumulate before being garbage collected
  • perhaps adding rate limiting on the creation of AuthRequest objects

pjmichael avatar Sep 07 '18 03:09 pjmichael

Thanks for reporting this, and including all that information. 👍

  • making AuthRequest expiry configurable, to reduce the period in which requests can accumulate before being garbage collected
  • perhaps adding rate limiting on the creation of AuthRequest objects

I suppose that making the "totally arbitrary value" configurable is a great idea, and a low-hanging fruit. OTOH, it's also a bit of a band-aid...

Looking at the GC code, I think this is an issue specific to the k8s CRD usage -- the SQL query is likely much more efficient. Also, I know next to nothing about CRDs, so I don't know if there are ways to tweak the garbage collection code for them... As you mention, pagination could be one option (no idea if that's possible).

I think some kind of rate limiting would be good, even regardless of this specific bug. (However, I also wonder if this wouldn't often be handled by some reverse proxy sitting in front of Dex.)

For the AuthRequest generation, the issue referenced above that "totally arbitrary value" seems related, too: #646. If fewer AuthRequests were generated, that could relieve the pressure as well.

srenatus avatar Sep 10 '18 07:09 srenatus

I think the crux of the issue is "prevent unauthenticated users from creating too much backend state."

Rate limiting would be good, but overall we probably want to tie the auth request to an IP or something. The proposal in #646 is one way, but doesn't stop a motivated attacker from messing with your cluster.

Removing "help wanted" for now. I don't think there's a clear enough solution to warrant the label.

ericchiang avatar Sep 10 '18 15:09 ericchiang

I suppose that making the "totally arbitrary value" configurable is a great idea, and a low-hanging fruit. OTOH, it's also a bit of a band-aid...

TBH I would prefer a band-aid over downtime.

mxey avatar Dec 11 '18 17:12 mxey

@mxey help wanted or not, exposing that as a config option seems like a welcome PR to me. 😃

srenatus avatar Dec 12 '18 07:12 srenatus

I bet there's a way we can get away from storing state until we know a user's authenticated by using a cookie. The naive way would be to serialize the AuthRequest, sign it using dex's signing keys, and shoving it in the user's session:

https://godoc.org/github.com/coreos/dex/storage#AuthRequest

Then we'd only persist the AuthRequest once we get a valid response from the backend (e.g. the LDAP server), at which point we'd fill in the Claims and ConnectorData.

Overall, I think breaking up the AuthRequest into different classes of data would help forward that effort. Even just adding comments to the struct about which fields are generated by the initial request, which are filled in once the user's authenticated with a backend, and what other data accumulates along the way would help. It was probably wrong to lump all that data together in the first place.

ericchiang avatar Dec 13 '18 18:12 ericchiang

by using a cookie.

One thing to consider here is that cookies are per browser, not per tab. With the current way of redirecting with a req=.. parameter, you can have two concurrent login flows in one browser; with cookies, this could become more complicated.

As an alternative, we might be able to pass the auth request as a signed JWT instead -- so it wouldn't be ?req=auth_req_id but ?req=auth_req_as_jwt. We wouldn't have to store anything in the database, and we'd keep things working "per-tab". A potential caveat is header size (redirecting using a URL that has the auth request JWT in it) -- but since the auth request contains little more than the connector ID, some expiry, header, and signature, I don't think that'll become an issue.

srenatus avatar Dec 16 '18 11:12 srenatus

With the current way of redirecting with a req=.. parameter, you can have to concurrent login flows in one browser, using cookies, this could become more complicated.

I would expect the application you are logging in to with OIDC to be using session cookies anyway.

mxey avatar Dec 18 '18 09:12 mxey

@mxey indeed -- we do the same. But I'm a little uneasy with imposing this on everyone...

On a related note, I don't think my previous proposal (JWT authreq) makes the need for rate limiting go away: right now, unauthenticated requests cause database entries; with the proposal, they'd cause some calculations (checksums for the JWT signatures). So, it's still worth limiting the impact of what an unauthenticated peer can do... 🤔

srenatus avatar Dec 18 '18 09:12 srenatus

We're being hit by this now and then due to Apache Flink Admin UIs we have put behind authentication with oauth2proxy and dex, with a SAML IdP. The Admin UI will auto-refresh, and once all the authentication expires, it will continue to auto-refresh and cause a new authrequest every time it does so, eventually filling up etcd :-(

Just my two cents on ways this bug makes life harder.

forsberg avatar Jun 11 '19 09:06 forsberg

Cookies also would not prevent a deliberate filling of etcd by unauthenticated users. In our case, our own blackbox exporter had created over 500,000 auth entries in less than 24 hours.

gavinmcnair avatar Aug 08 '19 16:08 gavinmcnair

We just got hit by this issue pretty bad using the kubernetes storage backend.

In one of our clusters we ended up with 260K objects of the authrequests.dex.coreos.com CRD, causing loads of problems as you can imagine.

primeroz avatar Nov 28 '19 13:11 primeroz

We have reached 500K authrequest objects and the issues started. Is there an update to the Dex GC code to limit the "list" of authrequests so it can at least clean them slowly? How did you clean all the objects once GC was dead?

IliaGershenzon avatar Jan 08 '20 14:01 IliaGershenzon

@IliaGershenzon

The only way to clear the objects was to act directly on the storage layer. In our case, since it was kubernetes, we just deleted all the authrequests.dex.coreos.com resources; if you are using a different backend, just find those objects and delete them.

In our case we had a "misconfiguration" in the healthcheck of an OIDC proxy that was causing every healthcheck to create an authrequest. Reaching 500K authrequest objects means something is creating login requests and never finishing them... this must be a misconfiguration somewhere (or a DDoS? :) ). If you look at the actual content of those authrequests, you might see the client / path and understand what is creating them.

Make sure you find the source of your issue or cleaning up will be pointless

primeroz avatar Jan 08 '20 14:01 primeroz

@primeroz Yeah, found the source of the mess... It was a proxy someone deployed on top of Kiali (Istio). Dex should add "?limit=500" to the GC code :0 Thanks!

IliaGershenzon avatar Jan 08 '20 15:01 IliaGershenzon

Is there any real solution for this issue? Right now, anyone can run a curl loop and bring the cluster down.

githubrotem avatar Jan 12 '20 16:01 githubrotem

Not a real solution, but we just decided to move the storage away from kubernetes to its own backend. This way, the worst that can happen is the dex service gets DDoSed, not the kubernetes cluster.

primeroz avatar Jan 13 '20 08:01 primeroz

FWIW I took similar precautions, moving storage to a dedicated etcd cluster (of 1 node) to contain the impact of this bug to that etcd cluster instead of the main k8s API. This has worked out for me so far.

pieterlange avatar Jan 13 '20 11:01 pieterlange

Anyone have some tips to locate what's hammering our dex? I tried shutting down the services I thought were the culprit, but to no avail.

Kyrremann avatar Jan 31 '20 10:01 Kyrremann

@Kyrremann if you describe / look into the authrequest, you will see information about the client that created the request; that might help you identify the offender.

good luck!

primeroz avatar Jan 31 '20 10:01 primeroz

I've tried that, but maybe I misunderstand something, or is kubeflow-authservice-oidc the offender?

$ k describe authrequest zzvnxk7ejxydqnsbnvpusrm5k
Name:         zzvnxk7ejxydqnsbnvpusrm5k
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  dex.coreos.com/v1
Claims:
  Email:           
  Email Verified:  false
  User ID:         
  Username:        
Client ID:         kubeflow-authservice-oidc
Expiry:            2020-01-31T16:50:52.200156942Z
Kind:              AuthRequest
Logged In:         false
Metadata:
  Creation Timestamp:  2020-01-30T16:50:52Z
  Generation:          1
  Resource Version:    61443521
  Self Link:           /apis/dex.coreos.com/v1/namespaces/kubeflow/authrequests/zzvnxk7ejxydqnsbnvpusrm5k
  UID:                 11eb53c7-0772-4242-b1eb-145226daa34c
Redirect URI:          https://kubeflow.some.site/login/oidc
Response Types:
  code
Scopes:
  openid
  profile
  email
  groups
State:   BX15XBrv
Events:  <none>

I do have a lot of

2020/01/31 10:13:30 xxx.xxx.xxx.xx /notebook/opendata-dev/opendata-dev/api/sessions?1580465610725 Cookie not set, redirecting to login.

in the logs from authservice.

Kyrremann avatar Jan 31 '20 10:01 Kyrremann

Yeah, the redirect URI is sort of the client that initiated the auth request; it's the URI where an authenticated request will be redirected to.

primeroz avatar Jan 31 '20 10:01 primeroz

I'm wondering if the issue is that we don't have an internal load balancer, so we're redirecting between two ports externally when logging in (if I remember/understand our network correctly). So it may be the refresh token, or something similar, that triggers all these calls.

Kyrremann avatar Jan 31 '20 11:01 Kyrremann

Is this still an active issue with the latest Dex code?

marcjimz avatar May 24 '22 22:05 marcjimz

we are still seeing this issue with v2.24.0

aaron-arellano avatar Aug 18 '22 16:08 aaron-arellano

Hey, is it still the case? Anyone working on this?

CC'ing @sagikazarmark for awareness.

Dentrax avatar Dec 27 '22 14:12 Dentrax

Was there any update or fix for this issue? Or maybe a workaround? Im facing a similar issue.

bella-WI avatar May 08 '23 11:05 bella-WI

The problem only occurs with a high rate of authentication requests. The best way to solve the issue is to protect the authentication endpoint with a rate-limiting solution, e.g., a reverse proxy with that feature.

Here is an example of how you can protect your authentication endpoint with ingress-nginx in Kubernetes:

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dex
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - dex.example.com
    secretName: ingress-tls
  rules:
  - host: dex.example.com
    http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: dex
            port:
              number: 443
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dex-auth
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/limit-rpm: "20"
    # Works only for ingress-controllers >=0.40. It is here to not forget to add the annotation after upgrading ingress controller.
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "2"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - dex.example.com
    secretName: ingress-tls
  rules:
  - host: dex.example.com
    http:
      paths:
      - path: /auth
        pathType: ImplementationSpecific
        backend:
          service:
            name: dex
            port:
              number: 443

One more thing: do not forget to set the lifetime for auth requests (expiry.authRequests) to something like 10m. However, this will also limit how long users can stay on the login page.
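For reference, the corresponding dex configuration fragment might look like this (assuming a dex version that exposes the expiry.authRequests setting; 10m is an example value, tune it to how long your users may sit on the login page):

```yaml
# Sketch of the relevant part of dex's config.yaml.
expiry:
  # How long an unauthenticated AuthRequest lives before GC can remove it.
  # Shorter values bound how many requests can pile up in storage.
  authRequests: "10m"
```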

nabokihms avatar Aug 07 '23 22:08 nabokihms

@nabokihms, thanks for the proposed solution. Let me ask a few questions about this.

  1. Why two separate Ingresses? I don't really understand the purpose of the first one.
  2. Why port 443 and not 5556?
  3. Do I need to use certificates: create a ClusterIssuer and add the line cert-manager.io/cluster-issuer: cert-issuer to the annotations in Ingress?

yurkoff-mv avatar Nov 29 '23 11:11 yurkoff-mv

You need to use the port of your service. Two Ingresses are required to distinguish auth requests from other requests, so the rate limit applies only to the /auth path. As for certificates, it is up to you 🙂

nabokihms avatar Jan 26 '24 17:01 nabokihms