Feature Request: Admin Token Recovery

Open peterbarnett03 opened this issue 8 months ago • 13 comments

Problem

If a user loses their admin-level token, there is currently no way to regenerate it. This is problematic for the obvious reason: data cannot be accessed.

We should not recover the token itself, but it should be possible to generate a new token that overwrites the lost admin token.

Ideas

Below are some different approaches I've thought of, in the order I'd rank them. They are not mutually exclusive, though; for example, safe mode and a recovery code could work well together.

Recovery Code (One-Time Use, Privately Stored)

During install or first token creation, the system generates a one-time recovery code (a long, randomized string) and shows it once to the user, with a clear “store this securely” warning. This code is a recovery code, not a token itself. It can't be used to access any data.

This code is hashed (standard security processes) and stored server-side. If the admin token is lost, a user can issue a token recovery command, passing this recovery code, by flag or by CLI entry. If it matches the stored hash, the system allows a one-time token overwrite, and outputs a new recovery code. Usage of the recovery code is logged.
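
A minimal sketch of the recovery-code flow described above, in Python purely for illustration (this is not InfluxDB code). The class name `RecoveryStore`, its methods, the in-memory storage, and the choice of SHA-256 are all assumptions; the proposal only says the code is hashed and stored server-side.

```python
import hashlib
import hmac
import secrets
from typing import List, Optional, Tuple


class RecoveryStore:
    """Hypothetical in-memory stand-in for the server-side store."""

    def __init__(self) -> None:
        self._code_hash: Optional[bytes] = None       # hash of the recovery code only
        self.admin_token_hash: Optional[bytes] = None  # hash of the admin token only
        self.audit_log: List[str] = []

    @staticmethod
    def _hash(value: str) -> bytes:
        # Assumption: the code is long and random, so a plain SHA-256 is used here.
        return hashlib.sha256(value.encode("utf-8")).digest()

    def issue_recovery_code(self) -> str:
        """Create a new code, keep only its hash, return the plaintext to show once."""
        code = secrets.token_urlsafe(32)
        self._code_hash = self._hash(code)
        return code

    def recover_admin_token(self, presented_code: str) -> Tuple[str, str]:
        """Exchange a valid recovery code for a new admin token and a new code."""
        if self._code_hash is None or not hmac.compare_digest(
            self._hash(presented_code), self._code_hash
        ):
            self.audit_log.append("recovery attempt rejected")
            raise PermissionError("invalid recovery code")
        new_token = secrets.token_urlsafe(32)
        self.admin_token_hash = self._hash(new_token)  # overwrite the lost token
        new_code = self.issue_recovery_code()          # rotation invalidates the old code
        self.audit_log.append("recovery code used; admin token overwritten")
        return new_token, new_code
```

The recovery code never grants data access by itself: it is only compared against the stored hash, and each successful use replaces it with a fresh one.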

Safe Mode

We introduce a one-time “recovery mode” triggered via CLI flag or environment variable (e.g., --safe-mode). When the server starts with this flag:

  • It starts in isolated mode: no external access, no writing, no querying, no processing, no compacting.
  • Only a local CLI call can issue a new admin token.
  • Once used, the flag must be removed and the system restarted normally.

Secondary Tool for Regeneration

One idea is to have a secondary tool that can only be received from InfluxData. This tool would, in some way, be able to bypass the locked-down security environment, and replace the admin token with a new version.

  • Downside: Hard for InfluxData to supply to Core users, given how many there are. Perhaps that's a simple download from our website. For Enterprise users, there's the additional security of using the licensed email in some way to provide a special download of the tool; e.g., on the backend we store a random hash tied to the licensed email, and we supply them that hash, which goes into the secondary tool.

Other thoughts

For Enterprise users, we could try leveraging their license email somehow, but I'm unsure how that would work for all instances, and it doesn't solve the problem for Core users. It may also be untenable for Home Enterprise users in all areas. Plus, it could compromise security in some way if email access is gained as well.

For review by @influxdata/monolith-team

peterbarnett03 avatar Apr 25 '25 17:04 peterbarnett03

cc @jdstrand

Thanks @peterbarnett03 for writing this up.

Recovery code sounds like a good idea and it hasn't been suggested before, but what happens if they lose the recovery code too? I think we need to be clear in the message that tokens and recovery codes are shown only once, and that if the user loses them they cannot access the db. The current message is almost exactly what was used in alpha, and I think it could do a better job of warning the user that they need to keep the token safe.

There is one other option I mentioned in this slack convo; it's actually a slight variant of something @jdstrand proposed (although I cannot find the source of that convo now, as I'm on flaky airport wifi). Basically, mount the regenerate endpoint itself on the loopback address (original proposal by @jdstrand). This would again have to be something the user opts into when starting the server. We could do a few variants of this too, e.g. while this is running, disallow writes (or allow them). So it requires restarting the server, but the regenerate endpoint will be accessible on that interface without requiring a token.

I don't mind which option we choose to implement by the way, I just wanted to capture this other option as well as part of this issue.

praveen-influx avatar Apr 27 '25 13:04 praveen-influx

Agree on wording and messaging. PR #26336 covers that portion, to clearly communicate the importance of storing the token. When we decide on a path forward, we may want to update it again.

peterbarnett03 avatar Apr 28 '25 02:04 peterbarnett03

Recovery Code (One-Time Use, Privately Stored)

This doesn't seem any different than giving out the token in the first place, which already says to keep it safe and store it somewhere secure.

One idea is to have a secondary tool that can only be received from InfluxData. This tool would, in some way, be able to bypass the locked-down security environment, and replace the admin token with a new version.

Let's not do this. We don't want to introduce backdoors into the system.

We introduce a one-time “recovery mode” triggered via CLI flag or environment variable (e.g., --safe-mode)

Something like this is ok. If you are in a position to restart the server, you could do various disruptive things (including tampering with the catalog) so from a security POV, it is 'ok'. @praveen-influx outlined some ideas on how to make this work on a technical level.

jdstrand avatar Apr 28 '25 11:04 jdstrand

For Recovery Code, I could see us perhaps emailing those to the licensed user; that ensures they're kept somewhere (unless deleted), while also not granting token-like access unless input as a serve arg. But that doesn't solve for Core use cases, and I'm not sure about best practices in general there; I don't feel compelled to use that approach after thinking through it more.

Researching the approaches other systems use, it feels like if you have access to the filesystem itself, then you should be able to reset the token, perhaps by manually editing a file or via a --safe-mode approach. I don't want to push direct file editing unless we move it out of the catalog in some way. Curious about @praveen-influx's thoughts on the technical side you mentioned.

peterbarnett03 avatar Apr 28 '25 13:04 peterbarnett03

I would like to add another functionality idea that was introduced in v2 but, to my knowledge, is still missing in v3. When you have access to the underlying data source and the actual physical storage, it is possible to recover user credentials in v2 using the influxd recovery utility (https://docs.influxdata.com/influxdb/v2/admin/users/recover-credentials/).

As of now there is no way to do this in v3. This has already been proven to be secure since you have access to the data in its physical representation.

Of course this won't solve the issue if no access to the physical data is present as described by the author of this issue, but would still be a nice addition considering this is already present in earlier influxdb versions.

MatthiasWerning avatar Apr 28 '25 14:04 MatthiasWerning

influxd recovery

Note that we are in the process of removing the storage of the raw token data in InfluxDB 2.x. Part of that work will involve the ability to create a new token; this will all be done on the machine running the service. I bring this up because while we don't want to store raw token data in any products, we do want the ability for people to reset them.

jdstrand avatar Apr 29 '25 21:04 jdstrand

@peterbarnett03 I can implement the --safe-mode approach such that the regenerate endpoint is mounted without requiring any auth.

  • Do we need to stop any writes, queries etc? The reason I ask is that if this is just to help recover a lost operator token, and we trust that the other db tokens etc are still valid, then we can just let the server keep accepting writes, queries etc as long as a valid token is passed in, and allow regenerate without even restarting the server. You would have to add a command that listens only on the loopback address or a passed-in interface, and once that's run we can allow a subsequent call to regenerate to work without a token. If we decide that --safe-mode should only work by restarting the server due to security reasons (@jdstrand?) that's ok, but I wanted to clarify.

  • I also think keeping the tokens within the catalog is better; externalizing them to a separate file is probably not good, especially in multi-server scenarios.

praveen-influx avatar May 12 '25 16:05 praveen-influx

Do we need to stop any writes, queries etc? The reason I ask is that if this is just to help recover a lost operator token, and we trust that the other db tokens etc are still valid, then we can just let the server keep accepting writes, queries etc as long as a valid token is passed in, and allow regenerate without even restarting the server. You would have to add a command that listens only on the loopback address or a passed-in interface, and once that's run we can allow a subsequent call to regenerate to work without a token. If we decide that --safe-mode should only work by restarting the server due to security reasons (@jdstrand?) that's ok, but I wanted to clarify.

I think least surprise would say that --safe-mode restricts access in some manner since doing otherwise opens the server up to attack during the regeneration window (restarting the server is already a disruptive operation). It somewhat depends on how this is going to be implemented:

  • if --safe-mode disables all authz, for security we must limit to listening on loopback or similar (if we didn't, I think this would constitute a CVE)
  • if --safe-mode keeps authz generally but opens up the endpoint for resetting the token (without authz), for security we could choose to still limit to listening on loopback (or similar) for everything, but better would be to accept (with authz) for all but the regeneration endpoint where we only accept requests from loopback (or similar)
  • --safe-mode could keep authz for everything but open up the endpoint for resetting the token (without authz) on a new (configurable) port. By default we use something like 127.0.0.1:8182 but allow the user to specify another listening address and port. I quite like this approach since it has the right balance of default security, is cross-platform and accommodates managed environments
  • --safe-mode could keep authz for everything but open up a UNIX domain socket for resetting the token. This is nice from a security POV but likely problematic in managed environments and something else would need to be done for Windows
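
To make the third option concrete, here is a rough Python sketch of the access decision it implies: authz stays on for the main listener, while a separate listener serves only the regeneration endpoint without authz. Everything named here (`Request`, `is_allowed`, the listener labels, and the `/regenerate-admin-token` path) is hypothetical and not an InfluxDB API. Allowing regeneration on the main listener with a valid admin token is also an assumption of this sketch.

```python
from dataclasses import dataclass

MAIN_LISTENER = "main"          # e.g. port 8181, authz enforced
RECOVERY_LISTENER = "recovery"  # e.g. 127.0.0.1:8182, regeneration only

REGENERATE_PATH = "/regenerate-admin-token"  # hypothetical path, for illustration


@dataclass(frozen=True)
class Request:
    listener: str                       # which listener accepted the connection
    path: str                           # requested endpoint
    token_is_valid: bool = False        # any valid token was presented
    token_is_valid_admin: bool = False  # a valid admin token was presented


def is_allowed(req: Request) -> bool:
    """Decide whether a request may proceed under the 'third option'."""
    if req.listener == RECOVERY_LISTENER:
        # The recovery listener exposes nothing except regeneration, no token needed.
        return req.path == REGENERATE_PATH
    if req.path == REGENERATE_PATH:
        # On the main listener, regeneration still requires an admin token.
        return req.token_is_valid_admin
    # All other operations keep their normal authz.
    return req.token_is_valid
```

The security of the unauthenticated path then rests entirely on what address the recovery listener is bound to, which is why a loopback default matters.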

jdstrand avatar May 13 '25 15:05 jdstrand

I would think --safe-mode would disable all reads and writes. The purpose I'm thinking of is that it acts as a full gate for the most security-sensitive operation. Over time, I would think we could enable more token resets and batch updating in --safe-mode, for the event that many tokens are exposed at once.

I am not certain, though, whether there may be UX issues here. Getting into the internals of a Docker environment without interactive mode turned on can be difficult (if I recall). I am also unsure whether this would create a massive burden on something like Timestream, which may want a more API-based way to regenerate the token. So the "managed environments" piece that @jdstrand mentions could really benefit from this.

@jdstrand Is the idea that the port is only opened for internal network access, so a managed solution would have access to it, but external wouldn't?

I think I like #3 the most if it fulfills that need.

peterbarnett03 avatar May 14 '25 13:05 peterbarnett03

--safe-mode could keep authz for everything but open up the endpoint for resetting the token (without authz) on a new (configurable) port. By default we use something like 127.0.0.1:8182 but allow the user to specify another listening address and port. I quite like this approach since it has the right balance of default security, is cross-platform and accommodates managed environments

That's definitely the option I was thinking of as well: a configurable interface and/or port so that users can choose to expose this endpoint when running in --safe-mode. However, I was assuming reads/writes and every other operation would still be working and not blocked. I think @peterbarnett03's idea here is to only expose the regenerate endpoint on a user-specified interface/port, shutting down all reads/writes.

praveen-influx avatar May 14 '25 15:05 praveen-influx

To clarify, I think the option you mentioned right there (@jdstrand's suggestion on a new port) is fine as is without changes to what can/can't get through. But I don't know if we should call that --safe-mode since that's not really anything different. If anything it's adding more usability.

peterbarnett03 avatar May 14 '25 15:05 peterbarnett03

I would think --safe-mode would disable all reads and writes. The purpose I'm thinking of is that it acts as a full gate for the most security-sensitive operation. Over time, I would think we could enable more token resets and batch updating in --safe-mode, for the event that many tokens are exposed at once.

Note, --safe-mode is only needed for regenerating the first admin token (aka, the operator token named _admin), which is meant to be non-deletable; all other tokens (admin or not) don't require --safe-mode and can simply be deleted/recreated.

I am not certain, though, whether there may be UX issues here. Getting into the internals of a Docker environment without interactive mode turned on can be difficult (if I recall). I am also unsure whether this would create a massive burden on something like Timestream, which may want a more API-based way to regenerate the token. So the "managed environments" piece that @jdstrand mentions could really benefit from this.

@jdstrand Is the idea that the port is only opened for internal network access, so a managed solution would have access to it, but external wouldn't?

I think I like #3 the most if it fulfills that need.

Yes, the 3rd option with a (configurable but) default of listening on 127.0.0.1:8182 (or similar) for just the admin token regeneration endpoint (and leaving 8181 open with authz) allows minimal disruption during the operator token regeneration with reasonable security:

  • self-managed users who log into the machine to restart with --safe-mode can use the default setting of 127.0.0.1:8182. This lets all operations on port 8181 continue to work, but also binds to the loopback address of 127.0.0.1 and port 8182 for the regeneration endpoint so the user can then call influxdb3 create token --admin --regenerate to regenerate. In this manner, the regeneration endpoint is reasonably protected since influxdb isn't reachable over the (non-loopback) network
  • operators in the managed environment (assume that influxdb is running on 10.11.12.13:8181) can restart with something like --safe-mode --regenerate-endpoint-listen 10.11.12.13:8182. This lets all operations on port 8181 continue to work, but also binds to the internal address of 10.11.12.13 and port 8182 so an operator in the managed environment can then call influxdb3 create token --admin --regenerate --host https://10.11.12.13:8182. In this manner, the regeneration endpoint is protected to the degree that 10.11.12.13:8182 is protected.

This 3rd option allows flexibility and security for managed environments such as when the managed environment control plane has one network interface for its users and another for operators. Consider an AWS scenario where the user's instance has an internal IP of 10.11.12.13 that is bound to a public address and configured security groups for the user to get to it over the internet, but a different private network interface and IP of 192.168.12.13 for management operations that is never bound to a public address (or accessible to users). Then if AWS wants to regenerate the token for whatever reason, they can restart with --safe-mode --regenerate-endpoint-listen 192.168.12.13:8182, connect to it via their control plane (which is inaccessible to the user) to regenerate, then restart.
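
Since the protection of the regeneration endpoint comes down to which address it binds to, a startup check could classify the configured address and warn on anything that is not loopback or private. The sketch below uses Python's standard `ipaddress` module; the function name `classify_listen_address` and the category labels are hypothetical, and it assumes IPv4 `ADDR:PORT` strings like the ones used in this thread.

```python
import ipaddress


def classify_listen_address(listen: str) -> str:
    """Classify an IPv4 'ADDR:PORT' string by how exposed the bind address is."""
    host, _, port = listen.rpartition(":")
    addr = ipaddress.ip_address(host)  # raises ValueError on a malformed address
    if not 0 < int(port) < 65536:
        raise ValueError(f"port out of range: {port}")
    # Order matters: loopback and 0.0.0.0 also count as 'private' in ipaddress.
    if addr.is_loopback:
        return "loopback"
    if addr.is_unspecified:
        return "all-interfaces"
    if addr.is_private:
        return "private"
    return "public"
```

With the addresses from the examples above, 127.0.0.1:8182 is classified as loopback, while 10.11.12.13:8182 and 192.168.12.13:8182 are both classified as private. Whether a private address is actually unreachable by users still depends on the network setup, as in the AWS scenario.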

jdstrand avatar May 14 '25 15:05 jdstrand

But I don't know if we should call that --safe-mode since that's not really anything different. If anything it's adding more usability.

That's a fair point. I mentioned something like --safe-mode --regenerate-endpoint-listen 192.168.12.13:8182. Perhaps this simply becomes --regenerate-endpoint-listen where with no arguments it listens on 127.0.0.1:8182 but it can also take an optional argument for the address and port. --regenerate-endpoint-listen is pretty clear, but there might be better wording.
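
As an illustration of the "flag with an optional argument" behavior, here is how it could look with Python's `argparse`. This is only a sketch of the semantics, not the influxdb3 CLI: the parser, the `serve-sketch` program name, and the helper `regenerate_listen_address` are hypothetical; only the flag name and the 127.0.0.1:8182 default come from the discussion.

```python
import argparse
from typing import List, Optional

DEFAULT_LISTEN = "127.0.0.1:8182"  # default suggested in this thread


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument(
        "--regenerate-endpoint-listen",
        nargs="?",             # the value is optional
        const=DEFAULT_LISTEN,  # used when the flag is given with no value
        default=None,          # used when the flag is absent: endpoint stays off
        metavar="ADDR:PORT",
        help="expose the admin token regeneration endpoint on ADDR:PORT",
    )
    return parser


def regenerate_listen_address(argv: List[str]) -> Optional[str]:
    """Return the address to bind the regeneration endpoint to, or None if disabled."""
    return build_parser().parse_args(argv).regenerate_endpoint_listen
```

Under these semantics, omitting the flag leaves the endpoint disabled, passing it bare binds to loopback, and passing an address such as 192.168.12.13:8182 binds there instead.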

jdstrand avatar May 14 '25 16:05 jdstrand

Do we still want to allow regeneration when started with auth (requiring an admin token)? Or do we want to only expose this endpoint when started without auth on a user-defined interface/port?

praveen-influx avatar Jun 20 '25 08:06 praveen-influx

Do we still want to allow regeneration when started with auth (requiring an admin token)? Or do we want to only expose this endpoint when started without auth on a user-defined interface/port?

Being able to regenerate by providing a valid admin token seems reasonable to me.

jdstrand avatar Jun 24 '25 18:06 jdstrand