Feature Request: Admin Token Recovery
Problem
If a user loses their admin-level token, there is currently no way to regenerate it, which means their data can no longer be accessed.
We should not recover the lost token itself; instead, the user should be able to generate a new token that overwrites the lost admin token.
Ideas
Below are some different approaches I've thought of, in the stack-ranked order I'd suggest. They are not mutually exclusive though; for example, safe mode and a recovery code could work well together.
Recovery Code (One-Time Use, Privately Stored)
During install or first token creation, the system generates a one-time recovery code (a long, randomized string) and shows it once to the user, with a clear “store this securely” warning. This code is a recovery code, not a token itself. It can't be used to access any data.
This code is hashed (standard security processes) and stored server-side. If the admin token is lost, a user can issue a token recovery command, passing this recovery code, by flag or by CLI entry. If it matches the stored hash, the system allows a one-time token overwrite, and outputs a new recovery code. Usage of the recovery code is logged.
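For illustration, a minimal sketch of how the hash-and-verify portion could work, using openssl and sha256sum purely as stand-ins (the file path is hypothetical, and a real implementation would use a proper password-hashing function rather than bare SHA-256):

```sh
# Generate a long random recovery code and show it exactly once
RECOVERY_CODE=$(openssl rand -hex 32)
echo "Store this recovery code securely; it will not be shown again: $RECOVERY_CODE"

# Server side: persist only the hash, never the code itself
# (path is hypothetical; a real implementation would use argon2/bcrypt, not plain SHA-256)
printf '%s' "$RECOVERY_CODE" | sha256sum | awk '{print $1}' > /var/lib/influxdb3/recovery_code.sha256

# Recovery: the user presents the code, the server re-hashes and compares;
# on a match it permits a one-time admin-token overwrite, emits a new
# recovery code, and logs the usage
CANDIDATE=$(printf '%s' "$USER_SUPPLIED_CODE" | sha256sum | awk '{print $1}')
[ "$CANDIDATE" = "$(cat /var/lib/influxdb3/recovery_code.sha256)" ] && echo "allow one-time token overwrite"
```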
Safe Mode
We introduce a one-time “recovery mode” triggered via CLI flag or environment variable (e.g., --safe-mode). When the server starts with this flag:
- It starts in isolated mode with no external access, no writing, no querying, no processing, and no compacting; nothing beyond what is needed for recovery.
- Only a local CLI call can issue a new admin token.
- Once used, the flag must be removed and the system restarted normally (a rough flow is sketched below).
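As a rough end-to-end sketch of that flow (the --safe-mode and --regenerate flags are proposals from this issue, not shipped options):

```sh
# Hypothetical flow; flag names are proposals, not existing CLI options
influxdb3 serve --safe-mode                    # isolated start: no writes, queries, or external access
influxdb3 create token --admin --regenerate    # local CLI call issues the replacement admin token
# remove --safe-mode from the start command and restart the server normally
```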
Secondary Tool for Regeneration
One idea is to have a secondary tool that can only be received from InfluxData. This tool would, in some way, be able to bypass the locked-down security environment, and replace the admin token with a new version.
- Downside: Hard for InfluxData to supply to Core users given their volume. Perhaps that's a simple download from our website, but for Enterprise users, there's the additional security of using the licensed email in some way to provide a special download of the tool; e.g., on the backend we store a random hash tied to the licensed email, and we supply them that hash, which is then fed into the secondary tool.
Other thoughts
For Enterprise users, we could try leveraging their license email somehow, but I'm unsure how that would play in for all instances, and it doesn't solve the problem for Core users. It may also be untenable for Home Enterprise users in all areas. Plus, it could compromise security if email access is gained as well.
For review by @influxdata/monolith-team
cc @jdstrand
Thanks @peterbarnett03 for writing this up.
Recovery code sounds like a good idea and it's not been suggested before, but what happens if they lose the recovery code too? I think we need to be clear in the message that the tokens and recovery codes are a one-time thing, and if the user loses them, they cannot access the db. The current message is essentially what was used in alpha, and I think it could do a better job of warning the user that they need to keep the token safe.
There is one other option I mentioned in this slack convo; it's actually a slight variant of something @jdstrand proposed (although I cannot find the source of that convo now - on flaky airport wifi). Basically, mount the regenerate endpoint itself on the loopback address (original proposal by @jdstrand). This would again have to be something the user opts into when starting the server. We could do a few variants of this too, i.e. when this is running don't allow any writes (or allow writes) etc. So, it requires restarting the server, but the regenerate endpoint will be accessible on that interface without requiring a token (a rough client-side sketch is below).
I don't mind which option we choose to implement by the way, I just wanted to capture this other option as well as part of this issue.
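A rough client-side sketch of that loopback-mounted variant, with placeholder names throughout (the flag, port, and endpoint path below are illustrative only; nothing here is implemented):

```sh
# Hypothetical: opt in at startup so the regenerate endpoint is bound to loopback only
influxdb3 serve --regenerate-endpoint-listen 127.0.0.1:8182

# From the same machine, no token is required because only loopback can reach the endpoint
# (the path below is a placeholder, not a confirmed API)
curl -X POST http://127.0.0.1:8182/api/v3/configure/token/admin/regenerate
```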
Agree on wording and messaging. PR #26336 covers that portion, clearly communicating the importance of storing the token safely. When we decide on a path forward, we may want to update it again.
Recovery Code (One-Time Use, Privately Stored)
This doesn't seem any different than giving out the token in the first place, which already says to keep it safe and store it somewhere secure.
One idea is to have a secondary tool that can only be received from InfluxData. This tool would, in some way, be able to bypass the locked-down security environment, and replace the admin token with a new version.
Let's not do this. We don't want to introduce backdoors into the system.
We introduce a one-time “recovery mode” triggered via CLI flag or environment variable (e.g., --safe-mode)
Something like this is ok. If you are in a position to restart the server, you could do various disruptive things (including tampering with the catalog) so from a security POV, it is 'ok'. @praveen-influx outlined some ideas on how to make this work on a technical level.
For Recovery Code, I could see us perhaps emailing those to the licensed user; that ensures they're kept somewhere (unless deleted), while also not granting token-like access unless input as a serve arg. But it doesn't solve for Core use cases, and I'm not sure on best practices in general there; I don't feel compelled to use that approach after thinking through it more.
Researching other approaches systems use, it feels like if you have access to the filesystem itself, then you should be able to reset the token; perhaps by manual edit of a file, or a --safe-mode approach. I don't want to push direct file editing unless we move it out of the catalog in some way. Curious on @praveen-influx's thoughts on the technical side you mentioned.
I would like to add another functionality idea that was introduced in v2 but, to my knowledge, is still missing in v3.
With access to the underlying data source and the actual physical storage, it is possible to recover user credentials in v2 using the influxd recovery utility (https://docs.influxdata.com/influxdb/v2/admin/users/recover-credentials/).
As of now there is no way to do this in v3. This has already proven to be secure, since it requires access to the data in its physical representation.
Of course this won't solve the issue if no access to the physical data is present, as described by the author of this issue, but it would still be a nice addition considering it is already present in earlier InfluxDB versions.
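For reference, the v2 workflow runs locally on the host that has the physical storage; roughly something like the following (subcommand and flags recalled from the linked docs page; verify there for exact syntax):

```sh
# Run on the machine hosting the InfluxDB 2.x data (operates directly on local storage)
influxd recovery auth list                                           # inspect existing tokens
influxd recovery auth create-operator --org my-org --username admin  # mint a replacement operator token
```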
influxd recovery
Note that we are in the process of removing the storage of the raw token data in InfluxDB 2.x. Part of that work will involve the ability to create a new token; this will all be done on the machine running the service. I bring this up because while we don't want to store raw token data in any products, we do want people to have the ability to reset them.
@peterbarnett03 I can implement the --safe-mode approach such that the regenerate endpoint is mounted without requiring any auth.
- Do we need to stop any writes, queries etc? Reason I ask is, if it is just to help recover a lost operator token and we trust that the other db tokens etc are still valid, then we can just let the server run accepting writes, queries etc as long as a valid token is passed in, and allow regenerate without even restarting the server. You would have to add a command that listens only on the loopback address or a passed-in interface, and once that's run we can allow a subsequent call to regenerate to work without a token. If we decide that --safe-mode should only work by restarting the server due to security reasons (@jdstrand?) that's ok, but wanted to clarify.
- I also think keeping the tokens within the catalog is better; externalizing them to a separate file is probably not good, especially with multi-server scenarios.
Do we need to stop any writes, queries etc? Reason I ask is, if it is just to help recover a lost operator token and we trust that the other db tokens etc are still valid, then we can just let the server run accepting writes, queries etc as long as a valid token is passed in, and allow regenerate without even restarting the server. You would have to add a command that listens only on the loopback address or a passed-in interface, and once that's run we can allow a subsequent call to regenerate to work without a token. If we decide that --safe-mode should only work by restarting the server due to security reasons (@jdstrand?) that's ok, but wanted to clarify.
I think least surprise would say that --safe-mode restricts access in some manner since doing otherwise opens the server up to attack during the regeneration window (restarting the server is already a disruptive operation). It somewhat depends on how this is going to be implemented:
- if --safe-mode disables all authz, for security we must limit to listening on loopback or similar (if we didn't, I think this would constitute a CVE)
- if --safe-mode keeps authz generally but opens up the endpoint for resetting the token (without authz), for security we could choose to still limit to listening on loopback (or similar) for everything, but better would be to accept (with authz) for all but the regeneration endpoint, where we only accept requests from loopback (or similar)
- --safe-mode could keep authz for everything but open up the endpoint for resetting the token (without authz) on a new (configurable) port. By default we use something like 127.0.0.1:8182 but allow the user to specify another listening address and port. I quite like this approach since it has the right balance of default security, is cross-platform and accommodates managed environments
- --safe-mode could keep authz for everything but open up a UNIX domain socket for resetting the token. This is nice from a security POV but likely problematic in managed environments and something else would need to be done for Windows
I would think --safe-mode would disable all reads and writes. The purpose I'm thinking of is that it acts as a full gate for the most security-sensitive operation. Over time, I would think in --safe-mode we could enable more resets of tokens and batch updating, in the event many tokens are exposed at once.
I am not certain though if there may be UX issues here. Getting into the internals of a Docker environment without interactive mode turned on can be difficult (if I recall). I also am unsure if this would create a massive burden on something like Timestream, which may want a more API-based way to regenerate the token. So the "managed environments" piece that @jdstrand mentions could really benefit from this.
@jdstrand Is the idea that the port is only opened for internal network access, so a managed solution would have access to it, but external wouldn't?
I think I like #3 the most if it fulfills that need.
--safe-mode could keep authz for everything but open up the endpoint for resetting the token (without authz) on a new (configurable) port. By default we use something like 127.0.0.1:8182 but allow the user to specify another listening address and port. I quite like this approach since it has the right balance of default security, is cross-platform and accommodates managed environments
That's definitely the option I was thinking of as well: a configurable interface and/or port so that users can choose to expose this endpoint when running in --safe-mode. Although I was assuming reads/writes and every other operation would still be working and not blocked. I think @peterbarnett03's idea here is to only expose the regenerate endpoint on a user-specified interface/port, shutting down all reads/writes.
To clarify, I think the option you mentioned right there (@jdstrand's suggestion on a new port) is fine as is without changes to what can/can't get through. But I don't know if we should call that --safe-mode since that's not really anything different. If anything it's adding more usability.
I would think --safe-mode would disable all reads and writes. The purpose I'm thinking of is that it acts as a full gate for the most security-sensitive operation. Over time, I would think in --safe-mode we could enable more resets of tokens and batch updating, in the event many tokens are exposed at once.
Note, --safe-mode is only needed for regenerating the first admin token (aka, the operator token named _admin), which is meant to not be deletable; all other tokens (admin or not) don't require --safe-mode and can simply be deleted/recreated.
I am not certain though if there may be UX issues here. Getting into the internals of a Docker environment without interactive mode turned on can be difficult (if I recall). I also am unsure if this would create a massive burden on something like Timestream, which may want a more API-based way to regenerate the token. So the "managed environments" piece that @jdstrand mentions could really benefit from this.
@jdstrand Is the idea that the port is only opened for internal network access, so a managed solution would have access to it, but external wouldn't?
I think I like #3 the most if it fulfills that need.
Yes, the 3rd option with a (configurable but) default of listening on 127.0.0.1:8182 (or similar) for just the admin token regeneration endpoint (and leaving 8181 open with authz) allows minimal disruption during the operator token regeneration with reasonable security:
- self-managed users who log into the machine to restart with --safe-mode can use the default setting of 127.0.0.1:8182. This lets all operations on port 8181 continue to work, but also binds to the loopback address of 127.0.0.1 and port 8182 for the regeneration endpoint, so the user can then call influxdb3 create token --admin --regenerate to regenerate. In this manner, the regeneration endpoint is reasonably protected since influxdb isn't reachable over the (non-loopback) network
- operators in the managed environment (assume that influxdb is running on 10.11.12.13:8181) can restart with something like --safe-mode --regenerate-endpoint-listen 10.11.12.13:8182. This lets all operations on port 8181 continue to work, but also binds to the internal address of 10.11.12.13 and port 8182, so an operator in the managed environment can then call influxdb3 create token --admin --regenerate --host https://10.11.12.13:8182. In this manner, the regeneration endpoint is protected to the degree that 10.11.12.13:8182 is protected.
This 3rd option allows flexibility and security for managed environments such as when the managed environment control plane has one network interface for its users and another for operators. Consider an AWS scenario where the user's instance has an internal IP of 10.11.12.13 that is bound to a public address and configured security groups for the user to get to it over the internet, but a different private network interface and IP of 192.168.12.13 for management operations that is never bound to a public address (or accessible to users). Then if AWS wants to regenerate the token for whatever reason, they can restart with --safe-mode --regenerate-endpoint-listen 192.168.12.13:8182, connect to it via their control plane (which is inaccessible to the user) to regenerate, then restart.
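Putting the two flows above side by side (all flags here are still proposals from this thread, not shipped behaviour; addresses match the examples above):

```sh
# Self-managed host: regeneration endpoint defaults to loopback
influxdb3 serve --safe-mode
influxdb3 create token --admin --regenerate

# Managed environment: bind the regeneration endpoint to an operator-only internal interface
influxdb3 serve --safe-mode --regenerate-endpoint-listen 192.168.12.13:8182
influxdb3 create token --admin --regenerate --host https://192.168.12.13:8182
# afterwards, restart without the extra flags
```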
But I don't know if we should call that --safe-mode since that's not really anything different. If anything it's adding more usability.
That's a fair point. I mentioned something like --safe-mode --regenerate-endpoint-listen 192.168.12.13:8182. Perhaps this simply becomes --regenerate-endpoint-listen where with no arguments it listens on 127.0.0.1:8182 but it can also take an optional argument for the address and port. --regenerate-endpoint-listen is pretty clear, but there might be better wording.
Do we still want to allow regeneration to be possible when started with auth (requiring an admin token)? Or do we want to only expose this endpoint when started without auth on a user-defined interface/port?
Do we still want to allow regeneration to be possible when started with auth (requiring an admin token)? Or do we want to only expose this endpoint when started without auth on a user-defined interface/port?
Being able to regenerate by providing a valid admin token seems reasonable to me.