veracruz
veracruz copied to clipboard
Plumb Mbed TLS' RNG into Veracruz's platform services
Fix for https://github.com/veracruz-project/veracruz/issues/507 Updates https://github.com/veracruz-project/veracruz/pull/509
This is the longer-term solution for the entropy problem on Nitro.
The C vendor code uses an external veracruz_getrandom()
instead of the getrandom()
syscall.
veracruz_getrandom()
is defined in the Rust wrapper code and calls the associated veracruz_getrandom()
, defined in platform-services
and in charge of getting the random bytes in the most suitable way depending on the underlying platform.
The plumbing is done at the Cargo layer to avoid having to rely on hacks like passing flags to the linker, which has empirically been found tricky to get right (link order issues).
This is the longer-term solution for the entropy problem on Nitro.
I think we should mention in the log or in a comment what the correct long-term solution is, which is to make getrandom
work on Nitro, as I keep saying.
When we modify somebody else's crate we generally try to make useful changes that might eventually be accepted upstream. Obviously there is no chance whatsover of platform-services = { path = "../../../platform-services" }
being accepted upstream. That's not even a dependency on Veracruz; it's a dependency on a weird relative path!
If we really can't make getrandom
work and really have to provide random numbers from outside these crates, the correct way to do that would be to pass in a pointer to the random number generator when initialising mbedtls
. Probably not worth implementing that, but this dependency on path = "../../../platform-services"
is really horrible and should definitely be optional, selected by a feature, even if we choose to use it on all Veracruz platforms.
I really don't want rust-mbedtls to unconditionally depend on Veracruz stuff
platform-services
is a self-contained crate independent from the rest of the Veracruz codebase so there is no real backward dependency. What's the issue exactly? Veracruz is a portable platform for confidential compute, in that regard, I don't find it weird to link external dependencies against it. Though we should avoid changing them too much and having to work downstream.
I think we should mention in the log or in a comment what the correct long-term solution is, which is to make getrandom work on Nitro, as I keep saying.
Agreed, this is not the longest-term solution fixing the root cause. The entropy pool on Nitro should be seeded automatically, whenever necessary. How do we do that? We can't seed periodically because we don't know how much entropy is needed, nor when it is needed. We have to intercept getrandom()
somehow, maybe by using our own custom libc. If anyone has suggestions I encourage them to add them to #507.
When we modify somebody else's crate we generally try to make useful changes that might eventually be accepted upstream.
Aren't we past the point of no return on that? @egrimley-arm You have made significant changes to upstream rust-mbedtls, is it realistic to expect them to get merged anytime soon? Fwiw those changes broke the examples in rust-mbedtls/mbedtls/examples
.
That's not even a dependency on Veracruz; it's a dependency on a weird relative path!
Isn't that how we define path-based dependencies throughout the entire codebase? See sdk/generate-policy/Cargo.toml
for instance. Could we pass the path via an environment variable?
If we really can't make getrandom work and really have to provide random numbers from outside these crates, the correct way to do that would be to pass in a pointer to the random number generator when initialising mbedtls
Sounds like a good idea. Feel free to add it to #507 :+1:
We can't seed periodically because we don't know how much entropy is needed, nor when it is needed.
There is no need to seed periodically. Once it has been seeded, getrandom(buf, buflen, 0)
will never block. (Or at least that's my understanding; I've said it several times and no one has contradicted me!)
We have to intercept
getrandom()
somehow, maybe by using our own custom libc.
Why do you think that?
When we modify somebody else's crate we generally try to make useful changes that might eventually be accepted upstream.
Aren't we past the point of no return on that?
I don't believe there is any such point.
Nearly all of the changes I've made are upgrading the C code in mbedtls-sys/vendor
and various consequences of that. So a lot depends on whether and when upstream wants to upgrade.
The next biggest category is changes to make it work on IceCap. Those changes could either be made a bit more generic ("work without a file system") and contributed upstream as a feature option, or they could be removed if we are sure we do not want to use IceCap or anything like it in future (though currently IceCap is still tested by CI tests).
Fwiw those changes broke the examples in
rust-mbedtls/mbedtls/examples
.
That would be very easy to fix if anyone wants those examples. (In fact, one of the examples was already broken upstream.)
That's not even a dependency on Veracruz; it's a dependency on a weird relative path!
Isn't that how we define path-based dependencies throughout the entire codebase?
It's the way Veracruz depends on third-party
, but I don't think we've ever made a third-party
("vendored") crate depend on Veracruz in that way, so it's a new thing, I think (@mathias-arm would know).
Could we pass the path via an environment variable?
Probably not.
I think, for now, we should just make this dependency on Veracruz optional, selected by a feature. That also works as a kind of documentation because it's easy to see which bits of code belong to the modification.
There is no need to seed periodically. Once it has been seeded, getrandom(buf, buflen, 0) will never block. (Or at least that's my understanding; I've said it several times and no one has contradicted me!)
Entropy initialisation is one issue. Entropy shortage is another one that you can hit on Nitro when you open too many TLS connections in a short time, for instance. The latter is a real issue I had when benchmarking Veracruz-Nitro, although I can't say if you can reproduce it on any Nitro machine.
Nearly all of the changes I've made are upgrading the C code in mbedtls-sys/vendor and various consequences of that. So a lot depends on whether and when upstream wants to upgrade.
The build scripts are quite different as well (https://github.com/veracruz-project/rust-mbedtls/issues/1).
I think, for now, we should just make this dependency on Veracruz optional, selected by a feature. That also works as a kind of documentation because it's easy to see which bits of code belong to the modification.
Hmm that sounds a little excessive. Is there a point in making a bug fix optional?
Entropy shortage is another one that you can hit on Nitro when you open too many TLS connections in a short time, for instance. The latter is a real issue I had when benchmarking Veracruz-Nitro, although I can't say if you can reproduce it on any Nitro machine.
It would be interesting, of course, to know what's going wrong there, but I would guess that it's not caused by getrandom(buf, buflen, 0)
blocking, because, as I understand it, that call cannot work once and block later.
The build scripts are quite different as well (veracruz-project/rust-mbedtls#1).
That's a small change that upstream might want to use, isn't it, or do I misunderstand what you're referring to?
Hmm that sounds a little excessive. Is there a point in making a bug fix optional?
But it's not a bug fix; it's a work-around for a bug in another component, surely?
It would be interesting, of course, to know what's going wrong there, but I would guess that it's not caused by getrandom(buf, buflen, 0) blocking, because, as I understand it, that call cannot work once and block later.
What matters here is whether getrandom()
succeeds or not, which, as I understand it, is directly related to the state of the entropy pool.
That's a small change that upstream might want to use, isn't it, or do I misunderstand what you're referring to?
Yes that's what I'm referring to. If it does make it easier to update the vendor code then that would be interesting indeed.
But it's not a bug fix; it's a work-around for a bug in another component, surely?
AFAIU it's a bug/limitation in Nitro. Something that we can't fix directly. I haven't opened a ticket yet.
What matters here is whether
getrandom()
succeeds or not, which, as I understand it, is directly related to the status of the entropy pool.
What exactly do you mean? Do you agree that it is impossible for getrandom(buf, buflen, 0)
to succeed on one occasion but fail on a subsequent occasion?
AFAIU it's a bug/limitation in Nitro. Something that we can't fix directly. I haven't opened a ticket yet.
I think we perhaps could fix it by running something beforehand to initialise the kernel's RNG, but I don't know how hard that would be because I have never looked at the OS under Nitro: what init system there is and so on.
Do you agree that it is impossible for getrandom(buf, buflen, 0) to succeed on one occasion but fail on a subsequent occasion?
I don't. As mentioned above, I've seen programs successfully getting RN until getrandom()
failed, supposedly due to entropy shortage.
I think we perhaps could fix it by running something beforehand to initialise the kernel's RNG, but I don't know how hard that would be because I have never looked at the OS under Nitro: what init system there is and so on.
We could do that but that wouldn't fix the entropy shortage issue. The entropy pool must be refilled once in a while.
The command line executed in Nitro is defined in runtime-manager/dockerdir/Dockerfile
Do you agree that it is impossible for getrandom(buf, buflen, 0) to succeed on one occasion but fail on a subsequent occasion?
I don't. As mentioned above, I've seen programs successfully getting RN until
getrandom()
failed, supposedly due to entropy shortage.
The man page seems to state that that cannot happen. It seems rather unlikely that Nitro would have a kernel that has been hacked to make getrandom
behave differently.
I assume we are still talking about getrandom(buf, buflen, 0)
; note the third argument.
I think we perhaps could fix it by running something beforehand to initialise the kernel's RNG, but I don't know how hard that would be because I have never looked at the OS under Nitro: what init system there is and so on.
We could do that but that wouldn't fix the entropy shortage issue. The entropy pool must be refilled once in a while.
Who should I believe, you or the man page?
Have you read the man page? Do you disagree with my interpretation of it? If you are quite sure that you have discovered a mismatch between the man page and the behaviour of the Linux kernel, particularly in such an important area, you should definitely report it to the kernel developers.
The man page seems to state that that cannot happen
The man page mentions several times the case where "no random bytes are available", so I'd say that can happen.
By the way, by "fail" I mean that getrandom()
whether hangs for a long time or forever or doesn't return enough/any random bytes. Through that lens, the GRND_NONBLOCK
flag is irrelevant to this discussion.
I should also mention that on the Nitro machine failing, passing GRND_RANDOM
or GRND_NONBLOCK
to getrandom()
doesn't seem to make any difference: getrandom()
hangs regardless. Let me reproduce the experiment.
It seems rather unlikely that Nitro would have a kernel that has been hacked to make getrandom behave differently.
The man page (https://man7.org/linux/man-pages/man2/getrandom.2.html) describes the Linux specifications of the getrandom
syscall, but in practice it is directly impacted by the state of the entropy pool, which is something that might have been hacked.
It sounds as though we disagree about what the man page says, though you are not communicating very clearly:
The man page seems to state that that cannot happen
The man page mentions several times the case where "no random bytes are available", so I'd say that can happen.
Perhaps you are answering a different question from the one I asked? (What do you think "that" refers to?)
Of course, if you have seen getrandom(buf, buflen, GRND_NONBLOCK)
block then clearly we are dealing with a very weird version of getrandom
, which is possible: someone might have written some entirely different function and called it getrandom
. Or perhaps there's a bug in an old version of some C library. You could check with strace
, perhaps? getrandom(buf, buflen, GRND_NONBLOCK)
blocking, despite the flag, would stick out like a sore thumb in the output from strace
.
If getrandom
on Nitro is totally weird/broken, then you should produce a test case and report it.
See https://man7.org/linux/man-pages/man2/getrandom.2.html:
By default, getrandom() draws entropy from the urandom source
(i.e., the same source as the /dev/urandom device). [...]
If the urandom source has been initialized, reads of up to 256
bytes will always return as many bytes as requested [...]
If the urandom source has not yet been initialized, then
getrandom() will block, unless GRND_NONBLOCK is specified in
flags.
[...]
GRND_RANDOM
If this bit is set, then random bytes are drawn from the
random source (i.e., the same source as the /dev/random
device) instead of the urandom source. The random source
is limited based on the entropy that can be obtained from
environmental noise.
I take that to mean that getrandom
without GRND_RANDOM
is NOT "limited based on the entropy that can be obtained from environmental noise". But there is also other online information about the getrandom
system call on Linux, for example the LWN article: https://lwn.net/Articles/606141/
If we can't agree on what getrandom
does, or should do, then it's hard to know how we should proceed.
Is this PR still relevant?
Not anymore