veracruz icon indicating copy to clipboard operation
veracruz copied to clipboard

Plumb Mbed TLS' RNG into Veracruz's platform services

Open gbryant-arm opened this issue 2 years ago • 12 comments

Fix for https://github.com/veracruz-project/veracruz/issues/507 Updates https://github.com/veracruz-project/veracruz/pull/509

This is the longer-term solution for the entropy problem on Nitro. The C vendor code uses an external veracruz_getrandom() instead of the getrandom() syscall. veracruz_getrandom() is defined in the Rust wrapper code and calls the associated veracruz_getrandom(), defined in platform-services and in charge of getting the random bytes in the most suitable way depending on the underlying platform. The plumbing is done at the Cargo layer to avoid having to rely on hacks like passing flags to the linker, which has empirically been found tricky to get right (link order issues).

gbryant-arm avatar Aug 17 '22 17:08 gbryant-arm

This is the longer-term solution for the entropy problem on Nitro.

I think we should mention in the log or in a comment what the correct long-term solution is, which is to make getrandom work on Nitro, as I keep saying.

When we modify somebody else's crate we generally try to make useful changes that might eventually be accepted upstream. Obviously there is no chance whatsover of platform-services = { path = "../../../platform-services" } being accepted upstream. That's not even a dependency on Veracruz; it's a dependency on a weird relative path!

If we really can't make getrandom work and really have to provide random numbers from outside these crates, the correct way to do that would be to pass in a pointer to the random number generator when initialising mbedtls. Probably not worth implementing that, but this dependency on path = "../../../platform-services" is really horrible and should definitely be optional, selected by a feature, even if we choose to use it on all Veracruz platforms.

egrimley-arm avatar Aug 17 '22 20:08 egrimley-arm

I really don't want rust-mbedtls to unconditionally depend on Veracruz stuff

platform-services is a self-contained crate independent from the rest of the Veracruz codebase so there is no real backward dependency. What's the issue exactly? Veracruz is a portable platform for confidential compute, in that regard, I don't find it weird to link external dependencies against it. Though we should avoid changing them too much and having to work downstream.

I think we should mention in the log or in a comment what the correct long-term solution is, which is to make getrandom work on Nitro, as I keep saying.

Agreed, this is not the longest-term solution fixing the root cause. The entropy pool on Nitro should be seeded automatically, whenever necessary. How do we do that? We can't seed periodically because we don't know how much entropy is needed, nor when it is needed. We have to intercept getrandom() somehow, maybe by using our own custom libc. If anyone has suggestions I encourage them to add them to #507.

When we modify somebody else's crate we generally try to make useful changes that might eventually be accepted upstream.

Aren't we past the point of no return on that? @egrimley-arm You have made significant changes to upstream rust-mbedtls, is it realistic to expect them to get merged anytime soon? Fwiw those changes broke the examples in rust-mbedtls/mbedtls/examples.

That's not even a dependency on Veracruz; it's a dependency on a weird relative path!

Isn't that how we define path-based dependencies throughout the entire codebase? See sdk/generate-policy/Cargo.toml for instance. Could we pass the path via an environment variable?

If we really can't make getrandom work and really have to provide random numbers from outside these crates, the correct way to do that would be to pass in a pointer to the random number generator when initialising mbedtls

Sounds like a good idea. Feel free to add it to #507 :+1:

gbryant-arm avatar Aug 18 '22 10:08 gbryant-arm

We can't seed periodically because we don't know how much entropy is needed, nor when it is needed.

There is no need to seed periodically. Once it has been seeded, getrandom(buf, buflen, 0) will never block. (Or at least that's my understanding; I've said it several times and no one has contradicted me!)

We have to intercept getrandom() somehow, maybe by using our own custom libc.

Why do you think that?

When we modify somebody else's crate we generally try to make useful changes that might eventually be accepted upstream.

Aren't we past the point of no return on that?

I don't believe there is any such point.

Nearly all of the changes I've made are upgrading the C code in mbedtls-sys/vendor and various consequences of that. So a lot depends on whether and when upstream wants to upgrade.

The next biggest category is changes to make it work on IceCap. Those changes could either be made a bit more generic ("work without a file system") and contributed upstream as a feature option, or they could be removed if we are sure we do not want to use IceCap or anything like it in future (though currently IceCap is still tested by CI tests).

Fwiw those changes broke the examples in rust-mbedtls/mbedtls/examples.

That would be very easy to fix if anyone wants those examples. (In fact, one of the examples was already broken upstream.)

That's not even a dependency on Veracruz; it's a dependency on a weird relative path!

Isn't that how we define path-based dependencies throughout the entire codebase?

It's the way Veracruz depends on third-party, but I don't think we've ever made a third-party ("vendored") crate depend on Veracruz in that way, so it's a new thing, I think (@mathias-arm would know).

Could we pass the path via an environment variable?

Probably not.

I think, for now, we should just make this dependency on Veracruz optional, selected by a feature. That also works as a kind of documentation because it's easy to see which bits of code belong to the modification.

egrimley-arm avatar Aug 18 '22 10:08 egrimley-arm

There is no need to seed periodically. Once it has been seeded, getrandom(buf, buflen, 0) will never block. (Or at least that's my understanding; I've said it several times and no one has contradicted me!)

Entropy initialisation is one issue. Entropy shortage is another one that you can hit on Nitro when you open too many TLS connections in a short time, for instance. The latter is a real issue I had when benchmarking Veracruz-Nitro, although I can't say if you can reproduce it on any Nitro machine.

Nearly all of the changes I've made are upgrading the C code in mbedtls-sys/vendor and various consequences of that. So a lot depends on whether and when upstream wants to upgrade.

The build scripts are quite different as well (https://github.com/veracruz-project/rust-mbedtls/issues/1).

I think, for now, we should just make this dependency on Veracruz optional, selected by a feature. That also works as a kind of documentation because it's easy to see which bits of code belong to the modification.

Hmm that sounds a little excessive. Is there a point in making a bug fix optional?

gbryant-arm avatar Aug 18 '22 11:08 gbryant-arm

Entropy shortage is another one that you can hit on Nitro when you open too many TLS connections in a short time, for instance. The latter is a real issue I had when benchmarking Veracruz-Nitro, although I can't say if you can reproduce it on any Nitro machine.

It would be interesting, of course, to know what's going wrong there, but I would guess that it's not caused by getrandom(buf, buflen, 0) blocking, because, as I understand it, that call cannot work once and block later.

The build scripts are quite different as well (veracruz-project/rust-mbedtls#1).

That's a small change that upstream might want to use, isn't it, or do I misunderstand what you're referring to?

Hmm that sounds a little excessive. Is there a point in making a bug fix optional?

But it's not a bug fix; it's a work-around for a bug in another component, surely?

egrimley-arm avatar Aug 18 '22 11:08 egrimley-arm

It would be interesting, of course, to know what's going wrong there, but I would guess that it's not caused by getrandom(buf, buflen, 0) blocking, because, as I understand it, that call cannot work once and block later.

What matters here is whether getrandom() succeeds or not, which, as I understand it, is directly related to the state of the entropy pool.

That's a small change that upstream might want to use, isn't it, or do I misunderstand what you're referring to?

Yes that's what I'm referring to. If it does make it easier to update the vendor code then that would be interesting indeed.

But it's not a bug fix; it's a work-around for a bug in another component, surely?

AFAIU it's a bug/limitation in Nitro. Something that we can't fix directly. I haven't opened a ticket yet.

gbryant-arm avatar Aug 18 '22 12:08 gbryant-arm

What matters here is whether getrandom() succeeds or not, which, as I understand it, is directly related to the status of the entropy pool.

What exactly do you mean? Do you agree that it is impossible for getrandom(buf, buflen, 0) to succeed on one occasion but fail on a subsequent occasion?

AFAIU it's a bug/limitation in Nitro. Something that we can't fix directly. I haven't opened a ticket yet.

I think we perhaps could fix it by running something beforehand to initialise the kernel's RNG, but I don't know how hard that would be because I have never looked at the OS under Nitro: what init system there is and so on.

egrimley-arm avatar Aug 18 '22 12:08 egrimley-arm

Do you agree that it is impossible for getrandom(buf, buflen, 0) to succeed on one occasion but fail on a subsequent occasion?

I don't. As mentioned above, I've seen programs successfully getting RN until getrandom() failed, supposedly due to entropy shortage.

I think we perhaps could fix it by running something beforehand to initialise the kernel's RNG, but I don't know how hard that would be because I have never looked at the OS under Nitro: what init system there is and so on.

We could do that but that wouldn't fix the entropy shortage issue. The entropy pool must be refilled once in a while. The command line executed in Nitro is defined in runtime-manager/dockerdir/Dockerfile

gbryant-arm avatar Aug 18 '22 12:08 gbryant-arm

Do you agree that it is impossible for getrandom(buf, buflen, 0) to succeed on one occasion but fail on a subsequent occasion?

I don't. As mentioned above, I've seen programs successfully getting RN until getrandom() failed, supposedly due to entropy shortage.

The man page seems to state that that cannot happen. It seems rather unlikely that Nitro would have a kernel that has been hacked to make getrandom behave differently.

I assume we are still talking about getrandom(buf, buflen, 0); note the third argument.

I think we perhaps could fix it by running something beforehand to initialise the kernel's RNG, but I don't know how hard that would be because I have never looked at the OS under Nitro: what init system there is and so on.

We could do that but that wouldn't fix the entropy shortage issue. The entropy pool must be refilled once in a while.

Who should I believe, you or the man page?

Have you read the man page? Do you disagree with my interpretation of it? If you are quite sure that you have discovered a mismatch between the man page and the behaviour of the Linux kernel, particularly in such an important area, you should definitely report it to the kernel developers.

egrimley-arm avatar Aug 18 '22 12:08 egrimley-arm

The man page seems to state that that cannot happen

The man page mentions several times the case where "no random bytes are available", so I'd say that can happen. By the way, by "fail" I mean that getrandom() whether hangs for a long time or forever or doesn't return enough/any random bytes. Through that lens, the GRND_NONBLOCK flag is irrelevant to this discussion. I should also mention that on the Nitro machine failing, passing GRND_RANDOM or GRND_NONBLOCK to getrandom() doesn't seem to make any difference: getrandom() hangs regardless. Let me reproduce the experiment.

It seems rather unlikely that Nitro would have a kernel that has been hacked to make getrandom behave differently.

The man page (https://man7.org/linux/man-pages/man2/getrandom.2.html) describes the Linux specifications of the getrandom syscall, but in practice it is directly impacted by the state of the entropy pool, which is something that might have been hacked.

gbryant-arm avatar Aug 18 '22 13:08 gbryant-arm

It sounds as though we disagree about what the man page says, though you are not communicating very clearly:

The man page seems to state that that cannot happen

The man page mentions several times the case where "no random bytes are available", so I'd say that can happen.

Perhaps you are answering a different question from the one I asked? (What do you think "that" refers to?)

Of course, if you have seen getrandom(buf, buflen, GRND_NONBLOCK) block then clearly we are dealing with a very weird version of getrandom, which is possible: someone might have written some entirely different function and called it getrandom. Or perhaps there's a bug in an old version of some C library. You could check with strace, perhaps? getrandom(buf, buflen, GRND_NONBLOCK) blocking, despite the flag, would stick out like a sore thumb in the output from strace.

If getrandom on Nitro is totally weird/broken, then you should produce a test case and report it.

egrimley-arm avatar Aug 18 '22 13:08 egrimley-arm

See https://man7.org/linux/man-pages/man2/getrandom.2.html:

       By default, getrandom() draws entropy from the urandom source
       (i.e., the same source as the /dev/urandom device).  [...]

       If the urandom source has been initialized, reads of up to 256
       bytes will always return as many bytes as requested [...]

       If the urandom source has not yet been initialized, then
       getrandom() will block, unless GRND_NONBLOCK is specified in
       flags.

       [...]

       GRND_RANDOM
              If this bit is set, then random bytes are drawn from the
              random source (i.e., the same source as the /dev/random
              device) instead of the urandom source.  The random source
              is limited based on the entropy that can be obtained from
              environmental noise.

I take that to mean that getrandom without GRND_RANDOM is NOT "limited based on the entropy that can be obtained from environmental noise". But there is also other online information about the getrandom system call on Linux, for example the LWN article: https://lwn.net/Articles/606141/

If we can't agree on what getrandom does, or should do, then it's hard to know how we should proceed.

egrimley-arm avatar Aug 18 '22 13:08 egrimley-arm

Is this PR still relevant?

dreemkiller avatar Oct 21 '22 10:10 dreemkiller

Not anymore

gbryant-arm avatar Oct 21 '22 10:10 gbryant-arm