sgx-lkl Need sensible default for number of ethreads

Problem

The current default for the number of ethreads is 1, as defined here. This is not a good idea, as by default there would be no true parallelism when executing an application.

Solution

We should decide on what the right default should be.

Before the EEID support, the number of ethreads defaulted to the number of (virtual) CPU cores when the launcher is executed. This had the advantage that applications executed with the maximum level of parallelism.

Setting the default dynamically, though, makes attestation less predictable, because the precise enclave_config then depends on the host on which SGX-LKL is launched.

A conservative alternative would be to pick a larger default value of, say, 4, but this would still require users to increase it with larger CPU core counts.
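To make the trade-off concrete, here is a minimal Python sketch (not SGX-LKL code). It assumes the attested configuration can be modelled as a hash over a canonical JSON encoding; the field name `ethreads` and the helpers `build_config` and `config_digest` are hypothetical and only stand in for the real enclave_config and its measurement.

```python
import hashlib
import json


def config_digest(config: dict) -> str:
    """Hash a canonical JSON encoding of the config (stand-in for an attested measurement)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_config(ethreads: int) -> dict:
    # "ethreads" is a hypothetical field name used for illustration only.
    return {"ethreads": ethreads}


# Fixed default: every host produces the same digest, so a verifier can precompute it.
fixed_on_host_a = config_digest(build_config(4))
fixed_on_host_b = config_digest(build_config(4))

# Host-derived default: the digest follows the core count of the launching host.
dynamic_on_8_core_host = config_digest(build_config(8))
dynamic_on_16_core_host = config_digest(build_config(16))

print(fixed_on_host_a == fixed_on_host_b)                 # True
print(dynamic_on_8_core_host == dynamic_on_16_core_host)  # False
```

With a host-derived default, a verifier would need to know the core count of the launching machine before it could predict the digest.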

(@wintersteiger @jxyang @letmaik, do you have any views on this?)

Aug 04 '20 06:08 prp

This number is part of the attested config, i.e. it can't possibly depend on the host, because that would be unpredictable for anyone who wants to verify a quote later. I don't think we need a default that is in any way optimal; users can easily change this (in the config), so IMHO the default of 1 is perfectly fine.

I don't think it's a requirement for the host to actually start up exactly as many threads as the num_tcs setting (we have to expect the host to be malicious anyways). So, theoretically we could just pick a large number (but not too large to be a memory problem), and let the host pick its favorite number. I'm not in favor of this though, it feels like it might get us into a grey area of security problems :-)

Aug 04 '20 09:08 wintersteiger

> This number is part of the attested config, i.e. it can't possibly depend on the host, because that would be unpredictable for anyone who wants to verify a quote later. I don't think we need a default that is in any way optimal; users can easily change this (in the config), so IMHO the default of 1 is perfectly fine.

I think that a default of 1 is a bad idea, as most users will leave this unchanged, and then be surprised that the performance of their application is very poor.

As long as a user understands the host environment where their SGX-LKL instance runs, they will know the number of CPU cores. This means that they can predict the enclave_config. If a user is uncomfortable with that, they can always fix the number of ethreads to a particular value.

> I don't think it's a requirement for the host to actually start up exactly as many threads as the num_tcs setting (we have to expect the host to be malicious anyways). So, theoretically we could just pick a large number (but not too large to be a memory problem), and let the host pick its favorite number. I'm not in favor of this though, it feels like it might get us into a grey area of security problems :-)

Picking a number of ethreads that is larger than the number of CPU cores will degrade performance quite quickly. Keep in mind that any context switch between ethreads done by the host OS will result in an AEX and thus incur a high overhead.

Aug 04 '20 09:08 prp

> then be surprised...

If a user doesn't understand that their application is multi-threaded... :-)

> As long as a user understands...

They may not know what type of VM their container is launched on, especially when their containers are launched as part of higher level workflows. I think it's best to always have them fix a number that works for them.

> degrade performance quite quickly.

You read my statement back to front. We can fix the number of TCSs to a large number, but have the host start a smaller number of threads. If OE's limit is 32, always pick 32, but the host can choose to run only 4 and leave the rest unused.

Aug 04 '20 10:08 wintersteiger

> They may not know what type of VM their container is launched on, especially when their containers are launched as part of higher level workflows. I think it's best to always have them fix a number that works for them.

I don't think that there is a way around this: if a user wants to have good performance, and they want the number of ethreads to be an attested setting, they need to know the VM type that they will run on.

> degrade performance quite quickly.

> You read my statement back to front. We can fix the number of TCSs to a large number, but have the host start a smaller number of threads. If OE's limit is 32, always pick 32, but the host can choose to run only 4 and leave the rest unused.

Ok, but this means that the number of ethreads is no longer an attested setting? Since it determines the actual concurrency inside the enclave, this would come with security implications.

We should think of num_ethreads and num_TCS as two separate settings. The TCS setting simply gives an upper bound on true concurrency, so picking a large default here would be fine.
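As a rough illustration of treating the two settings separately, the following Python sketch models num_tcs as a fixed upper bound and lets the host derive the number of ethreads from its core count. The function `host_ethreads` is hypothetical, and the value 32 is taken only from the "if OE's limit is 32" example in this thread, not from any confirmed limit.

```python
def host_ethreads(num_tcs: int, host_cpus: int) -> int:
    """Number of ethreads a host would start: one per core, capped by the TCS count."""
    if num_tcs < 1 or host_cpus < 1:
        raise ValueError("num_tcs and host_cpus must be positive")
    return min(num_tcs, host_cpus)


# Assumed upper bound, borrowed from the "if OE's limit is 32" example above.
NUM_TCS = 32

for cpus in (4, 32, 64):
    # A 4-core host runs 4 ethreads; a 64-core host is capped at 32.
    print(cpus, host_ethreads(NUM_TCS, cpus))
```

Under this model the configuration stays identical across hosts, while the actual level of concurrency is left to the host, which is exactly the point questioned in the discussion.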

Aug 04 '20 12:08 prp

Yes, that's right, num_tcs would be an upper bound (currently num_tcs == num_ethreads).

Actually, do we want the number of ethreads to be an attested setting? This depends entirely on the malicious host, which can start as few or as many ethreads as it wants. We don't have any way to enforce that in the enclave, do we?

Aug 04 '20 12:08 wintersteiger

I think that we can check the number of ethreads, as they all need to enter the enclave at the start. Since each should occupy a separate TCS, the enclave can distinguish them and fail if it doesn't get the expected number. Obviously the enclave cannot check if they are executing concurrently, but it gets an upper bound on the level of concurrency that is possible.

I am worried about attacks in which a host runs an enclave application with an unusually large number of ethreads to trigger rare race conditions.
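The entry check described above could be modelled roughly as follows. This is a toy Python sketch, not enclave code: the class `EthreadEntryCheck` and its methods are hypothetical, and it assumes each ethread can be identified by the index of the TCS it occupies.

```python
class EthreadEntryCheck:
    """Toy model: track which TCS slots have been entered and compare with the expected count."""

    def __init__(self, expected_ethreads: int, num_tcs: int):
        if not 1 <= expected_ethreads <= num_tcs:
            raise ValueError("expected_ethreads must be between 1 and num_tcs")
        self.expected_ethreads = expected_ethreads
        self.num_tcs = num_tcs
        self.entered = set()

    def enter(self, tcs_index: int) -> None:
        # Each ethread is assumed to occupy its own TCS, so the index identifies it.
        if not 0 <= tcs_index < self.num_tcs:
            raise ValueError("unknown TCS index")
        if tcs_index in self.entered:
            raise RuntimeError("TCS already occupied")
        if len(self.entered) >= self.expected_ethreads:
            raise RuntimeError("more ethreads entered than expected")
        self.entered.add(tcs_index)

    def ready(self) -> bool:
        # The enclave would proceed only once exactly the expected number has entered.
        return len(self.entered) == self.expected_ethreads


check = EthreadEntryCheck(expected_ethreads=2, num_tcs=4)
check.enter(0)
print(check.ready())  # False
check.enter(1)
print(check.ready())  # True
```

As noted in the comment, such a check only bounds the possible concurrency; it says nothing about whether the entered threads actually run in parallel.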

Aug 04 '20 12:08 prp

That's a good idea, we could use that upper bound to enforce the maximum (num_tcs).

What would happen if multiple malicious host-threads share one TCS so that the enclave can't detect them? Is there any way for them to trigger enclave behavior that a single thread can't trigger?

Aug 04 '20 12:08 wintersteiger

We have done a threat analysis of TCS on the OE side. There was no clear conclusion on how a malicious host could exploit that, with DoS out of scope.

I'd prefer not to have num_tcs in the attested settings. It's hard for me to see how changing the tcs# could make the enclave application more or less secure.

Aug 06 '20 00:08 jxyang

Personally, I think the default should be the same as the number of CPUs on the machine. However, if this is part of the attestation and the enclave may run on machines with different numbers of cores, then I can understand the concern. If a user adds it to the configuration, I would think that it should be part of the attestation. If it is not added to the configuration, then we should default to ethreads=cpus and it should not be part of the attestation.
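The policy suggested in this comment can be sketched in a few lines of Python. The function `resolve_ethreads` and its return shape are hypothetical and only illustrate the rule "explicit value is attested, missing value falls back to the CPU count and is not attested".

```python
from typing import Optional, Tuple


def resolve_ethreads(configured: Optional[int], host_cpus: int) -> Tuple[int, bool]:
    """Return (ethreads, attested) for the policy proposed in the comment above."""
    if host_cpus < 1:
        raise ValueError("host_cpus must be positive")
    if configured is not None:
        if configured < 1:
            raise ValueError("configured ethreads must be positive")
        # The user set the value explicitly, so it becomes part of the attestation.
        return configured, True
    # No explicit value: fall back to the host's CPU count and leave it unattested.
    return host_cpus, False


print(resolve_ethreads(4, 16))     # (4, True)
print(resolve_ethreads(None, 16))  # (16, False)
```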

Aug 06 '20 16:08 paulcallen
