Llama3.1-405B non-zero temperature
In the reference implementation of Llama3.1-405B, temperature is set to 1. Is this intentional?
Normally, temperature should be set to zero for outputs to be more deterministic. Can an optimized implementation use a different temperature?
N.B.: We are aware of an open vLLM issue due to which setting temperature to zero still results in non-determinism. We may need to recalibrate the reference accuracy for the next round (v5.1).
Inference WG 18/Feb/2025: multiple parties ran the reference implementation and obtained identical results. This may be a by-product of top_k=1. Optimized submissions should use the same parameters.
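For context, here is a minimal sketch of how these sampling parameters can be expressed with vLLM's SamplingParams. This is not the reference implementation itself; the model name, prompt, and max_tokens below are illustrative assumptions. The point it shows is that with top_k=1 only the single highest-probability token is ever kept, so the temperature value does not change which token is emitted.

```python
# Illustrative sketch only -- not the MLPerf reference code. Model name,
# prompt, and max_tokens are placeholder assumptions.
from vllm import LLM, SamplingParams

# With top_k=1 the sampler keeps only the single most likely token, so the
# temperature value (1.0 here, as in the reference) has no effect on which
# token is selected.
params = SamplingParams(temperature=1.0, top_k=1, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-405B-Instruct")
outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```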
Does a temperature of 0 result in non-determinism?
A common case of confusion is if a temperature of 0 generates non-deterministic replies. In theory, yes. In practice, no.
As noted by this OpenAI Forum Thread, achieving non-determinism is impossible. A temperature of 0 does force the SoftMax function to choose the most likely response—which is the definition of greedy sampling and is non-deterministic. However, LLMs are not run in a vacuum; race conditions of multi-threaded code impact the established likelihoods of tokens. Consequently, while temperature reduces randomness to a minimum, it doesn't eliminate it.
However, the randomness is minimized to the extent that developers can expect near non-determinism. For most queries that specify the structure of the expected output, this reduction in randomness is sufficient.
Am I alone in thinking that by "non-determinism" the authors of this article actually mean "determinism"? Surely, "the randomness is minimized" should mean "near determinism"?
It's like flammable vs inflammable.
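To make the terminology concrete, below is a minimal sketch (plain PyTorch, not the reference code; the tensors are made-up examples) of what temperature-0 / greedy decoding means: the next token is the argmax of the logits, so given bit-identical logits the output is fully deterministic. The residual non-determinism people observe comes from the logits themselves varying slightly across runs (non-deterministic kernels, dynamic batching), which can flip the argmax when two candidates are nearly tied.

```python
# Illustrative sketch of greedy (temperature-0) decoding, not the reference
# implementation. The logit values are made-up examples.
import torch

def greedy_next_token(logits: torch.Tensor) -> int:
    # Temperature 0 degenerates to picking the highest-scoring token:
    # softmax(logits / T) concentrates all mass on the argmax as T -> 0.
    return int(torch.argmax(logits, dim=-1))

# Given bit-identical logits, the choice is fully deterministic...
logits = torch.tensor([2.0, 5.0, 4.999])
assert greedy_next_token(logits) == 1

# ...but a tiny numerical perturbation (e.g. from non-deterministic kernels
# or different batching) can flip the argmax when candidates are nearly tied.
perturbed = logits + torch.tensor([0.0, -0.002, 0.0])
print(greedy_next_token(perturbed))  # 2, not 1
```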