Llama3.1-405B non-zero temperature
In the reference implementation of Llama3.1-405B, temperature is set to 1. Is this intentional?
Normally, temperature should be set to zero for outputs to be more deterministic. Can an optimized implementation use a different temperature?
N.B.: We are aware of an open vLLM issue due to which setting temperature to zero still results in non-determinism. We may need to recalibrate the reference accuracy for the next round (v5.1).
Inference WG 18/Feb/2025: multiple parties ran the reference implementation and obtained identical results. This may be a by-product of top_k=1. Optimized submissions should use the same parameters.
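For context, here is a minimal sketch of how these sampling parameters can be expressed with vLLM's SamplingParams. This is not the reference implementation itself; the model name, prompt, and max_tokens below are illustrative assumptions. The point it shows is that with top_k=1 only the single highest-probability token is ever kept, so the temperature value does not change which token is emitted.

```python
# Illustrative sketch only -- not the MLPerf reference code. Model name,
# prompt, and max_tokens are placeholder assumptions.
from vllm import LLM, SamplingParams

# With top_k=1 the sampler keeps only the single most likely token, so the
# temperature value (1.0 here, as in the reference) has no effect on which
# token is selected.
params = SamplingParams(temperature=1.0, top_k=1, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-405B-Instruct")
outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```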
Does a temperature of 0 result in non-determinism?
A common case of confusion is if a temperature of 0 generates non-deterministic replies. In theory, yes. In practice, no.
As noted by this OpenAI Forum Thread, achieving non-determinism is impossible. A temperature of 0 does force the SoftMax function to choose the most likely response—which is the definition of greedy sampling and is non-deterministic. However, LLMs are not run in a vacuum; race conditions of multi-threaded code impact the established likelihoods of tokens. Consequently, while temperature reduces randomness to a minimum, it doesn't eliminate it.
However, the randomness is minimized to the extent that developers can expect near non-determinism. For most queries that specify the structure of the expected output, this reduction in randomness is sufficient.
Am I alone in thinking that by "non-determinism" the authors of this article actually mean "determinism"? Surely, "the randomness is minimized" should mean "near determinism"?
It's like flammable vs inflammable.
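To make the terminology concrete, below is a minimal sketch (plain PyTorch, not the reference code; the tensors are made-up examples) of what temperature-0 / greedy decoding means: the next token is the argmax of the logits, so given bit-identical logits the output is fully deterministic. The residual non-determinism people observe comes from the logits themselves varying slightly across runs (non-deterministic kernels, dynamic batching), which can flip the argmax when two candidates are nearly tied.

```python
# Illustrative sketch of greedy (temperature-0) decoding, not the reference
# implementation. The logit values are made-up examples.
import torch

def greedy_next_token(logits: torch.Tensor) -> int:
    # Temperature 0 degenerates to picking the highest-scoring token:
    # softmax(logits / T) concentrates all mass on the argmax as T -> 0.
    return int(torch.argmax(logits, dim=-1))

# Given bit-identical logits, the choice is fully deterministic...
logits = torch.tensor([2.0, 5.0, 4.999])
assert greedy_next_token(logits) == 1

# ...but a tiny numerical perturbation (e.g. from non-deterministic kernels
# or different batching) can flip the argmax when candidates are nearly tied.
perturbed = logits + torch.tensor([0.0, -0.002, 0.0])
print(greedy_next_token(perturbed))  # 2, not 1
```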