GPT-NeoX, GPT-J and BLOOM do not produce consistent results for the integration test
Still need to narrow this down, but I suspect that
- Windows x86-64 6 threads (what I generated with)
- Linux x86-64 1 thread (what the tests run with)
- macOS M1 1 thread (what I'm testing with now)
all return different results for these models. What amazes me is that LLaMA and MPT *do* work! Are these deterministic under the right conditions? Should we just regenerate the expected values for the CI environment and hope for the best?
I've worked around this by disabling the relevant equality test in https://github.com/rustformers/llm/commit/c188a10fc41fea50472f16e2e5248c8daee3920c (so that it just attempts inference and doesn't check the result), but it would be nice to get to the bottom of this.
Just a stab in the dark (I don't know how this crate works, just lurking), but could the equality tests be comparing floats to too many digits of precision? The differences between test runs could just be accumulated floating-point error in the low-order digits of those numbers.
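To illustrate what I mean, here's a toy Rust snippet (completely unrelated to this crate's actual code) showing how merely changing the order of a floating-point reduction, which is effectively what running with a different thread count does, can shift the result slightly:

```rust
// Toy illustration (not from this crate): summing the same f32 values in a
// different order, as a parallel reduction with a different thread count
// would, can produce results that differ in the low bits. Over many
// multiply-adds that error can accumulate.
fn main() {
    let values: Vec<f32> = (0..10_000).map(|i| (i as f32).sin() * 1e-3).collect();

    // Sequential sum: roughly what a 1-thread reduction does.
    let sequential: f32 = values.iter().sum();

    // Chunked sum: each "thread" sums its own slice, then the partial sums
    // are combined. Roughly what a 6-thread reduction does.
    let chunked: f32 = values
        .chunks(values.len() / 6)
        .map(|chunk| chunk.iter().sum::<f32>())
        .sum();

    println!(
        "sequential = {sequential}, chunked = {chunked}, diff = {}",
        sequential - chunked
    );
}
```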
Hey, thanks for the suggestion! Unfortunately, we check that the inferred output under most-probable sampling matches, not the raw logits: https://github.com/rustformers/llm/blob/main/binaries/llm-test/configs/llama.json#L9
Our logit calculations for these models do not appear to be consistent between platforms at all - if it were just a numerical precision issue, the same tokens should still be selected, but that doesn't seem to be the case :(
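For reference, the check boils down to something like the following hypothetical sketch (the names are made up; the real logic lives in `binaries/llm-test` and looks different). We greedily decode one token per step and compare token IDs, so small logit perturbations would only matter if they flipped the argmax at some step:

```rust
// Hypothetical sketch of what the equality check amounts to; not the actual
// implementation in binaries/llm-test.
fn greedy_token(logits: &[f32]) -> usize {
    // "Most-probable sampling": pick the index of the largest logit.
    logits
        .iter()
        .copied()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.total_cmp(b))
        .map(|(i, _)| i)
        .unwrap()
}

fn check_inference(per_step_logits: &[Vec<f32>], expected_tokens: &[usize]) -> bool {
    // A pure precision issue would only change the outcome here if it were
    // large enough to change which logit is the maximum.
    per_step_logits
        .iter()
        .map(|logits| greedy_token(logits))
        .eq(expected_tokens.iter().copied())
}

fn main() {
    let per_step_logits = vec![vec![0.1_f32, 2.3, -0.5], vec![1.0, 0.9, 0.95]];
    let expected_tokens = [1, 0];
    assert!(check_inference(&per_step_logits, &expected_tokens));
}
```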
After playing around with GPU acceleration, I believe that the inference code for these models has some errors and accesses uninitialized memory somewhere, meaning the results are slightly corrupted by random values.