
Non-deterministic behavior of the hash embeddings update

Open · anonymous-pusher opened this issue · 2 comments

Hello and thank you for the great work.

I have a problem regarding reproducibility of training when using the CUDA implementation of the hash encoding. I noticed that between two runs initialized with the same seed, and with everything in torch set to deterministic:

    import os
    import random

    import numpy as np
    import torch

    seed_number = 0  # whatever seed the run uses

    # Force deterministic cuBLAS workspaces and Python hashing.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    os.environ["PYTHONHASHSEED"] = str(seed_number)

    # Deterministic cuDNN kernels and PyTorch algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

    # Seed every RNG in play.
    random.seed(seed_number)
    np.random.seed(seed_number)
    torch.manual_seed(seed_number)
    torch.cuda.manual_seed(seed_number)
    torch.cuda.manual_seed_all(seed_number)

After a backward pass, the update of the hash tables gives slightly different values for the embeddings. Freezing only the hash encoding leads to 100% reproducibility. Although the differences are small, they accumulate over the course of training and can lead to different results.
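A minimal sketch of the kind of setup where this shows up (assuming the tinycudann PyTorch bindings; the HashGrid config values and the `one_step` helper are only illustrative, not my actual training code):

    import torch
    import tinycudann as tcnn

    # Example HashGrid config -- placeholder values for illustration only.
    config = {
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 2.0,
    }

    def one_step(seed):
        # Identical seeding, identical inputs, one optimizer step.
        torch.manual_seed(seed)
        enc = tcnn.Encoding(n_input_dims=3, encoding_config=config)
        opt = torch.optim.Adam(enc.parameters(), lr=1e-2)
        x = torch.rand(4096, 3, device="cuda")
        enc(x).float().sum().backward()
        opt.step()
        return [p.detach().clone() for p in enc.parameters()]

    # Same seed twice, yet the updated hash-table entries can differ slightly.
    diffs = [(a - b).abs().max().item()
             for a, b in zip(one_step(0), one_step(0))]
    print(max(diffs))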

Is there a way to make the update deterministic, similar to what is possible in PyTorch?

Thank you

anonymous-pusher · Jul 24 '23

I am facing a similar issue. Any word of advice from @Tom94 would help.

ashutoshmishra1014 · Aug 17 '23

There's no feature in tiny-cuda-nn that makes this deterministic, sorry.

The reason the backward pass through the hash encoding is non-deterministic in the first place is that colliding training samples sum up their gradients concurrently using atomic add instructions, and the order of those additions depends on the order in which the GPU driver schedules its threads, over which we have no control. Mathematically, the order shouldn't matter, but floating-point addition is not associative, so there is a tiny difference that snowballs into larger discrepancies over multiple training iterations.
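As a small standalone illustration of the floating-point effect (using float32 as an example precision):

    import numpy as np

    # Three gradient contributions to the same hash-table slot, in float32.
    a, b, c = np.float32(1e-8), np.float32(1.0), np.float32(-1.0)

    print((a + b) + c)  # 0.0    -- `a` is rounded away when added to `b` first
    print(a + (b + c))  # 1e-08  -- `a` survives when `b` and `c` cancel first

    # Same numbers, different summation order, different result; with atomics,
    # the order is whatever the scheduler happens to produce on that run.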

One way to implement a deterministic backward pass would be to serialize the gradient sum. That is easier said than done and, even with a careful implementation, would likely be quite a bit slower than the current approach using atomics.
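For illustration only, a pure-PyTorch sketch of what serializing the gradient sum could look like (this is not how tiny-cuda-nn's kernels are written; `deterministic_scatter_sum` is a hypothetical helper): sort the colliding contributions by hash slot and sum each slot's segment in a fixed order instead of with atomics.

    import torch

    def deterministic_scatter_sum(values, indices, num_slots):
        # Fix the order of colliding contributions by sorting them by slot.
        order = torch.argsort(indices, stable=True)
        values, indices = values[order], indices[order]
        out = torch.zeros(num_slots, dtype=values.dtype, device=values.device)
        # Sum each slot's contributions in a fixed order -- no atomics, so the
        # result is reproducible run to run, at the cost of a serial loop.
        uniq, counts = torch.unique_consecutive(indices, return_counts=True)
        for slot, chunk in zip(uniq.tolist(), torch.split(values, counts.tolist())):
            out[slot] = chunk.sum()
        return out

    # Example: three samples collide in slot 2, one lands in slot 0.
    grads = torch.tensor([0.1, 0.2, 0.3, 0.4])
    slots = torch.tensor([2, 0, 2, 2])
    print(deterministic_scatter_sum(grads, slots, num_slots=4))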

Tom94 · Sep 26 '23