Accuracy is low for examples/train_math_net with cuda

Open npuichigo opened this issue 1 year ago • 5 comments

cargo run --release --features cuda
Iter 20649 Loss: 6.49 Acc: 0.17
Iter 20650 Loss: 6.48 Acc: 0.17
Iter 20651 Loss: 6.49 Acc: 0.17
Iter 20652 Loss: 6.48 Acc: 0.17
Iter 20653 Loss: 6.49 Acc: 0.17
Iter 20654 Loss: 6.48 Acc: 0.17
Iter 20655 Loss: 6.47 Acc: 0.17
Iter 20656 Loss: 6.48 Acc: 0.17
Iter 20657 Loss: 6.48 Acc: 0.17
Iter 20658 Loss: 6.48 Acc: 0.17
Iter 20659 Loss: 6.48 Acc: 0.17
Iter 20660 Loss: 6.48 Acc: 0.17
Iter 20661 Loss: 6.47 Acc: 0.17
Iter 20662 Loss: 6.47 Acc: 0.17
Iter 20663 Loss: 6.47 Acc: 0.17

npuichigo avatar May 05 '24 07:05 npuichigo

Agreed, I'm seeing the same thing. Will fix.

jafioti avatar May 05 '24 14:05 jafioti

I've added a small PR with a temporary fix, which may hint at where CUDA training in general is going wrong.

swfsql avatar Jun 24 '24 17:06 swfsql

Hmm, very interesting: your changes trigger a copy-back of the data to the CPU rather than keeping it on the GPU. I wonder why that makes it accurate. Sorry I haven't gotten around to looking at this in depth; I'll have time this weekend to check it out, and access to a CUDA machine.

jafioti avatar Jun 25 '24 03:06 jafioti

Could it be that the initial CudaCopyToDevice calls (made at the start of every iteration) are always overwriting the latest GPU weight values with the (static, initial) CPU weight values?

swfsql avatar Jun 25 '24 15:06 swfsql
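
For illustration, here is a minimal Rust sketch of that hypothesis (the names `initial_host_weights` and `device_weights` are hypothetical stand-ins, not luminal's actual API). If the upload really does run every iteration with the original host values, the device-side weights can never accumulate updates:

```rust
// Minimal sketch of the hypothesized failure mode (hypothetical names,
// not luminal's actual API): a host->device copy of the *initial* weights
// runs at the start of every iteration and clobbers whatever the previous
// iteration trained on the device, so the loss never improves.
fn main() {
    let initial_host_weights = vec![1.0_f32; 4]; // static CPU-side values
    let mut device_weights = vec![0.0_f32; 4];   // stand-in for a CUDA buffer

    for iter in 0..3 {
        // Hypothesized CudaCopyToDevice at the start of each iteration:
        // it always uploads the same initial CPU values.
        device_weights.copy_from_slice(&initial_host_weights);

        // The training step only updates the device-side copy.
        for w in &mut device_weights {
            *w -= 0.1;
        }

        // Every iteration ends with identical weights, because the update
        // from the previous iteration was overwritten by the upload.
        println!("iter {iter}: {device_weights:?}"); // always [0.9, 0.9, 0.9, 0.9]
    }
}
```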

I don't think so. Ops don't get run if their destination tensor has already been produced, so the copy to device shouldn't run again as long as the CUDA buffers weren't getting deleted first.

jafioti avatar Jun 28 '24 07:06 jafioti
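
As a rough illustration of that scheduling rule (a simplified stand-in, not luminal's real executor): an op can be skipped whenever its destination tensor already exists, which is why the copy-to-device should stay a no-op unless its output buffer gets deleted between iterations.

```rust
use std::collections::HashSet;

// Simplified sketch of "ops don't run if the destination tensor is already
// produced" (not the actual luminal executor). The copy op only runs again
// if its output buffer was dropped in the meantime.
struct Op {
    name: &'static str,
    output: &'static str, // id of the tensor this op produces
}

fn run_graph(ops: &[Op], produced: &mut HashSet<&'static str>) {
    for op in ops {
        if produced.contains(op.output) {
            println!("skip {}", op.name); // destination already exists
            continue;
        }
        println!("run  {}", op.name);
        produced.insert(op.output);
    }
}

fn main() {
    let ops = [
        Op { name: "CudaCopyToDevice(weights)", output: "weights_gpu" },
        Op { name: "MatMul(weights_gpu, x)", output: "logits" },
    ];
    let mut produced: HashSet<&'static str> = HashSet::new();

    run_graph(&ops, &mut produced); // first iteration: both ops run
    produced.remove("logits");      // intermediate results are cleared each step
    run_graph(&ops, &mut produced); // copy is skipped while weights_gpu survives
}
```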